I am writing an ETL in kettle pentaho to create a table from various sources including google analytics.
so Table 1 = All data from website joined to google analytics information
Table 2 = All Duplicate data from Table 1 joined to google analytics information
my problem is that some info on table 1 has the google analytics information missing but table 2 shows some data for Google Analytics on the same reference_number
So what I want to do is lookup [reference_number] from table 1 to table 2 and populate table 1 where some columns are null from info on table 2
Quick example EDIT*
Table 1 (Main Table) * *This table has an index built in on website_reference number (Unique)*
website_Reference_number GA_info_1 GA_info_2
A1 null null
A2 x y
Table 2 (Duplicates from Table 1)
eventlabel GA_info_1 GA_info_2
A1 z z
A2 x y
Table 1 (Main Table)
Ref_number GA_info_1 GA_info_2
A1 z z
A2 x y
UPDATE mytable LEFT JOIN mytable AS table2 ON mytable.Ref_number = table2.Ref_number SET mytable.GA_info_1 = COALESCE ( mytable.GA_info_1, table2.GA_info_1 ), mytable.GA_info_2 = COALESCE ( mytable.GA_info_2, table2.GA_info_2 ) WHERE mytable.GA_info_1 IS NULL OR mytable.GA_info_2 IS NULL
Put all the fields which might be null into the where clause.
If the field is not null it will not be updated because it is the first argument in the
coalesce function, if it is null it will be updated by the field of the other table.
edit: Also you can try it like this:
UPDATE mytable INNER JOIN mytable AS table2 ON mytable.Ref_number = table2.Ref_number SET mytable.GA_info_1 = COALESCE ( mytable.GA_info_1, table2.GA_info_1 ), mytable.GA_info_2 = COALESCE ( mytable.GA_info_2, table2.GA_info_2 ) WHERE CONCAT(mytable.GA_info_1, mytable.GA_info_2) IS NULL
For the performance issue: (as already mentioned in the comments)
Since you are not using the primary or foreign keys to join the tables, you would have to set an index on your Ref_number columns to speed up the join.