Addie Addie - 5 months ago 47
SQL Question

How to Reference First Non-null String in a Column - Cloudera Impala / Apache Hive / Spark SQL

I am using Impala SQL. I currently have a database with 3 columns:

Account
,
Date
,
Type
.

Under
Type
there are various data strings describing the associated type, but some are equal to
'UNKNOWN'
and some are
null
.

I'd like to create another column
Fixed_Type
. The values in
Fixed_Type
should come from the
Type
column.


  • If the value in
    Type
    is either
    null
    or
    'UNKNOWN'
    , it should get last valid value in the
    Type
    column, partitioned by account and ordered by date.

  • If the partition begins with
    null
    or
    'UNKNOWN'
    , then the value in
    Fixed_Type
    should be the first valid value from
    Type
    .



For example:

Account | Date | Type | Fixed_Type
1 Jan data1 data1
1 Feb 'UNKNOWN' data1
1 Mar null data1
2 Apr data2 data2
2 May null data2
2 Jun null data2
2 Jul data3 data3
3 Feb 'UNKNOWN' data4
3 Mar 'UNKNOWN' data4
3 Apr data4 data4


I started doing this in Oracle but then realized there is no functionality analogous to
IGNORE NULLS
implemented in Impala.

This is what I was thinking to do in Oracle (I realize this only handles forward fill on nulls):

select account, date, type,
case when type is null
then last_value(type ignore nulls)
over (partition by account order by date)
else type
end as fixed_type

MT0 MT0
Answer

Oracle Setup:

CREATE TABLE Table_Name ( Acct, Dt, Type ) AS
SELECT 1, DATE '2016-01-01', 'Data1'   FROM DUAL UNION ALL
SELECT 1, DATE '2016-02-01', 'UNKNOWN' FROM DUAL UNION ALL
SELECT 1, DATE '2016-03-01', NULL      FROM DUAL UNION ALL
SELECT 2, DATE '2016-04-01', 'Data2'   FROM DUAL UNION ALL
SELECT 2, DATE '2016-05-01', NULL      FROM DUAL UNION ALL
SELECT 2, DATE '2016-06-01', NULL      FROM DUAL UNION ALL
SELECT 2, DATE '2016-07-01', 'Data3'   FROM DUAL UNION ALL
SELECT 3, DATE '2016-02-01', 'UNKNOWN' FROM DUAL UNION ALL
SELECT 3, DATE '2016-03-01', 'UNKNOWN' FROM DUAL UNION ALL
SELECT 3, DATE '2016-04-01', 'Data4'   FROM DUAL;

Query:

SELECT Acct,
       Dt,
       Type,
       Fixed_Type
FROM   (
  SELECT r.Acct,
         r.Dt,
         r.Type,
         t.type AS fixed_type,
         ROW_NUMBER() OVER ( PARTITION BY r.Acct, r.dt
                             ORDER BY SIGN( ABS( t.dt - r.dt ) ),
                                      SIGN( t.dt - r.dt ),
                                      ABS( t.dt - r.dt ) ) AS rn
  FROM   table_name r
         LEFT OUTER JOIN
         table_name t
         ON (    r.acct = t.acct
             AND t.type IS NOT NULL
             AND t.type <> 'UNKNOWN' )
)
WHERE   rn = 1
ORDER BY acct, dt;

Explanation:

If you join the table to itself so both tables have the same account number then you can compare each row per account with all the other rows in that same account. However, we aren't interested in comparing to all the rows but just to the rows that aren't NULL or 'UNKNOWN' so we get the join condition:

ON (    r.acct = t.acct
    AND t.type IS NOT NULL
    AND t.type <> 'UNKNOWN' )

A LEFT OUTER JOIN is used just in case there is an account number that has all NULL or 'UNKNOWN' values for its type so that rows aren't excluded.

Then it is a matter of finding the row that is most recent. In Oracle, if you subtract one date from another then you get the number of days (or fraction of days) difference - so:

  • SIGN( ABS( t.dt - r.dt ) ) will give 0 if the two dates are identical or 1 if they are different. Ordering by this first means that if there is a value with the same date then it will be preferred over non-identical dates;
  • SIGN( t.dt - r.dt ) will return 0 if the two dates are identical (but that's already been filtered on in the previous statement) or -1 if the compared date is before the current row or +1 if it is after - this is used to prefer a before date to an after date.
  • ABS( t.dt - r.dt ) will order the dates by closest together.

So the ORDER BY clause effectively states: ORDER BY identical dates first, then dates before (closest to r.dt first) and finally dates after (closest to r.dt first).

Then that is all placed in an in-line view and filtered to get the best match for each row (WHERE rn = 1).

Output:

      ACCT DT                  TYPE    FIXED_TYPE
---------- ------------------- ------- ----------
         1 2016-01-01 00:00:00 Data1   Data1      
         1 2016-02-01 00:00:00 UNKNOWN Data1      
         1 2016-03-01 00:00:00         Data1      
         2 2016-04-01 00:00:00 Data2   Data2      
         2 2016-05-01 00:00:00         Data2      
         2 2016-06-01 00:00:00         Data2      
         2 2016-07-01 00:00:00 Data3   Data3      
         3 2016-02-01 00:00:00 UNKNOWN Data4      
         3 2016-03-01 00:00:00 UNKNOWN Data4      
         3 2016-04-01 00:00:00 Data4   Data4      
Comments