DavidJ DavidJ - 1 month ago 18
SQL Question

Impute via fill-forward/LOCF a column over a range of sequential rows in SQL?

When formatting data for time-series analysis, a common need is to impute missing values by filling-forward values over time (also called Last-Observation-Carried-Forward / LOCF ).

While data-analysis environments typically provide that functionality (e.g. Pandas

fillna()
), for larger data-sets it can be more efficient to compute it in SQL (e.g. utilizing a data-parallel data-warehouse appliance).

For example, consider:

| UNIT | TIME | VALUE |
|------|------|-------|
| 1 | 1 | NULL |
| 1 | 2 | .5 |
| 1 | 3 | NULL |
| 1 | 4 | NULL |
| 1 | 5 | .2 |
| 1 | 6 | NULL |
| 2 | 1 | .6 |
| 2 | 2 | NULL |


which, after filling-forward the VALUE column over TIME (independently for each UNIT) yields:

| UNIT | TIME | VALUE |
|------|------|-------|
| 1 | 1 | NULL |
| 1 | 2 | .5 |
| 1 | 3 | .5 |
| 1 | 4 | .5 |
| 1 | 5 | .2 |
| 1 | 6 | .2 |
| 2 | 1 | .6 |
| 2 | 2 | .6 |


(Notice the initial NULL for UNIT 1 can't be imputed as there is no prior value)

Time could also be a timestamp or datetime type column.

Answer

The following query structure will achieve fill-forward if using a PostgreSQL flavoured SQL dialect (e.g. Netezza PureData) for a datetime index (assuming past data). It will also work for multi-column index/keys.

Given the following parameters:

  • <key_cols> - list of columns uniquely identifying each time-series sample (e.g. UNIT, TIME )
  • <impute_col> - column in which values need to be imputed (e.g. VALUE )
  • <impute_over_range_col> - the sequential range column for the time-series (e.g. TIME)

and deriving:

  • <keys_no_range> - key columns except for <impute_over_range_col>

SELECT DISTINCT T1.<key_cols>, 
                COALESCE(T1.<impute_col>, T2.<impute_col>) AS <impute_col>
FROM table T1
LEFT OUTER JOIN (SELECT T1.<key_cols>,
                     T1.<impute_col>,
                     LEAD(T1.<impute_over_range_col>,1) 
                         OVER (PARTITION BY T1.<keys_no_range> 
                               ORDER BY T1.<key_cols>)
                         AS NEXT_RANGE
                     FROM table T1
                     WHERE T1.<impute_col> IS NOT NULL
                     ORDER BY T1.<key_cols>
                ) T2
              ON (T1.<impute_over_range_col> BETWEEN T2.<impute_over_range_col> 
                                             AND COALESCE(NEXT_RANGE, CURRENT_DATE)) 
              AND T1.<keys_no_range>[0] = T2.<keys_no_range>[0]
              AND T1.<keys_no_range>[1] = T2.<keys_no_range>[1]
              -- ... for each col in <keys_no_range>

Concretely, for the example in the question:

SELECT DISTINCT T1.UNIT, T1.TIME, 
                COALESCE(T1.VALUE, T2.VALUE) AS VALUE
FROM table T1
LEFT OUTER JOIN (SELECT T1.UNIT, T1.TIME,
                     T1.VALUE,
                     LEAD(T1.TIME,1) 
                         OVER (PARTITION BY T1.UNIT 
                               ORDER BY T1.UNIT, T1.TIME)
                         AS NEXT_RANGE
                     FROM table T1
                     WHERE T1.VALUE IS NOT NULL
                     ORDER BY T1.UNIT, T1.TIME
                ) T2
              ON (T1.TIME BETWEEN T2.TIME
                           AND COALESCE(NEXT_RANGE, CURRENT_DATE)) 
              AND T1.UNIT = T2.UNIT

Here is an SQLFiddle of the above query: http://sqlfiddle.com/#!15/d589b/1

Comments