Haroon Rashid, 3 years ago
R Question

Handle Continuous Missing Values in Time-Series Data

I have time-series data as shown below.

2015-04-26 23:00:00 5704.27388916015661380
2015-04-27 00:00:00 4470.30868326822928793
2015-04-27 01:00:00 4552.57241617838553793
2015-04-27 02:00:00 4570.22250032825650123
2015-04-27 03:00:00 NA
2015-04-27 04:00:00 NA
2015-04-27 05:00:00 NA
2015-04-27 06:00:00 12697.37724086216439900
2015-04-27 07:00:00 5538.71119009653739340
2015-04-27 08:00:00 81.95060647328695325
2015-04-27 09:00:00 8550.65816895300667966
2015-04-27 10:00:00 2925.76573206583680076


How should I handle continuous (consecutive) NA values? When there is only a single NA, I take the average of the two values on either side of it. Are there any standard approaches for dealing with runs of missing values?

Answer

The zoo package has several functions for dealing with NA values. One of the following functions might suit your needs:

  • na.locf: Last observation carried forward. Using the parameter fromLast = TRUE corresponds to next observation carried backward (NOCB).
  • na.aggregate: Replace the NA's with some aggregated value. The default aggregation function is the mean, but you can specify other functions as well. See ?na.aggregate for more info.
  • na.approx: NA's are replaced with linearly interpolated values.
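As a rough illustration of the options mentioned in the list, the sketch below applies each function to a small made-up vector. The vector x and the choice of median as the aggregation function are assumptions made for this example only; they are not part of the original data.

```r
library(zoo)

# Toy vector (hypothetical) with two consecutive NAs in the middle
x <- c(1, 2, NA, NA, 5, 6)

# Last observation carried forward: both NAs become 2
locf <- na.locf(x)

# Next observation carried backward: both NAs become 5
nocb <- na.locf(x, fromLast = TRUE)

# Aggregate with a function other than the default mean:
# the median of the observed values 1, 2, 5, 6 is 3.5
agg_median <- na.aggregate(x, FUN = median)

# Linear interpolation between 2 and 5: the NAs become 3 and 4
app <- na.approx(x)
```

Which of these is appropriate depends on the data: carrying a value forward or backward suits slowly changing series, while interpolation assumes a roughly straight-line change across the gap.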

You can compare the outcomes to see what these functions do:

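library(zoo)
# Fill each NA with the last observed value
df$V.loc <- na.locf(df$V2)
# Fill each NA with the mean of all observed values
df$V.agg <- na.aggregate(df$V2)
# Fill each NA by linear interpolation between its neighbours
df$V.app <- na.approx(df$V2)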

This results in:

> df
                    V1          V2       V.loc       V.agg       V.app
1  2015-04-26 23:00:00  5704.27389  5704.27389  5704.27389  5704.27389
2  2015-04-27 00:00:00  4470.30868  4470.30868  4470.30868  4470.30868
3  2015-04-27 01:00:00  4552.57242  4552.57242  4552.57242  4552.57242
4  2015-04-27 02:00:00  4570.22250  4570.22250  4570.22250  4570.22250
5  2015-04-27 03:00:00          NA  4570.22250  5454.64894  6602.01119
6  2015-04-27 04:00:00          NA  4570.22250  5454.64894  8633.79987
7  2015-04-27 05:00:00          NA  4570.22250  5454.64894 10665.58856
8  2015-04-27 06:00:00 12697.37724 12697.37724 12697.37724 12697.37724
9  2015-04-27 07:00:00  5538.71119  5538.71119  5538.71119  5538.71119
10 2015-04-27 08:00:00    81.95061    81.95061    81.95061    81.95061
11 2015-04-27 09:00:00  8550.65817  8550.65817  8550.65817  8550.65817
12 2015-04-27 10:00:00  2925.76573  2925.76573  2925.76573  2925.76573
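To see where the V.app values for rows 5 to 7 come from, the interpolation can be reproduced with base R's approx(). This is only a sketch to check the arithmetic, not how zoo is necessarily implemented internally; the names left, right, filled and by_hand are made up for the illustration.

```r
# Observed values on either side of the three-NA gap (rows 4 and 8 of df)
left  <- 4570.22250032826
right <- 12697.3772408622

# Rows 4 and 8 are known; interpolate linearly at rows 5, 6 and 7
filled <- approx(x = c(4, 8), y = c(left, right), xout = 5:7)$y

# The same thing by hand: three equal steps of (right - left) / 4
step <- (right - left) / 4
by_hand <- left + step * 1:3

print(round(filled, 5))
print(round(by_hand, 5))
```

Both vectors agree with the V.app column above. For a gap of a single NA the step is half the difference, so linear interpolation gives the same result as averaging the two neighbouring values.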

Used data:

df <- structure(list(
  V1 = structure(c(1430082000, 1430085600, 1430089200, 1430092800,
                   1430096400, 1430100000, 1430103600, 1430107200,
                   1430110800, 1430114400, 1430118000, 1430121600),
                 class = c("POSIXct", "POSIXt"), tzone = ""),
  V2 = c(5704.27388916016, 4470.30868326823, 4552.57241617839,
         4570.22250032826, NA, NA, NA, 12697.3772408622,
         5538.71119009654, 81.950606473287, 8550.65816895301,
         2925.76573206584)),
  .Names = c("V1", "V2"), row.names = c(NA, -12L), class = "data.frame")