toasteez toasteez - 1 year ago 84
Python Question

Pandas Dataframe.Interpolate() gives different values for same index date

Using the DataFrame

date_index value
2013-01-01 0.50
2013-01-01 0.50
2013-01-01 0.50
2013-01-01 0.50
2013-01-02 1.50
2013-01-02 1.50
2013-01-02 1.50
2013-01-02 1.50
2013-01-03 0.98
2013-01-03 0.98
2013-01-03 0.98
2013-01-03 0.98
2013-01-04 NaN
2013-01-04 1.00
2013-01-04 NaN
2013-01-04 NaN
2013-01-05 1.90
2013-01-05 1.90
2013-01-05 1.90
2013-01-05 1.90
2013-01-06 2.50
2013-01-06 2.50
2013-01-06 2.50
2013-01-06 2.50
2013-01-07 2.89
2013-01-07 2.89
2013-01-07 2.89
2013-01-07 2.89
2013-01-08 NaN
2013-01-08 NaN
2013-01-08 NaN
2013-01-08 NaN
2013-01-09 3.90
2013-01-09 3.90
2013-01-09 3.90
2013-01-09 3.90
2013-01-10 5.00
2013-01-10 5.00
2013-01-10 5.00
2013-01-10 5.00


copy the above to clipboard

import pandas as pd
df = pd.read_clipboard()
df = df.set_index('date_index')


Use Interpolate to fill nan's

x = df.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='both', downcast=None)


I would have expected the interpolation to consider the same x point and as such y value should be same for each x. However this is not the case.

Dates with nan's are 2013-01-04 and 2013-01-08

Before

2013-01-04 NaN
2013-01-04 1.00
2013-01-04 NaN
2013-01-04 NaN
2013-01-08 NaN
2013-01-08 NaN
2013-01-08 NaN
2013-01-08 NaN


After

2013-01-04 0.990
2013-01-04 1.000
2013-01-04 1.300
2013-01-04 1.600
2013-01-08 3.092
2013-01-08 3.294
2013-01-08 3.496
2013-01-08 3.698


Am I understanding the use of the interpolation correctly? I expected the result of

2013-01-04 1.000
2013-01-04 1.000
2013-01-04 1.000
2013-01-04 1.000
2013-01-08 3.945
2013-01-08 3.945
2013-01-08 3.945
2013-01-08 3.945

Answer Source

The problem is that you may duplicated index values, and so the interpolation method thinks that these are additional steps to do the linear interpolation. So instead of going from 1 to 1.9, it goes 1 --> 1.3 --> 1.6 --> 1.9.

Here is a workaround solution that works by removing the duplicated index entries:

df = pd.read_clipboard()
uniqDates = df['date_index'].unique()
df = df.set_index('date_index')
df2 = df.dropna()
df2 = df2[-df2.index.duplicated()]
df2 = df2.reindex(uniqDates)
df2 = df2.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='both', downcast=None)
interpDict = df2['value'].to_dict()
df['value'] = [interpDict[x] for x in df.index]

Alternatively

You could cast the index values to datetime objects, and then use the method='time' interpolation:

df = pd.read_clipboard()
df['date_index'] = pd.to_datetime(df['date_index'])
df = df.set_index('date_index')
x = df.interpolate(method='time')
x.index = [x.strftime('%Y-%m-%d') for x in x.index]
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download