Isaac Isaac - 1 month ago 7
Python Question

Search and replace missing lines in the data structure by pandas

My data files have many missing lines (these are time series data). I would like to find

The location of these missing lines (the missing of values at time points) and replace them with the lines with nan.

For this purpose, I used pandas. However, it took too much time. So, I even couldn't check whether my code is really working or not.

for INPUT in FileList:

data=pd.read_csv(INPUT,sep='#',index_col=[1],parse_dates=[1])
print data
start = data.head(1).index
end = data.tail(1).index

timestamp = pd.date_range('2013-07-01 00:00:00','2013-09-30 23:59:00',freq="T")

N=timestamp.size

cr =pd.DataFrame(NA,columns=data.columns,index=timestamp)

for i in range(N):

s="%04d%02d%02d%02d%02d"%(timestamp.year[i],timestamp.month[i],timestamp.hour[i],
timestamp.minute[i])
try:
cr.ix[timestamp[i]]=data.ix[s]
except ValueError:
print INPUT,s,"ValueError"
except KeyError:
print INPUT,s,"KeyError"
output = FileList[i] + "result"


If you have any idea to reduce running time of this code, I will really appreciate it.

Answer

Create a sample series by taking 100k rows from the regular index, simulating your read_csv

In [22]: pd.set_option('max_rows',10)

In [23]: index = pd.date_range('2013-07-01 00:00:00','2013-09-30 23:59:00',freq="T")

In [24]: df = DataFrame(np.random.randn(100000),index=index.take(np.sort(np.random.choice(range(len(index)),100000,replace=False))))

In [25]: df
Out[25]: 
                            0
2013-07-01 00:00:00  0.009764
2013-07-01 00:01:00  0.900823
2013-07-01 00:02:00 -1.718848
2013-07-01 00:04:00  0.197763
2013-07-01 00:05:00 -0.269920
...                       ...
2013-09-30 23:54:00  0.368947
2013-09-30 23:55:00  1.488926
2013-09-30 23:57:00  0.749575
2013-09-30 23:58:00  0.964609
2013-09-30 23:59:00  0.112557

[100000 rows x 1 columns]

Reindex

In [26]: df.reindex(index)
Out[26]: 
                            0
2013-07-01 00:00:00  0.009764
2013-07-01 00:01:00  0.900823
2013-07-01 00:02:00 -1.718848
2013-07-01 00:03:00       NaN
2013-07-01 00:04:00  0.197763
...                       ...
2013-09-30 23:55:00  1.488926
2013-09-30 23:56:00       NaN
2013-09-30 23:57:00  0.749575
2013-09-30 23:58:00  0.964609
2013-09-30 23:59:00  0.112557

[132480 rows x 1 columns]

You can also do df.asfreq('T'), though that has slightly different end-point semantics (e.g. the series will start/stop at the last period in your sample)

Comments