Dmitry Polonskiy Dmitry Polonskiy - 1 month ago 8
Python Question

Reindexing and filling in missing dates

I have a DataFrame that is in this format.

country_txt multiple success nkill nwound property dates \
1970Q1 Italy 0 1 0.0 0.0 0 1970-01-01
1970Q1 Italy 0 0 0.0 0.0 1 1970-01-01
1970Q4 Italy 0 0 0.0 0.0 1 1970-04-01
1971Q1 Italy 0 1 0.0 0.0 1 1971-01-01
1971Q3 Italy 0 1 0.0 0.0 1 1971-03-01


The index of this DataFrame is the year followed by the quarter which I created by using
PeriodIndex
. The
dates
column symbolizes the year and the quarter as well, the day of the month is irrelevant. I want to sum up all the other columns for each quarter. Normally thats not a problem since I can just do
italy.groupby('dates').sum()
. However, the output I get is this

multiple success nkill nwound property
dates
1970-01-01 0 1 0.0 0.0 1
1970-04-01 0 0 0.0 0.0 1
1971-01-01 0 1 0.0 0.0 1


The problem is now I want to fill in missing values for each quarter that does not appear so the output would look something like this instead,

multiple success nkill nwound property
dates
1970-01-01 0 1 0.0 0.0 1
1970-02-01 0 0 0.0 0.0 0
1970-03-01 0 0 0.0 0.0 0
1970-04-01 0 0 0.0 0.0 1
1971-01-01 0 1 0.0 0.0 1


I have looked up previous questions and found that people recommend I do this
df.reindex(pd.date_range("1970-01-01", "2015-12-31"), fill_value = 0)
. The problem is this, this will give my data lots of extra rows since I would have data on a daily basis, which would then mean I have to find a way to sum the data by quarter all over again. So how would I achieve this goal without creating all these daily 0's and finding a way to sum everything up again by quarter?

For reference, the
dates
column was created by first changing the original months column into a value from 1-4 (to signify quarter) and then converted to timedelta format by doing this

df['dates'] = df.iyear.astype(str).str.cat(df.imonth.astype(str))
df['dates'] = pd.to_datetime(df['dates'], format = '%Y%m')

Answer

UPDATE: parsing real dates from the string index. It should work for older Pandas versions as well:

In [212]: df.set_index(pd.to_datetime(df.index)).resample('QS').sum().fillna(0)
Out[212]:
            multiple  success  nkill  nwound  property
idx
1970-01-01       0.0      1.0    0.0     0.0       1.0
1970-04-01       0.0      0.0    0.0     0.0       0.0
1970-07-01       0.0      0.0    0.0     0.0       0.0
1970-10-01       0.0      0.0    0.0     0.0       1.0
1971-01-01       0.0      1.0    0.0     0.0       1.0
1971-04-01       0.0      0.0    0.0     0.0       0.0
1971-07-01       0.0      1.0    0.0     0.0       1.0

OLD answer (assuming the dates column contains real dates)

try this:

In [205]: df.resample('QS', on='dates').sum().fillna(0)
Out[205]:
            multiple  success  nkill  nwound  property
dates
1970-01-01       0.0      1.0    0.0     0.0       1.0
1970-04-01       0.0      0.0    0.0     0.0       1.0
1970-07-01       0.0      0.0    0.0     0.0       0.0
1970-10-01       0.0      0.0    0.0     0.0       0.0
1971-01-01       0.0      2.0    0.0     0.0       2.0

or resampled "monthly":

In [207]: df.resample('QS', on='dates').sum().resample('MS').sum().fillna(0)
Out[207]:
            multiple  success  nkill  nwound  property
dates
1970-01-01       0.0      1.0    0.0     0.0       1.0
1970-02-01       0.0      0.0    0.0     0.0       0.0
1970-03-01       0.0      0.0    0.0     0.0       0.0
1970-04-01       0.0      0.0    0.0     0.0       1.0
1970-05-01       0.0      0.0    0.0     0.0       0.0
1970-06-01       0.0      0.0    0.0     0.0       0.0
1970-07-01       0.0      0.0    0.0     0.0       0.0
1970-08-01       0.0      0.0    0.0     0.0       0.0
1970-09-01       0.0      0.0    0.0     0.0       0.0
1970-10-01       0.0      0.0    0.0     0.0       0.0
1970-11-01       0.0      0.0    0.0     0.0       0.0
1970-12-01       0.0      0.0    0.0     0.0       0.0
1971-01-01       0.0      2.0    0.0     0.0       2.0