piRSquared piRSquared - 7 months ago 25
Python Question

DataFrame take every 3rd row and forward fill

I have a

DataFrame
with
'Date'
and
'Id'
in the index and
'Portfolio'
in the columns. Values are weights of security within the portfolio. Within the dates level of the index, I'd like to take every 3rd date and forward fill the security weight to the date subsequent to the next "every third" date.

Setup



This is a generic
DataFrame
producer. with
df
asigned at the end.

import pandas as pd
import numpy as np
from string import uppercase

def generic_portfolio_df(start, end, freq, num_port, num_sec, seed=314):
np.random.seed(seed)
portfolios = pd.Index(['Portfolio {}'.format(i) for i in uppercase[:num_port]],
name='Portfolio')
securities = ['s{:02d}'.format(i) for i in range(num_sec)]
dates = pd.date_range(start, end, freq=freq)
return pd.DataFrame(np.random.rand(len(dates) * num_sec, num_port),
index=pd.MultiIndex.from_product([dates, securities],
names=['Date', 'Id']),
columns=portfolios
).groupby(level=0).apply(lambda x: x / x.sum())

df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)


df
looks like this:

Portfolio Portfolio A Portfolio B Portfolio C
Date Id
2014-12-31 s00 0.326164 0.201597 0.085340
s01 0.278614 0.314448 0.266392
s02 0.258958 0.089224 0.293570
s03 0.092760 0.262511 0.084208
s04 0.043503 0.132221 0.270490
2015-01-30 s00 0.094124 0.041722 0.248013
s01 0.197860 0.346862 0.265287
s02 0.232504 0.261939 0.125719
s03 0.193050 0.286359 0.337316
s04 0.282462 0.063118 0.023664
2015-02-27 s00 0.266900 0.484163 0.074970
s01 0.239319 0.083138 0.123289
s02 0.067958 0.262626 0.262548
s03 0.181974 0.108668 0.301149
s04 0.243849 0.061405 0.238044
2015-03-31 s00 0.321438 0.149010 0.125168
s01 0.217779 0.067209 0.040285
s02 0.173066 0.293539 0.417372
s03 0.048929 0.415637 0.216490
s04 0.238788 0.074605 0.200685
2015-04-30 s00 0.089122 0.135514 0.234565
s01 0.048235 0.028141 0.327739
s02 0.026016 0.039664 0.073588
s03 0.413139 0.397875 0.323671
s04 0.423487 0.398807 0.040437
2015-05-29 s00 0.135831 0.071604 0.235099
s01 0.240086 0.242436 0.131698
s02 0.304451 0.380368 0.101653
s03 0.213468 0.035276 0.372894
s04 0.106164 0.270317 0.158656


Question




Within the dates level of the index, I'd like to take every 3rd date and forward fill the security weight to the date subsequent to the next "every third" date.


I want it to look like:

Portfolio Portfolio A Portfolio B Portfolio C
Date Id
2014-12-31 s00 0.326164 0.201597 0.085340
s01 0.278614 0.314448 0.266392
s02 0.258958 0.089224 0.293570
s03 0.092760 0.262511 0.084208
s04 0.043503 0.132221 0.270490
2015-01-30 s00 0.326164 0.201597 0.085340
s01 0.278614 0.314448 0.266392
s02 0.258958 0.089224 0.293570
s03 0.092760 0.262511 0.084208
s04 0.043503 0.132221 0.270490
2015-02-27 s00 0.326164 0.201597 0.085340
s01 0.278614 0.314448 0.266392
s02 0.258958 0.089224 0.293570
s03 0.092760 0.262511 0.084208
s04 0.043503 0.132221 0.270490
2015-03-31 s00 0.321438 0.149010 0.125168
s01 0.217779 0.067209 0.040285
s02 0.173066 0.293539 0.417372
s03 0.048929 0.415637 0.216490
s04 0.238788 0.074605 0.200685
2015-04-30 s00 0.321438 0.149010 0.125168
s01 0.217779 0.067209 0.040285
s02 0.173066 0.293539 0.417372
s03 0.048929 0.415637 0.216490
s04 0.238788 0.074605 0.200685
2015-05-29 s00 0.321438 0.149010 0.125168
s01 0.217779 0.067209 0.040285
s02 0.173066 0.293539 0.417372
s03 0.048929 0.415637 0.216490
s04 0.238788 0.074605 0.200685


Conclusion



Though I'm still interested in other's answers. I chose Alexander's answer over my own for the following reason:

%%timeit
df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)
df = df.unstack()
df.iloc[3:] = np.nan
df = df.ffill(limit=3).stack()

100 loops, best of 3: 11.6 ms per loop

%%timeit
df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)
df0 = df.loc[pd.IndexSlice[::3, :], :]
diff = df.index.difference(df0.index)
df.ix[diff] = np.nan
df.groupby(level=1).ffill(limit=3)

100 loops, best of 3: 21 ms per loop


Apparently, using
stack
and
unstack
is more efficient.

Answer
# Create Boolean index of rows to delete (every third row is marked as False).
idx = len(df.unstack())
idx = [i % 3 > 0 for i in range(idx)]
>>> idx
[False, True, True, False, True, True]

# Unstack the dataframe so you just have a column of dates 
df = df.unstack()

# Delete those in the `idx` index.
df.loc[idx, :] = np.nan

# Forward fill the retained dates, and then restack your dataframe.
df = df.ffill().stack()

>>> df.tail()
Portfolio       Portfolio A  Portfolio B  Portfolio C
Date       Id                                        
2015-05-29 s00     0.321438     0.149010     0.125168
           s01     0.217779     0.067209     0.040285
           s02     0.173066     0.293539     0.417372
           s03     0.048929     0.415637     0.216490
           s04     0.238788     0.074605     0.200685
Comments