adele - 1 year ago 69
Python Question

# Probability of next day being specific values using Python Pandas

Using the following dataframe as an example, which specifies different directions for a stock market day. What are the most Pythonic approaches for capturing stats showing what is the most likely day type that will follow each individual day?

So in this example df we have simple day types listed as 'Down','Up' and 'Flat'.

`````` rng = pd.DataFrame({'day_direction':['Down', 'Down', 'Up', 'Down', 'Up', 'Up', 'Flat', 'Down', 'Down', 'Down', 'Up', 'Up', 'Up','Down',]},
index = pd.date_range('4/2/2014', periods=14, freq='D'))
rng
``````

Desired output would be something like this:

``````day_direction     Next_day

Down          Down 60%
Up   15%
Flat 35%

Up            Up   55%
Flat 15%
Down 30%

Flat          Flat 20%
Down 20%
Up   60%
``````

Could this be achieved with a groupby or similar? Would be great to see any possible solutions for this.

Yes it is possible. First you can create a second column which contains for each day what happens on the next day (and off course you drop the last row):

``````rng['day2'] = rng['day_direction'].shift(-1)
rng = rng.iloc[:-1]

day_direction   day2
2014-04-02  Down            Down
2014-04-03  Down            Up
2014-04-04  Up              Down
2014-04-05  Down            Up
2014-04-06  Up              Up
2014-04-07  Up              Flat
2014-04-08  Flat            Down
2014-04-09  Down            Down
2014-04-10  Down            Down
2014-04-11  Down            Up
2014-04-12  Up              Up
2014-04-13  Up              Up
2014-04-14  Up              Down
``````

Then you use `groupby` on your two columns:

``````output = rng.groupby(['day_direction','day2']).size()

day_direction  day2
Down           Down    3
Up      3
Flat           Down    1
Up             Down    2
Flat    1
Up      3
``````

Note that you have to rescale it afterwards to get the probability.

EDIT: other method that goes to the rescaled output:

``````rng['day2'] = rng['day_direction'].shift(-1)
rng = rng.iloc[:-1]
rng['N'] = 1
output = rng.groupby(['day_direction','day2'],as_index = False).sum()
for direction in ['Down','Flat','Up']:
output.ix[output.day_direction == direction,'N'] /= output[output.day_direction == direction].N.sum()
output.set_index(['day_direction','day2'],inplace = True)
output['N'] = [str(int(100*w))+'%' for w in output.values]

N
day_direction   day2
Down            Down    50%
Up      50%
Flat            Down    100%
Up              Down    33%
Flat    16%
Up      50%
``````

EDIT: This second method as a function:

``````def my_function(data, col_name, list_values):
"""
- The name of the column containing the values to predict on, as a string
- The list of the possible values
"""
data = data[[col_name]]
data['day2'] = data[col_name].shift(-1)
data.dropna(inplace=True)
data['N'] = 1
output = data.groupby([col_name,'day2'],as_index = False).agg({'N':np.sum})
for direction in list_values:
output.ix[output[col_name] == direction,'N'] /= output[output[col_name] == direction].N.sum()
output.set_index([col_name,'day2'],inplace=True)
output['N'] = [str(int(100*w))+'%' for w in output.values]
return output
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download