adele adele - 1 year ago 43
Python Question

Probability of next day being specific values using Python Pandas

Using the following dataframe as an example, which specifies different directions for a stock market day. What are the most Pythonic approaches for capturing stats showing what is the most likely day type that will follow each individual day?

So in this example df we have simple day types listed as 'Down','Up' and 'Flat'.

rng = pd.DataFrame({'day_direction':['Down', 'Down', 'Up', 'Down', 'Up', 'Up', 'Flat', 'Down', 'Down', 'Down', 'Up', 'Up', 'Up','Down',]},
index = pd.date_range('4/2/2014', periods=14, freq='D'))

Desired output would be something like this:

day_direction Next_day

Down Down 60%
Up 15%
Flat 35%

Up Up 55%
Flat 15%
Down 30%

Flat Flat 20%
Down 20%
Up 60%

Could this be achieved with a groupby or similar? Would be great to see any possible solutions for this.

Answer Source

Yes it is possible. First you can create a second column which contains for each day what happens on the next day (and off course you drop the last row):

rng['day2'] = rng['day_direction'].shift(-1)
rng = rng.iloc[:-1]

            day_direction   day2
2014-04-02  Down            Down
2014-04-03  Down            Up
2014-04-04  Up              Down
2014-04-05  Down            Up
2014-04-06  Up              Up
2014-04-07  Up              Flat
2014-04-08  Flat            Down
2014-04-09  Down            Down
2014-04-10  Down            Down
2014-04-11  Down            Up
2014-04-12  Up              Up
2014-04-13  Up              Up
2014-04-14  Up              Down

Then you use groupby on your two columns:

output = rng.groupby(['day_direction','day2']).size()

day_direction  day2
Down           Down    3
               Up      3
Flat           Down    1
Up             Down    2
               Flat    1
               Up      3

Note that you have to rescale it afterwards to get the probability.

EDIT: other method that goes to the rescaled output:

rng['day2'] = rng['day_direction'].shift(-1)
rng = rng.iloc[:-1]
rng['N'] = 1
output = rng.groupby(['day_direction','day2'],as_index = False).sum()
for direction in ['Down','Flat','Up']:
    output.ix[output.day_direction == direction,'N'] /= output[output.day_direction == direction].N.sum()
output.set_index(['day_direction','day2'],inplace = True)
output['N'] = [str(int(100*w))+'%' for w in output.values]

day_direction   day2    
Down            Down    50%
                Up      50%
Flat            Down    100%
Up              Down    33%
                Flat    16%
                Up      50%

EDIT: This second method as a function:

def my_function(data, col_name, list_values):
    Inputs: - Your dataframe
            - The name of the column containing the values to predict on, as a string
            - The list of the possible values
    data = data[[col_name]]
    data['day2'] = data[col_name].shift(-1)
    data['N'] = 1
    output = data.groupby([col_name,'day2'],as_index = False).agg({'N':np.sum})
    for direction in list_values:
        output.ix[output[col_name] == direction,'N'] /= output[output[col_name] == direction].N.sum()
    output['N'] = [str(int(100*w))+'%' for w in output.values]
    return output