Danwise Danwise - 5 months ago 62
Python Question

Multi-dimensional/Nested DataFrame/Dataset/Panel in Pandas

I would like to store some multidimensional data in a pandas dataframe or panel such that I would like to be able to return for example:


  1. All the times for Runner A, Race A

  2. All the times(and names) for Race A for a certain year say 2015



Example data would look something like this, note that not all runners will have data for all years or all races.

Could anyone suggest a good way to do this with Pandas or any other way?

Name | Gender | Age
Runner A | Male | 35
Race A
Year | Time
2015 | 2:35:09
2014 | 2:47:34
2013 | 2:50:12
Race B
Year | Time
2013 | 1:32:07

Runner B | Male | 29
Race A
Year | Time
2015 | 3:05:56

Runner C | Female | 32
Race B
Year | Time
1998 | 1:29:43


EDIT: I tried to keep the example data simple but in reality I have a fair amount of data in the Runner Profile which I'd prefer not to store on every line.

In addition I have another level of data for certain races. So for Race A/2015 for example I would like to have another level of data for split times, average paces etc.

Name | Gender | Age | etc...
Runner A | Male | 35 | etc...
Race A
Year | Time
2015 | 2:35:09
Split 1 Distance | Split 1 Time | Split 1 Pace | etc...

Answer

I think you can use Multiindex and then select data by slicers:

import pandas as pd

df = pd.DataFrame({'Time': {('Runner A', 'Male', 35, 'Race A', 2014): '2:47:34', ('Runner C', 'Female', 32, 'Race B', 1998): '1:29:43', ('Runner B', 'Male', 29, 'Race A', 2015): '3:05:56', ('Runner A', 'Male', 35, 'Race A', 2013): '2:50:12', ('Runner A', 'Male', 35, 'Race B', 2013): '1:32:07', ('Runner A', 'Male', 35, 'Race A', 2015): '2:35:09'}})
print (df)
                                   Time
Runner A Male   35 Race A 2013  2:50:12
                          2014  2:47:34
                          2015  2:35:09
                   Race B 2013  1:32:07
Runner B Male   29 Race A 2015  3:05:56
Runner C Female 32 Race B 1998  1:29:43

#index has to be fully lexsorted
df.sort_index(inplace=True)
print (df)
                                   Time
Runner A Male   35 Race A 2013  2:50:12
                          2014  2:47:34
                          2015  2:35:09
                   Race B 2013  1:32:07
Runner B Male   29 Race A 2015  3:05:56
Runner C Female 32 Race B 1998  1:29:43

idx = pd.IndexSlice
print (df.loc[idx['Runner A',:,:,'Race A',:],:])
                                 Time
Runner A Male 35 Race A 2013  2:50:12
                        2014  2:47:34
                        2015  2:35:09

print (df.loc[idx[:,:,:,'Race A',2015],:])
                                 Time
Runner A Male 35 Race A 2015  2:35:09
Runner B Male 29 Race A 2015  3:05:56