cinico cinico - 3 months ago 8
Python Question

How to organize data of several datasets into the same dataframe using pandas in Python?

I am having trouble in keeping some data organized as I want in a data frame using pandas in Python.

I would like to have a single data frame where the data would be organized in three columns (e.g.

Time
,
V
and
I
).

However, I would like to have the data of different samples in the same data frame so that I could easily select the data from
Sample#1
or
Sample#2
.

What I came up to was something like this:

df1 = pd.DataFrame({'Time': np.arange(0,10,0.5), 'V': np.random.rand(20), 'I': np.random.rand(20)})
df1['Sample']= 'sample_1'

df2 = pd.DataFrame({'Time': np.arange(0,10,0.5), 'V': np.random.rand(20), 'I': np.random.rand(20)})
df2['Sample']= 'sample_2'

df = df1.append(df2)


Notice that I added another columns called
Sample
to keep track of which data corresponds to which sample.

But then I don't know how to call the data from
sample_1
or
sample_2
from df

How can I do this and is this the right way to organize my data? Should I be using
MultiIndex
?

Answer

Yes, MultiIndex is one possible solution:

np.random.seed(1)
df1 = pd.DataFrame({'Time': np.arange(0,10,0.5), 
                    'V': np.random.rand(20), 
                    'I': np.random.rand(20)})

np.random.seed(2)
df2 = pd.DataFrame({'Time': np.arange(0,10,0.5), 
                    'V': np.random.rand(20), 
                    'I': np.random.rand(20)})

#print (df1)
#print (df2)

You can concat all DataFrames to one and in parameter keys specify each source DataFrame:

print (pd.concat([df1, df2], keys=('sample_1','sample_2')))
                    I  Time         V
sample_1 0   0.800745   0.0  0.417022
         1   0.968262   0.5  0.720324
         2   0.313424   1.0  0.000114
         3   0.692323   1.5  0.302333
         4   0.876389   2.0  0.146756
         5   0.894607   2.5  0.092339
         6   0.085044   3.0  0.186260
         7   0.039055   3.5  0.345561
         8   0.169830   4.0  0.396767
         9   0.878143   4.5  0.538817
         10  0.098347   5.0  0.419195
         11  0.421108   5.5  0.685220
         12  0.957890   6.0  0.204452
         13  0.533165   6.5  0.878117
         14  0.691877   7.0  0.027388
         15  0.315516   7.5  0.670468
         16  0.686501   8.0  0.417305
         17  0.834626   8.5  0.558690
         18  0.018288   9.0  0.140387
         19  0.750144   9.5  0.198101
sample_2 0   0.505246   0.0  0.435995
         1   0.065287   0.5  0.025926
         2   0.428122   1.0  0.549662
         3   0.096531   1.5  0.435322
         4   0.127160   2.0  0.420368
         5   0.596745   2.5  0.330335
         6   0.226012   3.0  0.204649
         7   0.106946   3.5  0.619271
         8   0.220306   4.0  0.299655
         9   0.349826   4.5  0.266827
         10  0.467787   5.0  0.621134
         11  0.201743   5.5  0.529142
         12  0.640407   6.0  0.134580
         13  0.483070   6.5  0.513578
         14  0.505237   7.0  0.184440
         15  0.386893   7.5  0.785335
         16  0.793637   8.0  0.853975
         17  0.580004   8.5  0.494237
         18  0.162299   9.0  0.846561
         19  0.700752   9.5  0.079645

Select data is possible by xs - see cross section:

print (df.xs('sample_1', level=0))
           I  Time         V
0   0.800745   0.0  0.417022
1   0.968262   0.5  0.720324
2   0.313424   1.0  0.000114
3   0.692323   1.5  0.302333
4   0.876389   2.0  0.146756
5   0.894607   2.5  0.092339
6   0.085044   3.0  0.186260
7   0.039055   3.5  0.345561
8   0.169830   4.0  0.396767
9   0.878143   4.5  0.538817
10  0.098347   5.0  0.419195
11  0.421108   5.5  0.685220
12  0.957890   6.0  0.204452
13  0.533165   6.5  0.878117
14  0.691877   7.0  0.027388
15  0.315516   7.5  0.670468
16  0.686501   8.0  0.417305
17  0.834626   8.5  0.558690
18  0.018288   9.0  0.140387
19  0.750144   9.5  0.198101

If need select only some columns:

print (df.xs('sample_1', level=0)[['Time','I']])
    Time         I
0    0.0  0.800745
1    0.5  0.968262
2    1.0  0.313424
3    1.5  0.692323
4    2.0  0.876389
5    2.5  0.894607
6    3.0  0.085044
7    3.5  0.039055
8    4.0  0.169830
9    4.5  0.878143
10   5.0  0.098347
11   5.5  0.421108
12   6.0  0.957890
13   6.5  0.533165
14   7.0  0.691877
15   7.5  0.315516
16   8.0  0.686501
17   8.5  0.834626
18   9.0  0.018288
19   9.5  0.750144

Another solution is use IndexSlice - see using slicers

idx = pd.IndexSlice
print (df.loc[idx['sample_1',:], ['Time','I']])
             Time         I
sample_1 0    0.0  0.800745
         1    0.5  0.968262
         2    1.0  0.313424
         3    1.5  0.692323
         4    2.0  0.876389
         5    2.5  0.894607
         6    3.0  0.085044
         7    3.5  0.039055
         8    4.0  0.169830
         9    4.5  0.878143
         10   5.0  0.098347
         11   5.5  0.421108
         12   6.0  0.957890
         13   6.5  0.533165
         14   7.0  0.691877
         15   7.5  0.315516
         16   8.0  0.686501
         17   8.5  0.834626
         18   9.0  0.018288
         19   9.5  0.750144

If need remove first level of Multiindex:

idx = pd.IndexSlice
print (df.loc[idx['sample_1',:], ['Time','I']].reset_index(level=0, drop=True))
    Time         I
0    0.0  0.800745
1    0.5  0.968262
2    1.0  0.313424
3    1.5  0.692323
4    2.0  0.876389
5    2.5  0.894607
6    3.0  0.085044
7    3.5  0.039055
8    4.0  0.169830
9    4.5  0.878143
10   5.0  0.098347
11   5.5  0.421108
12   6.0  0.957890
13   6.5  0.533165
14   7.0  0.691877
15   7.5  0.315516
16   8.0  0.686501
17   8.5  0.834626
18   9.0  0.018288
19   9.5  0.750144