sparrow - 4 months ago
Python Question

Apply a for loop to multiple DataFrames in Pandas

I have multiple DataFrames that I want to do the same thing to.

First I create a list of the DataFrames. All of them have the same column called 'result'.

df_list = [df1,df2,df3]


I want to keep only the rows whose 'result' value is 'passed' in every DataFrame, so I use a for loop on my list:

for df in df_list:
    df = df[df['result'] == 'passed']


...but this does not work: the rows are not filtered out of each DataFrame.

If I filter each one separately then it does work.

df1 = df1[df1['result'] == 'passed']
df2 = df2[df2['result'] == 'passed']
df3 = df3[df3['result'] == 'passed']

Answer

This is because every time you take a subset like df[<whatever>] you get back a new DataFrame. The loop assigns that new DataFrame to the looping variable df, which is rebound on the next iteration (although you do keep the last one), so df1, df2 and df3 are never touched. This is similar to slicing lists:

>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1,list2:
...   lyst = lyst[1:-1]
... 
>>> list1, list2
([1, 2, 3, 4], [11, 12, 13, 14])
>>> lyst
[12, 13]
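For contrast with the loop above, here is a minimal sketch of mutating the lists in place. It uses slice assignment (my choice for illustration; any mutating operation would do), which changes the contents of the existing list object instead of rebinding the loop variable.

```python
list1 = [1, 2, 3, 4]
list2 = [11, 12, 13, 14]

for lyst in (list1, list2):
    # Slice assignment replaces the contents of the existing list object,
    # so the objects that list1 and list2 refer to really change.
    lyst[:] = lyst[1:-1]

print(list1, list2)  # [2, 3] [12, 13]
```

Here the names list1 and list2 still point at the same objects as before the loop; only the contents of those objects changed.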

To actually modify a list in place, you usually need a mutator method. The equivalent for a DataFrame is assignment through an indexer such as .loc or .iloc (.ix also worked in older pandas but has since been removed), combined with the .dropna method, being careful to pass the inplace=True argument. Suppose I have three DataFrames and I want to keep only the rows where my second column (labelled 1) is positive:

Warning: this way is not ideal; see the edit further down for a better way.

In [11]: df1
Out[11]: 
          0         1         2         3
0  0.957288 -0.170286  0.406841 -3.058443
1  1.762343 -1.837631 -0.867520  1.666193
2  0.618665  0.660312 -1.319740 -0.024854
3 -2.008017 -0.445997 -0.028739 -0.227665
4  0.638419 -0.271300 -0.918894  1.524009
5  0.957006  1.181246  0.513298  0.370174
6  0.613378 -0.852546 -1.778761 -1.386848
7 -1.891993 -0.304533 -1.427700  0.099904

In [12]: df2
Out[12]: 
          0         1         2         3
0 -0.521018  0.407258 -1.167445 -0.363503
1 -0.879489  0.008560  0.224466 -0.165863
2  0.550845 -0.102224 -0.575909 -0.404770
3 -1.171828 -0.912451 -1.197273  0.719489
4 -0.887862  1.073306  0.351835  0.313953
5 -0.517824 -0.096929 -0.300282  0.716020
6 -1.121527  0.183219  0.938509  0.842882
7  0.003498 -2.241854 -1.146984 -0.751192

In [13]: df3
Out[13]: 
          0         1         2         3
0  0.240411  0.795132 -0.305770 -0.332253
1 -1.162097  0.055346  0.094363 -1.254859
2 -0.493466 -0.717872  1.090417 -0.591872
3  1.021246 -0.060453 -0.013952  0.304933
4 -0.859882 -0.947950  0.562609  1.313632
5  0.917199  1.186865  0.354839 -1.771787
6 -0.694799 -0.695505 -1.077890 -0.880563
7  1.088068 -0.893466 -0.188419 -0.451623

In [14]: for df in df1, df2, df3:
   ....:     df.loc[:,:] = df.loc[df[1] > 0,:]
   ....:     df.dropna(inplace = True,axis =0)
   ....:     

In [15]: df1
Out[15]: 
          0         1         2         3
2  0.618665  0.660312 -1.319740 -0.024854
5  0.957006  1.181246  0.513298  0.370174

In [16]: df2
Out[16]: 
          0         1         2         3
0 -0.521018  0.407258 -1.167445 -0.363503
1 -0.879489  0.008560  0.224466 -0.165863
4 -0.887862  1.073306  0.351835  0.313953
6 -1.121527  0.183219  0.938509  0.842882

In [17]: df3
Out[17]: 
          0         1         2         3
0  0.240411  0.795132 -0.305770 -0.332253
1 -1.162097  0.055346  0.094363 -1.254859
5  0.917199  1.186865  0.354839 -1.771787

Edited to Add:

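I think I found a better way, using just the .drop method: look up the index labels of the rows to discard (here, the rows where column 1 is negative) and drop them in place.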

In [21]: df1
Out[21]: 
          0         1         2         3
0 -0.804913 -0.481498  0.076843  1.136567
1 -0.457197 -0.903681 -0.474828  1.289443
2 -0.820710  1.610072  0.175455  0.712052
3  0.715610 -0.178728 -0.664992  1.261465
4 -0.297114 -0.591935  0.487698  0.760450
5  1.035231 -0.108825 -1.058996  0.056320
6  1.579931  0.958331 -0.653261 -0.171245
7  0.685427  1.447411  0.001002  0.241999

In [22]: df2
Out[22]: 
          0         1         2         3
0  1.660864  0.110002  0.366881  1.765541
1 -0.627716  1.341457 -0.552313  0.578854
2  0.277738  0.128419 -0.279720 -1.197483
3 -1.294724  1.396698  0.108767  1.353454
4 -0.379995  0.215192  1.446584  0.530020
5  0.557042  0.339192 -0.105808 -0.693267
6  1.293941  0.203973 -3.051011  1.638143
7 -0.909982  1.998656 -0.057350  2.279443

In [23]: df3
Out[23]: 
          0         1         2         3
0 -0.002327 -2.054557 -1.752107 -0.911178
1 -0.998328 -1.119856  1.468124 -0.961131
2 -0.048568  0.373192 -0.666330  0.867719
3  0.533597 -1.222963  0.119789 -0.037949
4  1.203075 -0.773511  0.475809  1.352943
5 -0.984069 -0.352267 -0.313516  0.138259
6  0.114596  0.354404  2.119963 -0.452462
7 -1.033029 -0.787237  0.479321 -0.818260


In [25]: for df in df1,df2,df3:
   ....:     df.drop(df.index[df[1] < 0],axis=0,inplace=True)
   ....:     

In [26]: df1
Out[26]: 
          0         1         2         3
2 -0.820710  1.610072  0.175455  0.712052
6  1.579931  0.958331 -0.653261 -0.171245
7  0.685427  1.447411  0.001002  0.241999

In [27]: df2
Out[27]: 
          0         1         2         3
0  1.660864  0.110002  0.366881  1.765541
1 -0.627716  1.341457 -0.552313  0.578854
2  0.277738  0.128419 -0.279720 -1.197483
3 -1.294724  1.396698  0.108767  1.353454
4 -0.379995  0.215192  1.446584  0.530020
5  0.557042  0.339192 -0.105808 -0.693267
6  1.293941  0.203973 -3.051011  1.638143
7 -0.909982  1.998656 -0.057350  2.279443

In [28]: df3
Out[28]: 
          0         1         2         3
2 -0.048568  0.373192 -0.666330  0.867719
6  0.114596  0.354404  2.119963 -0.452462

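It is also faster, at least in this quick benchmark. Note that np.random.random draws values from [0, 1), so in practice neither statement removes any rows here; the timings compare overhead only: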

In [8]: timeit.Timer(stmt="df.loc[:,:] = df.loc[df[1] > 0, :];df.dropna(inplace = True,axis =0)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[8]: 23.69621358400036

In [9]: timeit.Timer(stmt="df.drop(df.index[df[1] < 0],axis=0,inplace=True)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[9]: 11.476448250003159
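If modifying the DataFrames in place is not a requirement, another option (not part of the answer above, just a sketch) is to keep the boolean-mask filter from the question and store its results: build a new list of filtered DataFrames, then rebind the original names if they are still needed. The sample data and the 'score' column are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the 'result' column is from the question.
df1 = pd.DataFrame({"result": ["passed", "failed", "passed"], "score": [1, 2, 3]})
df2 = pd.DataFrame({"result": ["failed", "passed", "failed"], "score": [4, 5, 6]})
df_list = [df1, df2]

# Each df[...] returns a new DataFrame; collecting the results in a new
# list keeps them instead of discarding them on the next iteration.
df_list = [df[df["result"] == "passed"] for df in df_list]

# Rebind the original names only if the rest of the code still uses them.
df1, df2 = df_list

print(df1)
print(df2)
```

This keeps the original objects untouched, which may be preferable when other code still holds references to the unfiltered data.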