Jaime Jaime - 2 months ago 11
Python Question

Looping over pandas data frame applying formula to each value

I have the following pandas data frame:

PC1 PC2 PC3 PC4 PC5 PC6 PC7
ind
NA06984 -0.0082 -0.0594 -0.0148 -0.0569 -0.1128 -0.0276 -0.0217
NA06986 -0.0131 -0.0659 -0.0426 0.0654 0.0473 0.0603 -0.0454
NA06989 -0.0073 -0.0551 -0.0457 0.0971 -0.0051 -0.0123 0.0035
NA06994 -0.0051 -0.0599 -0.0239 0.0930 0.0765 0.0321 0.0392
NA07000 -0.0046 -0.0362 0.0006 -0.0639 -0.0197 -0.0132 0.0631
NA07037 -0.0132 -0.0600 -0.0252 -0.0381 -0.0091 0.0005 0.0235
NA07048 -0.0128 -0.0653 -0.0234 -0.0417 0.0233 0.1034 0.0180
NA07051 -0.0028 -0.0591 -0.0117 -0.0791 -0.0387 0.0102 -0.0840
NA07056 -0.0121 -0.0389 0.0113 -0.0754 0.0226 -0.0304 -0.0490
NA07347 -0.0192 -0.0441 -0.0588 0.1099 -0.0414 0.0505 0.0295
NA07357 -0.0100 -0.0360 -0.0268 -0.0621 -0.0737 -0.0090 0.0379


and I would like to standardize the distributions of each column, i.e. applying the formula

column_i[row_j] - column_i.mean()) / column_i.std()

for every value in every column, and substitute the original data frame with these values.

So far I have come up with

for index, row in evec_pandas.iterrows():
new_row = None
evec_pandas.loc[index,'PC1'] = (row['PC1'] - evec_pandas['PC1'].mean()) / evec_pandas['PC1'].std()

print evec_pandas


but the results are

PC1 PC2 PC3 PC4 PC5 PC6 PC7
ind
NA06984 0.343471 -0.0594 -0.0148 -0.0569 -0.1128 -0.0276 -0.0217
NA06986 -0.330077 -0.0659 -0.0426 0.0654 0.0473 0.0603 -0.0454
NA06989 -0.003975 -0.0551 -0.0457 0.0971 -0.0051 -0.0123 0.0035
NA06994 0.008607 -0.0599 -0.0239 0.0930 0.0765 0.0321 0.0392
NA07000 0.003659 -0.0362 0.0006 -0.0639 -0.0197 -0.0132 0.0631
NA07037 -0.058300 -0.0600 -0.0252 -0.0381 -0.0091 0.0005 0.0235
NA07048 -0.028319 -0.0653 -0.0234 -0.0417 0.0233 0.1034 0.0180
NA07051 0.046818 -0.0591 -0.0117 -0.0791 -0.0387 0.0102 -0.0840
NA07056 -0.043817 -0.0389 0.0113 -0.0754 0.0226 -0.0304 -0.0490
NA07347 -0.071195 -0.0441 -0.0588 0.1099 -0.0414 0.0505 0.0295
NA07357 0.019495 -0.0360 -0.0268 -0.0621 -0.0737 -0.0090 0.0379


The first value is correct (0.343471), but the rest of the values in the PC1 column are not, and of course the rest of columns have no changes. If I use:

for index, row in evec_pandas.iterrows():
new_row = None
new_row = (row['PC1'] - evec_pandas['PC1'].mean()) / evec_pandas['PC1'].std()
print new_row


I do obtain the PC1 column as it should be, but as an independent object, not inside the data frame:

0.343471311655
-0.673732188246
0.530304607555
0.987008219756
1.09080449526
-0.694491443346
-0.611454422946
1.46447108706
-0.466139637246
-1.94004674935
-0.0301952801455


So I need to substitute PC1 with these values, and then do the same for each column; I had thought of something like

for index, column in evec_pandas.iteritems():
for index, row in evec_pandas.iterrows():
new_row = None
evec_pandas.loc[index,column] = (row[column] - evec_pandas[column].mean()) / evec_pandas[column].std()


But I understand it won't work like this. Any ideas?

The desired output would be:

PC1 PC2 PC3 PC4 PC5 PC6 PC7
NA06984 0.34347131 -0.5760881 0.439607045 -0.6710009 -1.8594019 -1.0130591 -0.50633142
NA06986 -0.67373219 -1.1365003 -0.929352573 0.9013689 1.0906816 1.0794999 -1.02745500
NA06989 0.53030461 -0.2053539 -1.082006343 1.3089251 0.1251327 -0.6488253 0.04777466
NA06994 0.98700822 -0.6191967 -0.008505635 1.2562128 1.6287356 0.4081670 0.83275827
NA07000 1.09080450 1.4241525 1.197951582 -0.7609975 -0.1438943 -0.6702508 1.35827952
NA07037 -0.69449144 -0.6278185 -0.072521733 -0.4292956 0.0514267 -0.3441068 0.48754139
NA07048 -0.61145442 -1.0847700 0.016115941 -0.4755796 0.6484455 2.1055441 0.36660554
NA07051 1.46447109 -0.5502229 0.592260816 -0.9564188 -0.4939979 -0.1131873 -1.87620479
NA07056 -0.46613964 1.1913658 1.724853306 -0.9088491 0.6355469 -1.0797163 -1.10661301
NA07347 -1.94004675 0.7430361 -1.727091631 1.4734904 -0.5437494 0.8461998 0.61947141
NA07357 -0.03019528 1.4413959 -0.151310775 -0.7378555 -1.1389255 -0.5702651 0.80417343

Answer

You can just do the following:

In [19]:
(df - df.mean())/df.std()

Out[19]:
              PC1       PC2       PC3       PC4       PC5       PC6       PC7
ind                                                                          
NA06984  0.343471 -0.576088  0.439607 -0.671001 -1.859402 -1.013059 -0.506331
NA06986 -0.673732 -1.136500 -0.929353  0.901369  1.090682  1.079500 -1.027455
NA06989  0.530305 -0.205354 -1.082006  1.308925  0.125133 -0.648825  0.047775
NA06994  0.987008 -0.619197 -0.008506  1.256213  1.628736  0.408167  0.832758
NA07000  1.090804  1.424152  1.197952 -0.760998 -0.143894 -0.670251  1.358280
NA07037 -0.694491 -0.627818 -0.072522 -0.429296  0.051427 -0.344107  0.487541
NA07048 -0.611454 -1.084770  0.016116 -0.475580  0.648445  2.105544  0.366606
NA07051  1.464471 -0.550223  0.592261 -0.956419 -0.493998 -0.113187 -1.876205
NA07056 -0.466140  1.191366  1.724853 -0.908849  0.635547 -1.079716 -1.106613
NA07347 -1.940047  0.743036 -1.727092  1.473490 -0.543749  0.846200  0.619471
NA07357 -0.030195  1.441396 -0.151311 -0.737856 -1.138926 -0.570265  0.804173

This will operate on the whole df so there is no need to iterate over rows/columns