CodingYourLife - 2 months ago
Python Question

skflow pandas dataset mean down each 2 lines

I have a dataset X of, let's say, 2000 rows. I want to take each pair of consecutive rows and average them column by column. The result should be a 1000-row dataset (the number of columns should stay the same).

I have already done this in MATLAB:

% MATLAB function
function [ divisibleMatrix ] = meanDown( matrix )
%MEANDOWN takes a matrix and means every 2 lines (makes it half the size)

newSize = (floor(size(matrix, 1)/2)*2); %make it divisible
divisibleMatrix = matrix(1:newSize, 1:end);

D = divisibleMatrix;
m=size(divisibleMatrix, 1);
n=size(divisibleMatrix, 2);

% compute mean for two neighboring rows
D=reshape(D, 2, m/2*n);
D=(D(1,:)+D(2,:))/2;
D=reshape(D, m/2, n);

divisibleMatrix = D;
end

Answer

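This is a one-liner. The key is that groupby accepts an arbitrary mapping of index -> group. With the default integer index, integer-dividing it by 2 (df.index // 2) gives rows 0 and 1 the key 0, rows 2 and 3 the key 1, and so on, which is the grouping you are looking for.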

In [1]: import numpy as np
   ...: import pandas as pd
   ...: pd.options.display.max_rows = 10

In [2]: df = pd.DataFrame(np.random.randn(2000,10))

In [3]: df
Out[3]: 
             0         1         2         3         4         5         6         7         8         9
0     1.424278  1.341120 -1.926183  2.277194  0.257652 -1.837933  0.548063 -1.554667  0.485864  0.939497
1    -0.389531 -0.122452  0.514899  0.112404 -1.137853  0.814050  0.464444  0.180946 -0.873092  1.376984
2     1.244440 -1.358285 -1.167748 -1.103943  0.268973  0.954938 -1.041816 -0.549772  0.639713 -0.064106
3     0.907945 -0.705092 -2.251826  0.032511  0.132661  0.101646 -0.385823 -0.197524  0.726309 -0.044143
4     0.045390 -1.476742  0.511301  0.259116  0.255900 -1.621707  1.592440 -1.792673 -0.256589 -1.626885
...        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
1995  0.452570  0.097372  0.055521  0.387842  0.188056  2.392688  0.292957 -1.141517 -0.420548 -1.357877
1996 -2.155074  0.411274  0.357251 -0.326192 -0.493771  0.805255  0.156565  0.439860 -0.149214  0.329143
1997  1.141906 -0.595052  1.054630  0.705025  0.527523 -1.328829  0.726637 -0.889798  0.672279 -1.699829
1998  1.210885  0.550444  0.903205  1.240884  0.634060  0.595759  0.155567 -0.865876  0.197398  0.194864
1999 -0.273097  0.234418  1.172747  1.993209  0.271385  0.449079 -1.029834 -0.246728 -0.110820 -1.588270

[2000 rows x 10 columns]

In [4]: df.groupby(df.index//2).mean()
Out[4]: 
            0         1         2         3         4         5         6         7         8         9
0    0.517374  0.609334 -0.705642  1.194799 -0.440100 -0.511942  0.506254 -0.686860 -0.193614  1.158240
1    1.076193 -1.031689 -1.709787 -0.535716  0.200817  0.528292 -0.713820 -0.373648  0.683011 -0.054125
2   -0.189650 -0.652461 -0.496076 -0.129063  0.209076 -1.463476  0.549773 -1.228766  0.255020 -0.231682
3   -0.804283  0.985501  0.321846 -0.570661  0.023639  0.473073  1.636425 -0.336158  0.427294 -0.063739
4    0.982331  0.088111  1.601761 -0.193683 -0.488863  1.113968  1.099340 -0.785286  0.370041 -0.095078
..        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
995  0.244260 -0.754283 -1.318084 -1.157576 -0.159194 -0.245290  0.230198 -0.996492 -0.520177  0.125455
996 -0.604840 -0.628592  0.952476  1.049358 -0.392648 -0.121538  0.544432  0.309035  0.254711 -0.664254
997 -0.006366 -0.511019 -0.855803  0.103337 -1.131138  1.942504 -0.418524 -0.132304  0.266050 -0.055807
998 -0.506584 -0.091889  0.705941  0.189417  0.016876 -0.261787  0.441601 -0.224969  0.261533 -0.685343
999  0.468894  0.392431  1.037976  1.617047  0.452722  0.522419 -0.437133 -0.556302  0.043289 -0.696703

[1000 rows x 10 columns]

In [5]: df.index//2
Out[5]: 
Int64Index([  0,   0,   1,   1,   2,   2,   3,   3,   4,   4,
            ...
            995, 995, 996, 996, 997, 997, 998, 998, 999, 999], dtype='int64', length=2000)
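
The session above relies on the default 0..n-1 index and an even number of rows. As a minimal sketch of the same idea that also mirrors the MATLAB function's handling of an odd row count, the grouping keys can be built from row positions instead of index labels. The function name mean_down and the small 7-row example frame are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd


def mean_down(df: pd.DataFrame) -> pd.DataFrame:
    """Average each pair of consecutive rows, column by column."""
    # Drop a trailing unpaired row, as the MATLAB version does.
    n_rows = (len(df) // 2) * 2
    trimmed = df.iloc[:n_rows]
    # Group by row position rather than index label, so this also works
    # when the index is not the default 0..n-1 range.
    groups = np.arange(n_rows) // 2
    return trimmed.groupby(groups).mean()


# Hypothetical example: 7 rows with a non-numeric index; the last row is dropped.
df = pd.DataFrame(np.random.randn(7, 3), index=list("abcdefg"))
halved = mean_down(df)

# Cross-check against a plain NumPy reshape of the same data.
expected = df.to_numpy()[:6].reshape(3, 2, 3).mean(axis=1)
assert np.allclose(halved.to_numpy(), expected)
```

Whether to drop or keep an unpaired last row is a design choice; skipping the iloc trim and grouping by np.arange(len(df)) // 2 would keep it as a group of one.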