Zanshin Zanshin - 11 days ago 3
Python Question

Dataframe does not return desired result

df = pd.DataFrame( {
'A': ['d','d','d','d','d','d','g','g','g','g','g','g','k'
,'k','k','k','k','k'],
'B': [5,5,6,7,5,6,-6,7,7,6,-7,7,-8,7,-6,6,-7,50],
'C': [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2],
'S': [2012,2013,2014,2015,2016,2012,2012,2014,2015,2016
,2012,2013,2012,2013,2014,2015,2016,2014]
} );

df = (df.B + df.C).groupby([df.A,df.S]).agg(['sum','size'])
.unstack(fill_value=0)

df1 = df.groupby(level=0, axis=1).sum()

new_cols= list(zip(df1.columns.get_level_values(0),
['total'] * len(df.columns)))
df1.columns = pd.MultiIndex.from_tuples(new_cols)
df2 = pd.concat([df1,df], axis=1).sort_index(axis=1)

df2.columns = ['_'.join((col[0], str(col[1]))) for col in df2.columns]
df2.columns = df2.columns.str.replace('sum_','')
df2.columns = df2.columns.str.replace('size_','#')

df_without_2012 = df2.sort_index(axis=1).loc[:, '2013':'2016']
df2[((df2['2012'] < 0) | (df_without_2012.sum(axis=1) > 21))
& (df_without_2012 > 0).all(axis=1)]

#2012#2013#2014#2015#2016#total 2012 2013 2014 2015 2016 total
A
d 2 1 1 1 1 6 13 6 7 8 6 40
g 2 1 1 1 1 6 -11 8 8 8 7 20


2 questions;


  • df2
    needs to not return 'g', as in the last line the sum should be
    over 21 and it is not. what is incorrect?

  • secondly, I'd like to have
    the years next to the #years.



EDIT;

-11 8 8 8 7 20 FALSE
-9 8 8 8 7 22 TRUE
8 8 8 8 7 39 TRUE
4 4 4 4 4 20 FALSE



21 & ALL COLUMNS + or 2013:2016 +, 2012 -

Answer

Explanation is easy - total column is sum columns from 2012 to 2016.

print (-11 +  8  +  8  +  8  +  7)
20

And df_without_2012 are summed only columns from 2013 to 2016, so sum is 31:

print (8  +  8  +  8  +  7)
31

and

31 > 21
True

answer for second question is add .sort_index(axis=1, level=1):

df2 = pd.concat([df1,df], axis=1)
        .sort_index(axis=1)
        .sort_index(axis=1, level=1) 

print (df2)
   #2012  2012  #2013  2013  #2014  2014  #2015  2015  #2016  2016  #total  \
A                                                                            
d      2    13      1     6      1     7      1     8      1     6       6   
g      2   -11      1     8      1     8      1     8      1     7       6   

   total  
A         
d     40  
g     20  

EDIT:

If there is more masks, I think the best for testing is use:

df_without_2012 = df2.sort_index(axis=1).loc[:, '2013':'2016']
m1 = df2['2012'] < 0
m2 = df_without_2012.sum(axis=1) > **0**
m3 = (df_without_2012 > 0).all(axis=1)
m4 = df2.total > 21
print (m1)
print (m2)
print (m3)
print (m4)

mask = m1 & m2 **& m4** | m3 & m4
print (mask)

print (df2[mask])
Comments