Ian Fiddes Ian Fiddes - 1 month ago 5
Python Question

filtering a DataFrame based on rows being less than a percent of sum of any column

Here are my example data:

{'Rhesus': {('count', u'augCGP,transMap'): 6.0, ('count', u'augTM,transMap'): 11563.0, ('count', u'transMap'): 39930.0, ('count', u'augTM'): 5114.0, ('count', u'augCGP,augTM,augTMR,transMap'): 27.0, ('count', u'augCGP,augTMR'): 1.0, ('count', u'augTMR,transMap'): 145.0, ('count', u'augTMR'): 4217.0, ('count', u'augCGP,augTMR,transMap'): nan, ('count', u'augCGP,augTM,augTMR'): nan, ('count', u'augCGP'): 4239.0, ('count', u'augCGP,augTM,transMap'): 3.0, ('count', u'augTM,augTMR,transMap'): 6296.0, ('count', u'augTM,augTMR'): 3357.0}, 'Susie': {('count', u'augCGP,transMap'): 11.0, ('count', u'augTM,transMap'): 10821.0, ('count', u'transMap'): 41300.0, ('count', u'augTM'): 2894.0, ('count', u'augCGP,augTM,augTMR,transMap'): 43.0, ('count', u'augCGP,augTMR'): nan, ('count', u'augTMR,transMap'): 353.0, ('count', u'augTMR'): 5399.0, ('count', u'augCGP,augTMR,transMap'): 1.0, ('count', u'augCGP,augTM,augTMR'): 1.0, ('count', u'augCGP'): 2740.0, ('count', u'augCGP,augTM,transMap'): 2.0, ('count', u'augTM,augTMR,transMap'): 10196.0, ('count', u'augTM,augTMR'): 2789.0}, 'Clint': {('count', u'augCGP,transMap'): 16.0, ('count', u'augTM,transMap'): 17341.0, ('count', u'transMap'): 39284.0, ('count', u'augTM'): 2888.0, ('count', u'augCGP,augTM,augTMR,transMap'): 80.0, ('count', u'augCGP,augTMR'): 1.0, ('count', u'augTMR,transMap'): 144.0, ('count', u'augTMR'): 2881.0, ('count', u'augCGP,augTMR,transMap'): nan, ('count', u'augCGP,augTM,augTMR'): 1.0, ('count', u'augCGP'): 2338.0, ('count', u'augCGP,augTM,transMap'): 8.0, ('count', u'augTM,augTMR,transMap'): 8725.0, ('count', u'augTM,augTMR'): 1441.0}, 'Orangutan': {('count', u'augCGP,transMap'): 7.0, ('count', u'augTM,transMap'): 6568.0, ('count', u'transMap'): 46113.0, ('count', u'augTM'): 3656.0, ('count', u'augCGP,augTM,augTMR,transMap'): 17.0, ('count', u'augCGP,augTMR'): nan, ('count', u'augTMR,transMap'): 284.0, ('count', u'augTMR'): 5952.0, ('count', u'augCGP,augTMR,transMap'): 1.0, ('count', u'augCGP,augTM,augTMR'): 1.0, ('count', u'augCGP'): 5753.0, ('count', u'augCGP,augTM,transMap'): 3.0, ('count', u'augTM,augTMR,transMap'): 6567.0, ('count', u'augTM,augTMR'): 3520.0}, 'Gibbon': {('count', u'augCGP,transMap'): 5.0, ('count', u'augTM,transMap'): 6828.0, ('count', u'transMap'): 44285.0, ('count', u'augTM'): 4313.0, ('count', u'augCGP,augTM,augTMR,transMap'): 16.0, ('count', u'augCGP,augTMR'): nan, ('count', u'augTMR,transMap'): 187.0, ('count', u'augTMR'): 6550.0, ('count', u'augCGP,augTMR,transMap'): nan, ('count', u'augCGP,augTM,augTMR'): 1.0, ('count', u'augCGP'): 4178.0, ('count', u'augCGP,augTM,transMap'): nan, ('count', u'augTM,augTMR,transMap'): 5839.0, ('count', u'augTM,augTMR'): 3882.0}}


Which are a DataFrame that looks like:

>>> df
genome Clint Gibbon Orangutan Rhesus \
Transcript Modes
count augCGP 2338.0 4178.0 5753.0 4239.0
augCGP,augTM,augTMR 1.0 1.0 1.0 NaN
augCGP,augTM,augTMR,transMap 80.0 16.0 17.0 27.0
augCGP,augTM,transMap 8.0 NaN 3.0 3.0
augCGP,augTMR 1.0 NaN NaN 1.0
augCGP,augTMR,transMap NaN NaN 1.0 NaN
augCGP,transMap 16.0 5.0 7.0 6.0
augTM 2888.0 4313.0 3656.0 5114.0
augTM,augTMR 1441.0 3882.0 3520.0 3357.0
augTM,augTMR,transMap 8725.0 5839.0 6567.0 6296.0
augTM,transMap 17341.0 6828.0 6568.0 11563.0
augTMR 2881.0 6550.0 5952.0 4217.0
augTMR,transMap 144.0 187.0 284.0 145.0
transMap 39284.0 44285.0 46113.0 39930.0

genome Susie
Transcript Modes
count augCGP 2740.0
augCGP,augTM,augTMR 1.0
augCGP,augTM,augTMR,transMap 43.0
augCGP,augTM,transMap 2.0
augCGP,augTMR NaN
augCGP,augTMR,transMap 1.0
augCGP,transMap 11.0
augTM 2894.0
augTM,augTMR 2789.0
augTM,augTMR,transMap 10196.0
augTM,transMap 10821.0
augTMR 5399.0
augTMR,transMap 353.0
transMap 41300.0


As you can see, a few of these categories have very few entries. I want to filter each row (
Transcript Modes
) such that if they represent fewer than 1% of the total for every column, they are dropped. So, my resulting dataframe would look like:

>>> df
genome Clint Gibbon Orangutan Rhesus \
Transcript Modes
count augCGP 2338.0 4178.0 5753.0 4239.0
augTM 2888.0 4313.0 3656.0 5114.0
augTM,augTMR 1441.0 3882.0 3520.0 3357.0
augTM,augTMR,transMap 8725.0 5839.0 6567.0 6296.0
augTM,transMap 17341.0 6828.0 6568.0 11563.0
augTMR 2881.0 6550.0 5952.0 4217.0
transMap 39284.0 44285.0 46113.0 39930.0

genome Susie
Transcript Modes
count augCGP 2740.0
augTM 2894.0
augTM,augTMR 2789.0
augTM,augTMR,transMap 10196.0
augTM,transMap 10821.0
augTMR 5399.0
transMap 41300.0

Answer
row_sum = df.sum(axis=1)
total_sum = row_sum.sum()
print(df.loc[row_sum/total_sum > 0.01])

yields

                               Clint   Gibbon  Orangutan   Rhesus    Susie
count augCGP                  2338.0   4178.0     5753.0   4239.0   2740.0
      augTM                   2888.0   4313.0     3656.0   5114.0   2894.0
      augTM,augTMR            1441.0   3882.0     3520.0   3357.0   2789.0
      augTM,augTMR,transMap   8725.0   5839.0     6567.0   6296.0  10196.0
      augTM,transMap         17341.0   6828.0     6568.0  11563.0  10821.0
      augTMR                  2881.0   6550.0     5952.0   4217.0   5399.0
      transMap               39284.0  44285.0    46113.0  39930.0  41300.0
Comments