user1017373 user1017373 - 26 days ago 9
Python Question

Adding new columns to data frame based on the values of multiple columns

I have a data frame whose header looks like following,


df.head()
Out[660]:
Samples variable value Type
0 PE01I 267N12.3_Beta 0.066517 Beta
1 PE01R R267N12.3_Beta 0.061617 Beta
2 PE02I 267N12.3_Beta 0.071013 Beta
3 PE02R 267N12.3_Beta 0.056623 Beta
4 PE03I 267N12.3_Beta 0.071633 Beta
5 PE01I 267N12.3_FPKM 0.000000 FPKM
6 PE01R 267N12.3_FPKM 0.003430 FPKM
7 PE02I 267N12.3_FPKM 0.272144 FPKM
8 PE02R 267N12.3_FPKM 0.005753 FPKM
9 PE03I 267N12.3_FPKM 0.078708 FPKM


And I wanted to add new columns with header name as Beta and FPKM by using from column "Type"based on their corresponding values from column "value".
So far I tried this via following one-liner,

df['Beta'] = df['Type'].map(lambda x: df.value if x == "Beta" else "FPKM")


and it give sme following output,

Samples variable value Type Beta
0 PE01I 267N12.3_Beta 0.066517 Beta 0 0.066517 1 0.061617 2 0.07...
1 PE01R 267N12.3_Beta 0.061617 Beta 0 0.066517 1 0.061617 2 0.07...
2 PE02I 267N12.3_Beta 0.071013 Beta 0 0.066517 1 0.061617 2 0.07...
3 PE02R 267N12.3_Beta 0.056623 Beta 0 0.066517 1 0.061617 2 0.07...
4 PE03I 267N12.3_Beta 0.071633 Beta 0 0.066517 1 0.061617 2 0.07...


The column Beta has three values and all column is repeating.
What I am aiming is to have a data frame which looks like,

Samples variable Beta FPKM
PE01I 267N12.3_Beta 0.066517 0
PE01R 267N12.3_Beta 0.061617 0.00343
PE02I 267N12.3_Beta 0.071013 0.272144
PE02R 267N12.3_Beta 0.056623 0.005753
PE03I 267N12.3_Beta 0.071633 0.078708


Any help would be really great..
Thank you

Answer Source

I think you need unstack:

df1 = df.set_index(['Samples','Type']).unstack()
print (df1)
               variable                    value          
Type               Beta           FPKM      Beta      FPKM
Samples                                                   
PE01I     267N12.3_Beta  267N12.3_FPKM  0.066517  0.000000
PE01R    R267N12.3_Beta  267N12.3_FPKM  0.061617  0.003430
PE02I     267N12.3_Beta  267N12.3_FPKM  0.071013  0.272144
PE02R     267N12.3_Beta  267N12.3_FPKM  0.056623  0.005753
PE03I     267N12.3_Beta  267N12.3_FPKM  0.071633  0.078708

#remove Multiindex in columns
df1.columns = ['_'.join(col) for col in df1.columns]
df1.reset_index(inplace=True)
print (df1)
  Samples   variable_Beta  variable_FPKM  value_Beta  value_FPKM
0   PE01I   267N12.3_Beta  267N12.3_FPKM    0.066517    0.000000
1   PE01R  R267N12.3_Beta  267N12.3_FPKM    0.061617    0.003430
2   PE02I   267N12.3_Beta  267N12.3_FPKM    0.071013    0.272144
3   PE02R   267N12.3_Beta  267N12.3_FPKM    0.056623    0.005753
4   PE03I   267N12.3_Beta  267N12.3_FPKM    0.071633    0.078708

#if need remove column
print (df1.drop('variable_FPKM', axis=1))
  Samples   variable_Beta  value_Beta  value_FPKM
0   PE01I   267N12.3_Beta    0.066517    0.000000
1   PE01R  R267N12.3_Beta    0.061617    0.003430
2   PE02I   267N12.3_Beta    0.071013    0.272144
3   PE02R   267N12.3_Beta    0.056623    0.005753
4   PE03I   267N12.3_Beta    0.071633    0.078708

EDIT by comment:

If get error:

ValueError: Index contains duplicate entries, cannot reshape

it means you have duplicates values in index and aggragating is necessary.

You need pivot_table and if aggfunc is np.sum or np.mean (working with numeric), string columns are omited and function ''.join works only with string values and numeric are omited.

Call function twice with different aggfunc and then use concat:

import pandas as pd

df = pd.DataFrame({'Type': {0: 'Beta', 1: 'Beta', 2: 'Beta', 3: 'Beta', 4: 'Beta', 5: 'FPKM', 6: 'FPKM', 7: 'FPKM', 8: 'FPKM', 9: 'FPKM'}, 'value': {0: 0.066516999999999993, 1: 0.061616999999999998, 2: 0.071012999999999993, 3: 0.056623, 4: 0.071633000000000002, 5: 0.0, 6: 0.0034299999999999999, 7: 0.272144, 8: 0.0057530000000000003, 9: 0.078708}, 'variable': {0: '267N12.3_Beta', 1: 'R267N12.3_Beta', 2: '267N12.3_Beta', 3: '267N12.3_Beta', 4: '267N12.3_Beta', 5: '267N12.3_FPKM', 6: '267N12.3_FPKM', 7: '267N12.3_FPKM', 8: '267N12.3_FPKM', 9: '267N12.3_FPKM'}, 'Samples': {0: 'PE01I', 1: 'PE01I', 2: 'PE02I', 3: 'PE02R', 4: 'PE03I', 5: 'PE01I', 6: 'PE01R', 7: 'PE02I', 8: 'PE02R', 9: 'PE03I'}})

#changed value in second row in column Samples
print (df)
  Samples  Type     value        variable
0   PE01I  Beta  0.066517   267N12.3_Beta
1   PE01I  Beta  0.061617  R267N12.3_Beta
2   PE02I  Beta  0.071013   267N12.3_Beta
3   PE02R  Beta  0.056623   267N12.3_Beta
4   PE03I  Beta  0.071633   267N12.3_Beta
5   PE01I  FPKM  0.000000   267N12.3_FPKM
6   PE01R  FPKM  0.003430   267N12.3_FPKM
7   PE02I  FPKM  0.272144   267N12.3_FPKM
8   PE02R  FPKM  0.005753   267N12.3_FPKM
9   PE03I  FPKM  0.078708   267N12.3_FPKM
df1 = df.pivot_table(index='Samples', columns=['Type'], aggfunc=','.join)
print (df1)
                             variable               
Type                             Beta           FPKM
Samples                                             
PE01I    267N12.3_Beta,R267N12.3_Beta  267N12.3_FPKM
PE01R                            None  267N12.3_FPKM
PE02I                   267N12.3_Beta  267N12.3_FPKM
PE02R                   267N12.3_Beta  267N12.3_FPKM
PE03I                   267N12.3_Beta  267N12.3_FPKM

df2 = df.pivot_table(index='Samples', columns=['Type'], aggfunc=np.mean)
print (df2)
            value          
Type         Beta      FPKM
Samples                    
PE01I    0.064067  0.000000
PE01R         NaN  0.003430
PE02I    0.071013  0.272144
PE02R    0.056623  0.005753
PE03I    0.071633  0.078708

df3 = pd.concat([df1, df2], axis=1)
df3.columns = ['_'.join(col) for col in df3.columns]
df3.reset_index(inplace=True)
print (df3)
  Samples                 variable_Beta  variable_FPKM  value_Beta  value_FPKM
0   PE01I  267N12.3_Beta,R267N12.3_Beta  267N12.3_FPKM    0.064067    0.000000
1   PE01R                          None  267N12.3_FPKM         NaN    0.003430
2   PE02I                 267N12.3_Beta  267N12.3_FPKM    0.071013    0.272144
3   PE02R                 267N12.3_Beta  267N12.3_FPKM    0.056623    0.005753
4   PE03I                 267N12.3_Beta  267N12.3_FPKM    0.071633    0.078708