Rudiger Wolf Rudiger Wolf - 3 months ago 13
Python Question

pandas row operation to keep only the right most non zero value per row

How to keep the right most number in each row in a dataframe?

a = [[1, 2, 0], [1, 3, 0], [1, 0, 0]]
df = pd.DataFrame(a, columns=['col1','col2','col3'])
df

col1 col2 col3
row0 1 2 NaN
row1 1 3 0
row2 1 0 0


Then after transformation

col1 col2 col3
row0 0 2 0
row1 0 3 0
row2 1 0 0


Based on the suggestion by divakar I've come up with the following:

import pandas as pd
a = [[1, 2, 0, None],
[1, 3, 0,0],
[1, 0, 0,0],
[1, 0, 0,0],
[1, 0, 0,0],
[0, 0, 0,1]]

df = pd.DataFrame(a, columns=['col1','col2','col3','col4'])
df.fillna(value=0,inplace=True) # Get rid of non numeric items
a

[[1, 2, 0, None],
[1, 3, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 0, 1]]

# Return index of first occurrence of maximum over requested axis.
# 0 or 'index' for row-wise, 1 or 'columns' for column-wise
df.idxmax(1)

0 col2
1 col2
2 col1
3 col1
4 col1
5 col4
dtype: object


Create a matrix to mask values

numberOfRows = df.shape[0]
df_mask= pd.DataFrame(columns=df.columns,index=np.arange(0, numberOfRows))
df_test.fillna(value=0,inplace=True) # Get rid of non numeric items

# Add mask entries
for row,col in enumerate(df.idxmax(1)):
df_mask.loc[row,col] = 1

df_result=df*df_mask
df_result

col1 col2 col3 col4
0 0 2 0 0.0
1 0 3 0 0.0
2 1 0 0 0.0
3 1 0 0 0.0
4 1 0 0 0.0
5 0 0 0 1.0

Answer

Working at NumPy level, here's one vectorized approach using broadcasting -

np.where(((a!=0).cumsum(1).argmax(1))[:,None] == np.arange(a.shape[1]),a,0)

Sample run -

In [7]: a # NumPy array
Out[7]: 
array([[1, 2, 0],
       [1, 3, 0],
       [1, 0, 0]])

In [8]: np.where(((a!=0).cumsum(1).argmax(1))[:,None] == np.arange(a.shape[1]),a,0)
Out[8]: 
array([[0, 2, 0],
       [0, 3, 0],
       [1, 0, 0]])

Porting it to pandas, we would have an implementation like so -

df = df*(df.values.argmax(1)[:,None] == np.arange(df.shape[1]))

Sample run -

In [70]: df
Out[70]: 
   col1  col2  col3  col4
0     1     2     0   0.0
1     1     3     0   0.0
2     1     0     0   0.0
3     1     0     0   0.0
4     1     0     0   0.0
5     0     0     0   1.0

In [71]: df*(df.values.argmax(1)[:,None] == np.arange(df.shape[1]))
Out[71]: 
   col1  col2  col3  col4
0     0     2     0   0.0
1     0     3     0   0.0
2     1     0     0   0.0
3     1     0     0   0.0
4     1     0     0   0.0
5     0     0     0   1.0