selwyth selwyth - 6 months ago 232
Python Question

Iterate over pandas columns with row-wise comparisons

I have many columns in a dataframe, and I want to compare the values in each column to a specific column. For example, say I want to, for every column in this dataframe, sum the cases where both the column value and the label are equal to 1:

col1 | col2 | col3 | ... | label
1 | 0 | 0 | ... | 1
0 | 0 | 1 | ... | 0


When I try to do this with something like
df.apply(lambda x: x.label==1, axis=1)
, I can select the label column with
x.label
, but how do I select the column itself?

I can do this using a for loop that iterates over the column names, but am wondering if there's a more pandas-like way to do it without using a loop.

results = []
for col in df.columns:
val = len(df[(df[col]==1) & (df.label==1)])
results.append(val)

Answer

Just filter by label and sum what is left:

df.loc[df['label'] == 1].sum()

Example:

df = pd.DataFrame(np.random.randint(2, size=(10, 4)),
                  columns=['col1', 'col2', 'col3', 'label'])
print(df)

   col1  col2  col3  label
0     0     0     1      1
1     1     1     0      0
2     1     1     0      0
3     0     0     0      0
4     0     0     1      0
5     0     0     0      1
6     1     0     1      1
7     0     1     1      0
8     0     0     0      0
9     0     0     0      0

results = []
for col in df.columns:
    val = len(df[(df[col]==1) & (df.label==1)])
    results.append(val)
results

[1, 0, 2, 3]

df.loc[df['label'] == 1].sum().tolist()

[1, 0, 2, 3]

EDIT:

If not everything is 0 or 1 but you still want to sum the cases where both the column value and the label are equal to 1, after filtering by label make everyting which is not 0 or 1 to be 0 and sum what is left:

df = pd.DataFrame(np.random.randint(3, size=(10, 4)),
                  columns=['col1', 'col2', 'col3', 'label'])
print(df)

   col1  col2  col3  label
0     0     0     2      1
1     1     0     0      2
2     2     1     0      2
3     1     1     1      0
4     0     0     2      1
5     2     2     1      2
6     0     2     1      1
7     1     1     0      0
8     1     0     0      2
9     0     2     1      2

results = []
for col in df.columns:
    val = len(df[(df[col]==1) & (df.label==1)])
    results.append(val)
results

[0, 0, 1, 3]

df.loc[df['label'] == 1][df == 1].sum().fillna(0).tolist()

[0.0, 0.0, 1.0, 3.0]