user3329648 user3329648 - 4 months ago 30
Python Question

Most common value from multiple colums in Pandas

I have a series of data in an irregular number of columns and I need to determine the most common value from split sections across multiple columns using pandas. An example of what I mean is if I had information of what kind of cheese my coworkers had with their lunches each day:

Idx Name Cheese1 Cheese2 Cheese3
0 Evan Gouda NaN NaN
1 John Cheddar Havarti Blue
2 Evan Cheddar Gouda NaN
3 John Havarti Swiss NaN


I'm looking for some kind of functionality that will give the resulting pivot table:

Name Cheese Pct
Evan Gouda .66
John Havarti .4


I also don't know how many columns will need to be included each time I run the script, only that they're all of the format "Cheese"+index. If John shows up with four cheeses the next day, I will need to add a fourth column and the analysis script needs to be able to handle that.

Answer
import io
import pandas as pd

data = io.StringIO("""\
Idx Name   Cheese1   Cheese2   Cheese3
0   Evan   Gouda     NaN       NaN
1   John   Cheddar   Havarti   Blue
2   Evan   Cheddar   Gouda     NaN
3   John   Havarti   Swiss     NaN
4   Rick   NaN       NaN       NaN
""")
df = pd.read_csv(data, delim_whitespace=True)

def top_cheese(g):
    cheese_cols = [col for col in g.columns if col.startswith('Cheese')]
    try:
        out = (g[cheese_cols].stack().value_counts(normalize=True)
                             .reset_index().iloc[0])
        out.index = ['Cheese', 'Pct']
        return out
    except IndexError:
        return pd.Series({'Cheese': 'None', 'Pct': 0})


output = df.groupby('Name').apply(top_cheese)
print(output)

Output:

       Cheese       Pct
Name                   
Evan    Gouda  0.666667
John  Havarti  0.400000
Rick     None  0.000000
Comments