Josh Josh - 3 months ago 26
Python Question

Split pandas dataframe by column variable

I've got a dataframe that I'd like to split by a column variable like the example below:

gender|height|weight
:---- | :--- | :----
male | 42.8 | 157.5
:---- | :--- | :----
male | 41.3 | 165.6
:---- | :--- | :----
female| 48.4 | 144.2


My desired outcome is:

df_male:

gender|height|weight
:---- | :--- | :----
male | 42.8 | 157.5
:---- | :--- | :----
male | 41.3 | 165.6


df_female:

gender|height|weight
:---- | :--- | :----
female| 48.4 | 144.2


The catch is that I'd like to be able to do this with a variable that has anywhere from 5-25 categories.

My thought is that there should be a way to loop over the original dataframe and spit out multiple dataframes but I'm open to all possible solutions

Answer

The following will produce a list containing one dataframe for each value of the gender column:

import io
import pandas as pd

data = io.StringIO('''\
gender height weight
male     42.8  157.5
male     41.3  165.6
female   48.4  144.2
''')
df = pd.read_csv(data, delim_whitespace=True)

dfs = [rows for _, rows in df.groupby('gender')]

dfs is a list of length 2, with the following elements:

print(dfs[0])

#    gender  height  weight
# 2  female    48.4   144.2

print(dfs[1])

#    gender  height  weight
# 0    male    42.8   157.5
# 1    male    41.3   165.6

It might be even better to create a dictionary with keys the distinct values in the gender column and values the dataframes:

dfs = [{gender: rows for gender, rows in df.groupby('gender')}

results in the following dictionary:

{'female':     gender  height  weight
            2  female    48.4   144.2,
 'male':       gender  height  weight
            0    male    42.8   157.5
            1    male    41.3   165.6}