Paul Stoner Paul Stoner - 13 days ago 8
Python Question

pandas Goup and Rename DataFrame Columns names

Working with housing value data from Zillow, my dataframe has a series of column names in the form yyyy-mm, example

2001-01|2000-02|2000-03|2000-04|...|2016-08


Each of these columns contains the average home value for that period for a number of city, state combinations

RegionName State 2000-01 2000-02 2000-03
Philadelphia PA 53100 53200 53400


I need to group the columns but their respective quarter, provide a mean value and return this to a new dataframe. So for the example provided I would have something like the following

RegionName State 2000q1 2000q2 2000q3 2000q4
Philadelphia PA 53233.33 ... ... ...


I have no idea how to begin to tackle this. I am able to pull off the column names, get the last two characters, and map those to a dictionary containing the quarter number, but that is it. Not sure how to reassemble the data

Any help is greatly appreciated

Answer

This is similar to MaxU's answer, but shows a way to groupby both the date columns and the non-date columns (RegionName, State).

import numpy as np
import pandas as pd
df = pd.DataFrame({'2000-01': [53100], '2000-02': [53200], '2000-03': [53400], 'RegionName': ['Philadelphia'], 'State': ['PA']})

melted = pd.melt(df, id_vars=['RegionName', 'State'], var_name='date')
melted['date'] = pd.PeriodIndex(melted['date'], freq='Q')
result = melted.groupby(['RegionName', 'State', 'date']).mean()
result = result['value'].unstack('date')

yields

date                2000Q1
RegionName   State        
Philadelphia PA      53233

First, use pd.melt to coalesce all the date columns into a single column:

import numpy as np
import pandas as pd
df = pd.DataFrame({'2000-01': [53100], '2000-02': [53200], '2000-03': [53400], 'RegionName': ['Philadelphia'], 'State': ['PA']})

melted = pd.melt(df, id_vars=['RegionName', 'State'], var_name='date')
#      RegionName State     date  value
# 0  Philadelphia    PA  2000-01  53100
# 1  Philadelphia    PA  2000-02  53200
# 2  Philadelphia    PA  2000-03  53400

Next, use pd.PeriodIndex to convert the date (strings?) to pd.Periods. Notice that pd.PeriodIndex can convert different date strings to the same Period:

melted['date'] = pd.PeriodIndex(melted['date'], freq='Q')
#      RegionName State   date  value
# 0  Philadelphia    PA 2000Q1  53100
# 1  Philadelphia    PA 2000Q1  53200
# 2  Philadelphia    PA 2000Q1  53400

Finally, use groupby/mean to group together rows with the same RegionName, State and date and compute the mean for each group:

result = melted.groupby(['RegionName', 'State', 'date']).mean()
#                            value
# RegionName   State date         
# Philadelphia PA    2000Q1  53233

If you stop here, your DataFrame will be tidy (PDF) -- each row represents one "observation". This is often the best form for further computation. (It isn't by accident that first step in this solution was to convert the original DataFrame to tidy format -- notice that melted also has the property that each row represents one observation.)

However, if you'd like, you could move the date index level into separate columns:

result = result['value'].unstack('date')
Comments