codingknob codingknob - 4 months ago 16
Python Question

how to unstack (or pivot?) in pandas

I have a dataframe that looks like the following:

import pandas as pd
datelisttemp = pd.date_range('1/1/2014', periods=3, freq='D')
s = list(datelisttemp)*3
s.sort()
df = pd.DataFrame({'BORDER':['GERMANY','FRANCE','ITALY','GERMANY','FRANCE','ITALY','GERMANY','FRANCE','ITALY' ], 'HOUR1':[2 ,2 ,2 ,4 ,4 ,4 ,6 ,6, 6],'HOUR2':[3 ,3 ,3, 5 ,5 ,5, 7, 7, 7], 'HOUR3':[8 ,8 ,8, 12 ,12 ,12, 99, 99, 99]}, index=s)


This gives me:

Out[458]: df

BORDER HOUR1 HOUR2 HOUR3
2014-01-01 GERMANY 2 3 8
2014-01-01 FRANCE 2 3 8
2014-01-01 ITALY 2 3 8
2014-01-02 GERMANY 4 5 12
2014-01-02 FRANCE 4 5 12
2014-01-02 ITALY 4 5 12
2014-01-03 GERMANY 6 7 99
2014-01-03 FRANCE 6 7 99
2014-01-03 ITALY 6 7 99


I want the final dataframe to look something like:

HOUR GERMANY FRANCE ITALY
2014-01-01 1 2 2 2
2014-01-01 2 3 3 3
2014-01-01 3 8 8 8
2014-01-02 1 4 4 4
2014-01-02 2 5 5 5
2014-01-02 3 12 12 12
2014-01-03 1 6 6 6
2014-01-03 2 7 7 7
2014-01-03 3 99 99 99


I've done the following but I'm not quite there:

df['date_col'] = df.index

df2 = melt(df, id_vars=['date_col','BORDER'])
#Can I keep the same index after melt or do I have to set an index like below?
df2.set_index(['date_col', 'variable'], inplace=True, drop=True)
df2 = df2.sort()


df

Out[465]: df2

BORDER value
date_col variable
2014-01-01 HOUR1 GERMANY 2
HOUR1 FRANCE 2
HOUR1 ITALY 2
HOUR2 GERMANY 3
HOUR2 FRANCE 3
HOUR2 ITALY 3
HOUR3 GERMANY 8
HOUR3 FRANCE 8
HOUR3 ITALY 8
2014-01-02 HOUR1 GERMANY 4
HOUR1 FRANCE 4
HOUR1 ITALY 4
HOUR2 GERMANY 5
HOUR2 FRANCE 5
HOUR2 ITALY 5
HOUR3 GERMANY 12
HOUR3 FRANCE 12
HOUR3 ITALY 12
2014-01-03 HOUR1 GERMANY 6
HOUR1 FRANCE 6
HOUR1 ITALY 6
HOUR2 GERMANY 7
HOUR2 FRANCE 7
HOUR2 ITALY 7
HOUR3 GERMANY 99
HOUR3 FRANCE 99
HOUR3 ITALY 99


I thought I could unstack df2 to get something that resembles my final dataframe but I get all sorts of errors. I have also tried to pivot this dataframe but can't quite get what I want.

Answer

We want values (e.g. 'GERMANY') to become column names, and column names (e.g. 'HOUR1') to become values -- a swap of sorts.

The stack method turns column names into index values, and the unstack method turns index values into column names.

So by shifting the values into the index, we can use stack and unstack to perform the swap.

import pandas as pd

datelisttemp = pd.date_range('1/1/2014', periods=3, freq='D')
s = list(datelisttemp)*3
s.sort()
df = pd.DataFrame({'BORDER':['GERMANY','FRANCE','ITALY','GERMANY','FRANCE','ITALY','GERMANY','FRANCE','ITALY' ], 'HOUR1':[2 ,2 ,2 ,4 ,4 ,4 ,6 ,6, 6],'HOUR2':[3 ,3 ,3, 5 ,5 ,5, 7, 7, 7], 'HOUR3':[8 ,8 ,8, 12 ,12 ,12, 99, 99, 99]}, index=s)

df = df.set_index(['BORDER'], append=True)
df.columns.name = 'HOUR'
df = df.unstack('BORDER')
df = df.stack('HOUR')
df = df.reset_index('HOUR')
df['HOUR'] = df['HOUR'].str.replace('HOUR', '').astype('int')
print(df)

yields

BORDER      HOUR  FRANCE  GERMANY  ITALY
2014-01-01     1       2        2      2
2014-01-01     2       3        3      3
2014-01-01     3       8        8      8
2014-01-02     1       4        4      4
2014-01-02     2       5        5      5
2014-01-02     3      12       12     12
2014-01-03     1       6        6      6
2014-01-03     2       7        7      7
2014-01-03     3      99       99     99