Andres Andres - 6 months ago 15
Python Question

Pandas pivot or groupby for dynamically generated columns

I have a dataframe with sales information in a supermarket. Each row in the dataframe represents an item, with several characteristics as columns. The original DataFrame is something like this:

In [1]: import pandas as pd
my_data = [{'ticket_number' : '001', 'item' : 'tomato', 'ticket_price' : '21'},
{'ticket_number' : '001', 'item' : 'candy', 'ticket_price' : '21'},
{'ticket_number' : '001', 'item' : 'soup', 'ticket_price' : '21'},
{'ticket_number' : '002', 'item' : 'soup', 'ticket_price' : '12'},
{'ticket_number' : '002', 'item' : 'cola', 'ticket_price' : '12'},
{'ticket_number' : '003', 'item' : 'beef', 'ticket_price' : '56'},
{'ticket_number' : '003', 'item' : 'tomato', 'ticket_price' : '56'},
{'ticket_number' : '003', 'item' : 'pork', 'ticket_price' : '56'}]
df = pd.DataFrame(my_data)

In [2]: df
Out [2]:
ticket_number ticket_price item
0 001 21 tomato
1 001 21 candy
2 001 21 soup
3 002 12 soup
4 002 12 cola
5 003 56 beef
6 003 56 tomato
7 003 56 pork


I need a DataFrame where each row represents a ticket with all the items bought and the ticket price as columns. In this example:

ticket_number ticket_price item1 item2 item3
0 001 21 tomato candy soup
1 002 12 soup cola
2 003 56 beef tomato pork


I tried using
df.groupby(ticket_number).item.value_counts()
, but that does not create new columns. I have never used
pivot_table
, maybe it is useful.

Any help would be very appreciated.

Thanks!

Answer

One possible way to use groupby to make lists of it that can then be turned into columns:

In [24]: res = df.groupby(['ticket_number', 'ticket_price'])['item'].apply(list).apply(pd.Series)

In [25]: res
Out[25]:
                                 0       1     2
ticket_number ticket_price
001           21            tomato   candy  soup
002           12              soup    cola   NaN
003           56              beef  tomato  pork

Then, after cleaning up this result a bit:

In [27]: res.columns = ['item' + str(i + 1) for i in res.columns]

In [29]: res.reset_index()
Out[29]:
  ticket_number ticket_price   item1   item2 item3
0           001           21  tomato   candy  soup
1           002           12    soup    cola   NaN
2           003           56    beef  tomato  pork

Another possible way to create a new column which numbers the items in each group with groupby.cumcount:

In [38]: df['item_number'] = df.groupby('ticket_number').cumcount()

In [39]: df
Out[39]:
     item ticket_number ticket_price  item_number
0  tomato           001           21            0
1   candy           001           21            1
2    soup           001           21            2
3    soup           002           12            0
4    cola           002           12            1
5    beef           003           56            0
6  tomato           003           56            1
7    pork           003           56            2

And then do some reshaping:

In [40]: df.set_index(['ticket_number', 'ticket_price', 'item_number']).unstack(-1)
Out[40]:
                              item
item_number                      0       1     2
ticket_number ticket_price
001           21            tomato   candy  soup
002           12              soup    cola   NaN
003           56              beef  tomato  pork

From here, with some cleaning of the columns names, you can achieve the same as above.

The reshaping step with set_index and untack could also be done with pivot_table: df.pivot_table(columns=['item_number'], index=['ticket_number', 'ticket _price'], values='item', aggfunc='first')