jmhead jmhead - 1 month ago 9
Python Question

Pivoting a Pandas Dataframe containing strings - 'No numeric types to aggregate' error

There is a good number of questions about this error, but after looking around I'm still not able to find/wrap my mind around a solution yet.
I'm trying to pivot a data frame with strings, to get some row data to become columns, but not working out so far.

Shape of my df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 515932 entries, 0 to 515931
Data columns (total 5 columns):
id 515932 non-null object
cc_contact_id 515932 non-null object
Network_Name 515932 non-null object
question 515932 non-null object
response_answer 515932 non-null object
dtypes: object(5)
memory usage: 23.6+ MB


Sample format

id contact_id question response_answer
16 137519 2206 State Ca
17 137520 2206 State Ca
18 137521 2206 State Ca
19 137522 2206 State Ca
20 137523 2208 City Lancaster
21 137524 2208 City Lancaster
22 137525 2208 City Lancaster
23 137526 2208 City Lancaster
24 137527 2208 Trip_End Location Home
25 137528 2208 Trip_End Location Home
26 137529 2208 Trip_End Location Home
27 137530 2208 Trip_End Location Home


What I would like to pivot to

id contact_id State City Trip_End Location
16 137519 2206 Ca None None None
20 137523 2208 None Lancaster None None
24 137527 2208 None None None Home
etc. etc.


Where the question values become the columns, with the response_answer being in it's corresponding column, and retaining the ids

What I have tried

unified_df = pd.DataFrame(unified_data, columns=target_table_headers, dtype=object)

pivot_table = unified_df.pivot_table('response_answer',['id','cc_contact_id'],'question')
# OR
pivot_table = unified_df.pivot_table('response_answer','question')


DataError: No numeric types to aggregate

What is the way to pivot a data frame with string values?

Answer

The default aggfunc in pivot_table is np.sum and it doesn't know what to do with strings and you haven't indicated what the index should be properly. Trying something like:

pivot_table = unified_df.pivot_table(index=['id', 'contact_id'],
                                     columns='question', 
                                     values='response_answer',
                                     aggfunc=lambda x: ' '.join(x))

This explicitly sets one row per id, contact_id pair and pivots the set of response_answer values on question. The aggfunc just assures that if you have multiple answers to the same question in the raw data that we just concatenate them together with spaces. The syntax of pivot_table might vary depending on your pandas version.

Here's a quick example:

In [24]: import pandas as pd

In [25]: import random

In [26]: df = pd.DataFrame({'id':[100*random.randint(10, 50) for _ in range(100)], 'question': [str(random.randint(0,3)) for _ in range(100)], 'response': [str(random.randint(100,120)) for _ in range(100)]})

In [27]: df.head()
Out[27]:
     id question response
0  3100        1      116
1  4500        2      113
2  5000        1      120
3  3900        2      103
4  4300        0      117

In [28]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
id          100 non-null int64
question    100 non-null object
response    100 non-null object
dtypes: int64(1), object(2)
memory usage: 3.1+ KB

In [29]: df.pivot_table(index='id', columns='question', values='response', aggfunc=lambda x: ' '.join(x)).head()
Out[29]:
question        0        1    2        3
id
1000      110 120      NaN  100      NaN
1100          NaN  106 108  104      NaN
1200      104 113      119  NaN      101
1300          102      NaN  116  108 120
1400          NaN      NaN  116      NaN
Comments