nicolekanderson nicolekanderson - 1 year ago 48
Python Question

Pivot groups of row data into columns using Pandas

I've got data I'm reading in as a dataframe from a CSV using Pandas (in Python). The CSV looks basically like the following:

image img1.jpg
date Thursday, May 5
subject 'Unique subject line 1'
image img2.jpg
date Tuesday, May 17
subject 'Unique subject line 2'
image img3.jpg
date Monday, May 9
subject 'Unique subject line 3'

I'd like to get it into a dataframe where each unique grouping is in a single row, with the attributes that are currently listed as row data in column 1 as column names. So something like the following:

image date link subject
img1.jpg Thursday, May 5 'Unique subject line 1'
img2.jpg Tuesday, May 17 'Unique subject line 2'
img3.jpg Monday, May 9 'Unique subject line 3'

I've tried using
and also just creating an empty dataframe with the columns I want, but with both methods, I'm having trouble with either indexing or aggregating. I think this is because I'm not grouping by any one attribute, and I'm not trying to aggregate any numeric data.

It seems like it should be simple enough to reshape the data in this way, but I'm not sure how to define the groupings I want. Is there a way to do this using
, or would it be best to do it another way?

Answer Source

The issue is that, as the data currently is formatted, there isn't a unique way to group the images during a pivot. Any date could be grouped with img1.jpg during a pivot, as there isn't any additional data saying which date should correspond to each image.

To fix this, we just need to add an additional column with the grouping information. Judging by your output, the grouping essentially go in row order; the first 4 rows go together, the next 4 rows go together, etc. To enumerate repeats like this, numpy.repeat is useful, you just need to know the number of images and attributes. Some basic math allows us to get the number of images and number of attributes in general:

# Add an grouping column.
nbr_images = (df['col1'] == 'image').sum()
nbr_attributes = len(df)/nbr_images
df['image_group'] = np.repeat(range(nbr_images), nbr_attributes) 

Now, it's straightforward to pivot:

# Pivot the DataFrame.
pivoted_df = df.pivot(columns='col1', index='image_group', values='col2')

# Clear the index and column name. = None = None

The resulting output:

              date     image         link                subject
0  Thursday, May 5  img1.jpg  Unique subject line 1
1  Tuesday, May 17  img2.jpg  Unique subject line 2
2    Monday, May 9  img3.jpg  Unique subject line 3