Tyler Wood - 29 days ago
Python Question

What's the best way to transform Array values in one column to columns of the original DataFrame?

I have a table where one of the columns is an array of binary features; a feature name appears in the array only when that feature is present.

I'd like to train a logistic model on these rows, but I can't get the data into the required format, where each feature value is its own column with a 1 or 0 value.

Example:

id  feature_values
1   ['HasPaws', 'DoesBark', 'CanFetch']
2   ['HasPaws', 'CanClimb', 'DoesMeow']


I'd like to get it to the format of

id  HasPaws  DoesBark  CanFetch  CanClimb  DoesMeow
1   1        1         1         0         0
2   1        0         0         1         1


It seems like there would be some built-in functionality to accomplish this, but I don't know what this transformation is called, so I can't search for it effectively on my own.

Answer

You can first convert the lists to columns and then use pd.get_dummies():

In [115]: df
Out[115]:
   id                 feature_values
0   1  [HasPaws, DoesBark, CanFetch]
1   2  [HasPaws, CanClimb, DoesMeow]

In [126]: (pd.get_dummies(df.set_index('id').feature_values.apply(pd.Series))
     ...:    .rename(columns=lambda x: x.split('_')[1])
     ...:    .reset_index()
     ...: )
     ...:
Out[126]:
   id  HasPaws  CanClimb  DoesBark  CanFetch  DoesMeow
0   1        1         0         1         1         0
1   2        1         1         0         0         1
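
This kind of transformation is usually called one-hot (or dummy/indicator) encoding. One caveat about the one-liner above: it builds its column names from the list position (0_HasPaws, 1_DoesBark, ...), so a feature that appears at different positions in different rows would likely end up as duplicate columns after the rename. A minimal sketch of an alternative that avoids this, assuming pandas 0.25 or newer for Series.explode(), is below. The helper name lists_to_indicator_columns and its parameters are made up for illustration, and the astype(int) call is there because recent pandas versions return booleans from get_dummies().

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "id": [1, 2],
    "feature_values": [
        ["HasPaws", "DoesBark", "CanFetch"],
        ["HasPaws", "CanClimb", "DoesMeow"],
    ],
})


def lists_to_indicator_columns(frame, id_col="id", list_col="feature_values"):
    """Return one 0/1 column per distinct list element (hypothetical helper)."""
    # One row per (id, feature) pair; the id is carried along in the index.
    long_form = frame.set_index(id_col)[list_col].explode()
    # One indicator column per distinct feature, then collapse back to one
    # row per id: max() is 1 if the feature occurred anywhere in that id's list.
    dummies = pd.get_dummies(long_form).groupby(level=0).max().astype(int)
    return dummies.reset_index()


encoded = lists_to_indicator_columns(df)
print(encoded)
```

The feature columns come out in alphabetical order rather than the order shown in the question; if the order matters, select them explicitly, e.g. encoded[["id", "HasPaws", "DoesBark", "CanFetch", "CanClimb", "DoesMeow"]].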