Mayank Jain - 3 years ago
Python Question

Converting pandas series to iterable of iterables

I am trying to use MultiLabelBinarizer in sklearn. I have a pandas Series and I want to feed it as input to MultiLabelBinarizer's fit function. However, fit needs its input in the form of an iterable of iterables, and I am not sure how to convert a pandas Series to the required type.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = pd.read_csv("somecsvFile")
y = pd.DataFrame(data['class'])

mlb = MultiLabelBinarizer()
y = mlb.fit(???)


I tried converting it to a NumPy array and tried using the iter function of pandas, but nothing seems to work.

Please suggest a way to do this.

Thanks

Edit1: Output of print(data['class'].head(10)) is:

0 func
1 func
2 func
3 non func
4 func
5 func
6 non func
7 non func
8 non func
9 func
Name: status_group, dtype: object

Answer Source

To work around the fact that MultiLabelBinarizer's fit needs an iterable of iterables, split each string into a list of whitespace-separated tokens, so that every row becomes a list of labels:

In [8]: df
Out[8]:
      class
0      func
1      func
2      func
3  non func
4      func
5      func
6  non func
7  non func
8  non func
9      func

In [10]: import pandas as pd
    ...: from sklearn.preprocessing import MultiLabelBinarizer

In [11]: y = df['class'].str.split(expand=False)   # <--- NOTE !!!

In [12]: mlb = MultiLabelBinarizer()
    ...: y = mlb.fit_transform(y)
    ...:

In [13]: y
Out[13]:
array([[1, 0],
       [1, 0],
       [1, 0],
       [1, 1],
       [1, 0],
       [1, 0],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 0]])
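
Note that str.split breaks "non func" into the two labels "non" and "func", which is why those rows come out as [1, 1] above. If each string should instead count as one label, a minimal sketch (assuming a small made-up sample in place of the real CSV) is to wrap every value in a one-element list:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample data standing in for the question's 'class' column.
df = pd.DataFrame({"class": ["func", "func", "non func", "func", "non func"]})

# Wrap each string in a one-element list: every row is then an iterable
# of labels, and "non func" stays a single label instead of being split.
labels = [[value] for value in df["class"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

print(mlb.classes_)  # the labels, in the same order as the columns of y
print(y)
```

With only one label per row, this is really a single-label problem, so MultiLabelBinarizer is more machinery than strictly needed here.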

UPDATE: as proposed by @unutbu, you can use pd.get_dummies() instead, which treats each whole string as a single category:

In [21]: pd.get_dummies(df['class'])
Out[21]:
   func  non func
0     1         0
1     1         0
2     1         0
3     0         1
4     1         0
5     1         0
6     0         1
7     0         1
8     0         1
9     1         0
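
Depending on the pandas version, get_dummies may return boolean columns rather than the 0/1 integers shown above. A small sketch, assuming a made-up three-row Series, that asks for integers explicitly through the dtype parameter:

```python
import pandas as pd

# Hypothetical sample data standing in for df['class'].
s = pd.Series(["func", "non func", "func"], name="class")

# dtype=int requests 0/1 integer columns regardless of the version default.
dummies = pd.get_dummies(s, dtype=int)

print(dummies)
```

Calling dummies.to_numpy() afterwards gives a plain NumPy array if a later step expects one.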