Dawny33 - 4 months ago
Python Question

Inconsistent labeling in sklearn LabelEncoder?

I have applied a LabelEncoder() on a dataframe, which returns the following:


[image of the encoded dataframe, not preserved]


The order/new_cart entries have different label-encoded numbers, like 70, 64, 71, etc.


Is this inconsistent labeling, or did I do something wrong somewhere?

Answer

LabelEncoder works on one-dimensional arrays. If you apply it to multiple columns, it will be consistent within columns but not across columns.
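
To see the effect, here is a minimal sketch on a made-up two-column frame (the name `toy`, the column names `a` and `b`, and the values are illustrative, not taken from the question). Fitting a fresh encoder per column gives 'y' a different code in each column, because each fit only sees that column's values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: 'y' and 'z' appear in both columns, 'x' only in 'a'.
toy = pd.DataFrame({"a": ["x", "y", "z"], "b": ["y", "z", "z"]})

# Fit a separate encoder on each column.
per_column = pd.DataFrame(
    {name: LabelEncoder().fit_transform(toy[name]) for name in toy.columns}
)

# Codes follow the sorted unique values seen in each individual column.
code_in_a = int(per_column.loc[1, "a"])  # code of 'y' in column a
code_in_b = int(per_column.loc[0, "b"])  # code of 'y' in column b
print(code_in_a, code_in_b)  # prints: 1 0
```

Column `a` contains x, y, z, so 'y' becomes 1 there; column `b` contains only y and z, so the same 'y' becomes 0.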

As a workaround, you can flatten the dataframe into a one-dimensional array, call LabelEncoder on that array, and reshape the result back to the original shape.

Assume this is the dataframe:

df
Out[372]: 
   0  1  2
0  d  d  a
1  c  a  c
2  c  c  b
3  e  e  d
4  d  d  e
5  d  b  e
6  e  e  b
7  a  e  b
8  b  c  c
9  e  a  b

With ravel and then reshaping:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

pd.DataFrame(LabelEncoder().fit_transform(df.values.ravel()).reshape(df.shape), columns=df.columns)
Out[373]: 
   0  1  2
0  3  3  0
1  2  0  2
2  2  2  1
3  4  4  3
4  3  3  4
5  3  1  4
6  4  4  1
7  0  4  1
8  1  2  2
9  4  0  1
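
If you need this more than once, the same ravel-and-reshape idea can be wrapped in a small helper. This is only a sketch: `encode_shared` is a hypothetical name, the toy frame is made up, and it assumes every cell holds a string. Unlike the one-liner above, it also carries the original index over to the result:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_shared(frame):
    """Encode every cell of `frame` with one encoder shared by all columns."""
    encoder = LabelEncoder()
    # Flatten to 1-D, encode, then restore the original 2-D shape.
    flat = frame.to_numpy().ravel()
    codes = encoder.fit_transform(flat).reshape(frame.shape)
    encoded = pd.DataFrame(codes, index=frame.index, columns=frame.columns)
    return encoded, encoder


# Hypothetical toy data with a non-default index.
toy = pd.DataFrame(
    {"a": ["x", "y", "z"], "b": ["y", "z", "z"]}, index=[10, 11, 12]
)
encoded, encoder = encode_shared(toy)
print(encoded)
```

Returning the fitted encoder alongside the frame keeps the label mapping available for later lookups.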

Edit:

If you want to store the labels, you need to save the LabelEncoder object.

le = LabelEncoder()
df2 = pd.DataFrame(le.fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)

Now, le.classes_ gives you the classes in sorted order. The position of each class in this array is its integer code, starting from 0.

le.classes_
Out[390]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)

If you want to access the integer by label, you can construct a dict:

dict(zip(le.classes_, np.arange(len(le.classes_))))
Out[388]: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

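You can do the same with the transform method, without building a dict. Recent scikit-learn versions expect an array-like here, so wrap the label in a list:

le.transform(['c'])
Out[395]: array([2])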
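
Going the other way, from integer code back to label, works with inverse_transform. A minimal sketch, refitting a small encoder on the five labels from the example so the snippet runs on its own; the labels and codes are passed as lists because both methods take array-likes:

```python
from sklearn.preprocessing import LabelEncoder

# Refit on the five labels from the example so this runs standalone.
le = LabelEncoder()
le.fit(["a", "b", "c", "d", "e"])

# Label -> integer code.
codes = le.transform(["c", "a", "e"])

# Integer code -> label, using the same fitted encoder.
labels = le.inverse_transform(codes)

print(codes.tolist(), labels.tolist())  # prints: [2, 0, 4] ['c', 'a', 'e']
```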