Phill Donn - 1 year ago 495
Python Question

# Pandas implementation of leave-one-out encoding for categorical features

I recently watched a video by Owen Zhang, a Kaggle rank 1 competitor:
https://youtu.be/LgLcfZjNF44
In it he explains a technique for encoding categorical features as numbers, called leave-one-out encoding. Each observation is assigned the average of the response over all other observations in the same category.

I've been trying to implement this strategy in Python using pandas. My code works, but my data set has tens of millions of rows and it runs very slowly.
If someone could suggest a faster solution, I'd be very grateful.

This is my code so far:

``````
import numpy as np
import pandas as pd

def categ2numeric(data, train=True):
    def f(series):
        # For each row, average the other rows of the same group
        indexes = series.index.values
        pomseries = pd.Series(dtype='float64')
        for i, index in enumerate(indexes):
            pom = np.delete(indexes, i)  # every label except the current one
            pomseries.loc[index] = series[pom].mean()
        series = pomseries
        return series

    if train:
        categ = data.groupby(by=['Cliente_ID'])['Demanda_uni_equil'].apply(f)
``````

And I need to turn this Series:

``````
159812     28.0
464556     83.0
717223     45.0
1043801    21.0
1152917     7.0
Name: 26, dtype: float32
``````

to this:

``````
159812     39.00
464556     25.25
717223     34.75
1043801    40.75
1152917    44.25
dtype: float64
``````

Put mathematically, the element with index 159812 becomes the average of all the other elements:

39 = (83 + 45 + 21 + 7) / 4

Answer

Replace each element of the Series with the difference between the sum of the Series and that element, divided by the length of the Series minus 1. Assuming `s` is your Series:

``````
s = (s.sum() - s) / (len(s) - 1)
``````

The resulting output:

``````
159812     39.00
464556     25.25
717223     34.75
1043801    40.75
1152917    44.25
``````
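
The one-liner above works on a single Series, i.e. one category. For the whole DataFrame in the question, the same idea could probably be applied per group with `groupby(...).transform`, which avoids the Python-level loop. The following is only a sketch under some assumptions: the helper name `leave_one_out_mean` and the output column `loo_mean` are made up for illustration, the extra row with `Cliente_ID` 31 and index 2000000 is invented test data, and returning NaN for a category with a single row is one possible choice, not something the question specifies.

```python
import numpy as np
import pandas as pd


def leave_one_out_mean(df, group_col, target_col):
    """Per row: mean of target_col over the OTHER rows of the same group.

    Hypothetical helper, not part of pandas.
    """
    grouped = df.groupby(group_col)[target_col]
    group_sum = grouped.transform("sum")      # group total, aligned to each row
    group_count = grouped.transform("count")  # group size, aligned to each row
    # A group with one row has no "other" rows; make the denominator NaN
    # there instead of dividing by zero (assumed behaviour).
    denom = (group_count - 1).where(group_count > 1)
    return (group_sum - df[target_col]) / denom


# The five values from the question (Cliente_ID 26) plus one invented
# single-row category to show the edge case.
df = pd.DataFrame(
    {
        "Cliente_ID": [26, 26, 26, 26, 26, 31],
        "Demanda_uni_equil": [28.0, 83.0, 45.0, 21.0, 7.0, 10.0],
    },
    index=[159812, 464556, 717223, 1043801, 1152917, 2000000],
)
df["loo_mean"] = leave_one_out_mean(df, "Cliente_ID", "Demanda_uni_equil")
print(df["loo_mean"])
# 159812     39.00
# 464556     25.25
# 717223     34.75
# 1043801    40.75
# 1152917    44.25
# 2000000      NaN
```

Both `transform` calls return a result aligned with the original index, so the subtraction and division are plain vectorized column operations.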