Phill Donn - 7 months ago 153

Python Question

I recently watched a video by Owen Zhang, a rank 1 Kaggle competitor:

https://youtu.be/LgLcfZjNF44

where he explains a technique for encoding categorical features as numerical values, called leave-one-out encoding. For a categorical feature, he associates a value with each observation: the average of the response over all other observations in the same category.

I've been trying to implement this strategy in Python using pandas. My code produces the correct result, but because my data set has tens of millions of rows, it is very slow.

If someone could suggest a faster solution, I'd be very grateful.

This is my code so far:

```
def categ2numeric(data, train=True):
    def f(series):
        indexes = series.index.values
        pomseries = pd.Series()
        for i, index in enumerate(indexes):
            pom = np.delete(indexes, i)
            pomseries.loc[index] = series[pom].mean()
        series = pomseries
        return series

    if train:
        categ = data.groupby(by=['Cliente_ID'])['Demanda_uni_equil'].apply(f)
```

And I need to turn this Series:

```
159812     28.0
464556     83.0
717223     45.0
1043801    21.0
1152917     7.0
Name: 26, dtype: float32
```

to this:

```
159812     39.00
464556     25.25
717223     34.75
1043801    40.75
1152917    44.25
dtype: float64
```

Mathematically, the element with index 159812 becomes the average of all the other elements:

39 = (83 + 45 + 21 + 7) / 4

Answer

Replace each element of the Series with the difference between the sum of the Series and that element, then divide by the length of the Series minus 1. Assuming `s` is your Series:

```
s = (s.sum() - s)/(len(s) - 1)
```

The resulting output:

```
159812 39.00
464556 25.25
717223 34.75
1043801 40.75
1152917 44.25
```
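The same arithmetic can probably be applied to the whole DataFrame at once, which avoids calling a Python function per group. The following is a minimal sketch, not a tested solution for the full data set: it assumes the column names `Cliente_ID` and `Demanda_uni_equil` from the question, and the helper name `leave_one_out_mean` and the toy data are made up for illustration.

```python
import pandas as pd


def leave_one_out_mean(df, key, target):
    """Per-row mean of `target` over the other rows that share the same `key`."""
    grouped = df.groupby(key)[target]
    group_sum = grouped.transform("sum")      # group total, repeated on every row
    group_count = grouped.transform("count")  # group size, repeated on every row
    # Remove each row's own value, then divide by the number of remaining rows.
    # A category with a single row gives 0 / 0, which comes out as NaN.
    return (group_sum - df[target]) / (group_count - 1)


# Hypothetical toy data: customer 26 reuses the five values from the question.
df = pd.DataFrame({
    "Cliente_ID": [26, 26, 26, 26, 26, 7, 7],
    "Demanda_uni_equil": [28.0, 83.0, 45.0, 21.0, 7.0, 10.0, 20.0],
})
df["loo_mean"] = leave_one_out_mean(df, "Cliente_ID", "Demanda_uni_equil")
print(df)
```

Categories that occur only once have no other observations to average, so they end up as `NaN` and need a fallback of your choice, for example the overall mean of the response.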