Chichi - 3 months ago 18

Python Question

Regression algorithms seem to be working on features represented as numbers.

For example this table:

It is quite clear how to do regression on this data and predict price.

But for now I want to do regression on data that contain strings:

How can I do regression on this data? Do I have to transform all this string data to numbers manually? I mean if I have to create some encoding rules and according to that rules transform all data to numeric values. Is there any more or less simple way to transform string data to numbers without having to create own encoding rules manually? May be there are some libraries in Python that can be used for that routine? Are there some risks that regression model will be somehow incorrect due to "bad encoding"?

Answer

Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.

Usually there are three possibilities:

- One-Hot encoding for categorical data
- Arbitrary numbers for ordinal data
- Use something like group means for categorical data (e. g. mean prices for city districts).

You have to be carefull to not infuse information you do not have in the application case.

If you have categorical data, you can create dummy variables with 0/1 values for each possible value.

E. g.

```
idx color
0 blue
1 green
2 green
3 red
```

to

```
idx blue green red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
```

This can easily be done with pandas:

```
import pandas as pd
data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
print(pd.get_dummies(data))
```

will result in:

```
color_blue color_green color_red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
```

Create a mapping of your sortable categories, e. g. old < renovated < new → 0, 1, 2

This is also possible with pandas:

```
data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
data['q'] = data['q'].astype('category')
data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
data['q'] = data['q'].cat.codes
print(data['q'])
```

Result:

```
0 0
1 2
2 2
3 1
Name: q, dtype: int8
```

You could use the mean for each category over past (known events).

Say you have a DataFrame with the last known mean prices for cities:

```
prices = pd.DataFrame({
'city': ['A', 'A', 'A', 'B', 'B', 'C'],
'price': [1, 1, 1, 2, 2, 3],
})
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})
print(data.merge(mean_price, on='city', how='left'))
```

Result:

```
city price
0 A 1
1 B 2
2 C 3
3 A 1
4 B 2
5 A 1
```