KengoTokukawa - 3 months ago
Python Question

How to apply sklearn linear regression to string variables

I am going to predict the box office of a movie using linear regression.
I have some training data including the actors and directors. This is my data:

Director1|Actor1|300 million
Director2|Actor2|500 million

I am going to encode the directors and actors using integers.

1|1|300 million
2|2|500 million

Which means that
X=[[1,1],[2,2]] y=[300,500]

Does that work?


You cannot use categorical variables in linear regression like that. Linear regression treats every variable as numerical. Therefore, if you code Director1 as 1 and Director2 as 2, linear regression will try to find a relationship based on that coding scheme. It will assume Director2 is twice the size of Director1. In reality, those numbers don't mean anything. You could just as well code them as 143 and 9879, and that should make no difference, because the codes have no numerical meaning. To make sure linear regression treats them correctly, you need to use dummy variables.
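To illustrate the point, here is a minimal sketch that fits the same three movies under two arbitrary integer codings of the directors. The box-office figures and both codings are made up for the example; the only claim is that the fitted predictions depend on which integers were chosen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical box office (in millions) for three movies,
# each by a different director.
y = np.array([300.0, 500.0, 100.0])

# Two arbitrary integer codings of the same three directors.
coding_a = np.array([[1], [2], [3]])
coding_b = np.array([[3], [1], [2]])

# Fit one model per coding and predict for the same three movies.
pred_a = LinearRegression().fit(coding_a, y).predict(coding_a)
pred_b = LinearRegression().fit(coding_b, y).predict(coding_b)

print(pred_a)
print(pred_b)
```

The rows are the same movies in the same order, so any difference between the two printed arrays comes purely from the labelling.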

With dummy variables, you have a variable for every category level. For example, if you have 3 directors, you will have 3 variables: D1, D2 and D3. D1 will have the value 1 if the corresponding movie was directed by Director1, and 0 otherwise; D2 will have the value 1 if the movie was directed by Director2, and 0 otherwise, and so on. So with a set of values D2 D1 D2 D3 D1 D2, your dummy variables will be:

    D1 D2 D3
D2  0  1  0
D1  1  0  0
D2  0  1  0
D3  0  0  1
D1  1  0  0
D2  0  1  0

In linear regression, in order to avoid multicollinearity we use only n-1 of these variables where n is the number of categories (number of directors for this example). One of the directors will be selected as the base, and will be represented by the constant in the regression model. It doesn't matter which one. For example, if you exclude D3, you will know the movie was directed by Director3 if D1=0 and D2=0. You don't need to specify D3=1.
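The table and the n-1 rule above can be reproduced with pandas, as a rough sketch. Using pandas.get_dummies here is my choice for illustration, not something the answer depends on, and dropping D3 is arbitrary, as noted above.

```python
import pandas as pd

# The same sequence of directors as in the table above.
directors = pd.Series(["D2", "D1", "D2", "D3", "D1", "D2"])

# One 0/1 column per director: D1, D2, D3.
full = pd.get_dummies(directors, dtype=int)

# Keep only n-1 columns; D3 becomes the base category.
reduced = full.drop(columns="D3")

print(full)
print(reduced)
```

In the reduced table, the row for the Director3 movie is all zeros, which is how the model recognises the base category.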

In scikit-learn, this transformation is done with OneHotEncoder. The following example is from the scikit-learn documentation:

You have three categorical variables: Gender, Region and Browser. Gender has 2 levels: ["male", "female"], Region has three levels: ["from Europe", "from US", "from Asia"] and Browser has four levels: ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Assume they are coded with zero-based numbers. So [0, 1, 2] means a male from US who uses Safari.

>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'float'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

When you call enc.fit, scikit-learn infers the number of levels for each variable. For an observation like [0, 1, 3], calling enc.transform returns its dummy variables. Note that the resulting array's length is 2 + 3 + 4 = 9: the first two entries are for gender (if male, the first one is 1), the next three for region, and the last four for browser.
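Applied to the data in the question, the whole thing might look like the sketch below. It assumes a recent scikit-learn release, where OneHotEncoder accepts string categories directly and has a drop option for the n-1 coding; the third training row and the movie being predicted are made up for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Training rows shaped like the question's data; the third row is hypothetical.
X_raw = [
    ["Director1", "Actor1"],
    ["Director2", "Actor2"],
    ["Director1", "Actor2"],
]
y = [300.0, 500.0, 400.0]  # box office in millions

# drop="first" keeps n-1 dummy columns per variable (the base level is dropped).
encoder = OneHotEncoder(drop="first")
X = encoder.fit_transform(X_raw).toarray()

model = LinearRegression().fit(X, y)

# Predict for a director/actor combination not seen together in training.
new_movie = encoder.transform([["Director2", "Actor1"]]).toarray()
pred = model.predict(new_movie)
print(X)
print(pred)
```

The encoder must be fitted once on the training data and then reused for new rows, so that the dummy columns line up between fitting and prediction.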