Negative Correlation - 4 months ago 16

Python Question

I have 3 columns in my data set:

**Review**: A product review

**Type**: A category or product type

**Cost**: How much the product cost

This is a multiclass problem, with Type as the target variable. There are 64 different Types of products in this dataset.

**Review** and **Cost** are my two features.

I've split the data into 4 sets with the **Type** variable removed:

    X = data.drop('type', axis = 1)
    y = data.type
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

For **Review**, I am using a `CountVectorizer`:

    vect = CountVectorizer(stop_words = stop)
    X_train_dtm = vect.fit_transform(X_train.review)

Here's where I am stuck!

In order to run the model I need both of my features in the training set. However, since `X_train_dtm` is a sparse matrix, I am unsure how to concatenate my pandas Series (**Cost**) to it.

Any help would be appreciated!!

Example data:

| Review | Cost | Type |
|:-----------------|------------:|:------------:|
| This is a review | 200 | Toy |
| This is a review | 100 | Toy |
| This is a review | 800 | Electronics |
| This is a review | 35 | Home |

After applying tarashypka's solution I was able to add the second feature to `X_train_dtm`. However, I am getting an error when attempting to do the same on the test set:

    from scipy.sparse import hstack

    vect = CountVectorizer(stop_words = stop)
    X_train_dtm = vect.fit_transform(X_train.review)
    prices = X_train.prices.values[:,None]
    X_train_dtm = hstack((X_train_dtm, prices))

    # Works perfectly for the training set above
    # But when I run with the test set I get the following error

    X_test_dtm = vect.transform(X_test)
    prices_test = X_test.prices.values[:,None]
    X_test_dtm = hstack((X_test_dtm, prices_test))

    Traceback (most recent call last):
      File "<ipython-input-10-b2861d63b847>", line 8, in <module>
        X_test_dtm = hstack((X_test_dtm, points_test))
      File "C:\Users\k\Anaconda3\lib\site-packages\scipy\sparse\construct.py", line 464, in hstack
        return bmat([blocks], format=format, dtype=dtype)
      File "C:\Users\k\Anaconda3\lib\site-packages\scipy\sparse\construct.py", line 581, in bmat
        'row dimensions' % i)
    ValueError: blocks[0,:] has incompatible row dimensions

Answer Source

The result of `CountVectorizer`, in your case `X_train_dtm`, is of type `scipy.sparse.csr_matrix`. If you don't want to convert it to a NumPy array, then `scipy.sparse.hstack` is the way to add another column:

```
>>> from scipy.sparse import hstack
>>> prices = X_train['Cost'].values[:, None]  # column vector of shape (n_rows, 1)
>>> X_train_dtm = hstack((X_train_dtm, prices))
```
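
As for the row-dimension error on the test set, a likely cause is that `vect.transform(X_test)` receives the whole DataFrame rather than the review column. Iterating a DataFrame yields its column names, so the vectorizer would return one row per column instead of one row per review, which no longer lines up with the price column. Below is a minimal sketch of the full train/test flow under that assumption. The toy data and the column names `review`, `cost`, and `type` are made up for illustration, and wrapping the cost column in `csr_matrix` is just one way to keep both blocks sparse.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real data set.
data = pd.DataFrame({
    "review": [
        "great toy for kids", "fun toy", "loud speaker", "soft pillow",
        "nice speaker", "warm blanket", "toy broke fast", "comfy blanket",
        "cheap headphones", "cute toy robot", "bright lamp", "tiny radio",
    ],
    "cost": [200, 100, 800, 35, 650, 40, 120, 55, 90, 150, 60, 75],
    "type": [
        "Toy", "Toy", "Electronics", "Home",
        "Electronics", "Home", "Toy", "Home",
        "Electronics", "Toy", "Home", "Electronics",
    ],
})

X = data.drop("type", axis=1)
y = data["type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

vect = CountVectorizer()

# Fit the vocabulary on the training reviews only, then append cost
# as one extra column on the right.
X_train_dtm = vect.fit_transform(X_train["review"])
train_cost = csr_matrix(X_train["cost"].values[:, None])
X_train_full = hstack((X_train_dtm, train_cost)).tocsr()

# Transform the review column (not the whole DataFrame) so that the
# result has one row per test review, matching the cost column.
X_test_dtm = vect.transform(X_test["review"])
test_cost = csr_matrix(X_test["cost"].values[:, None])
X_test_full = hstack((X_test_dtm, test_cost)).tocsr()

print(X_train_full.shape, X_test_full.shape)
```

Both matrices end up with the same number of columns (vocabulary size plus one), so a classifier fitted on `X_train_full` can be applied to `X_test_full`.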