lupejuares - 9 months ago 61

Python Question

I ran into this error while working on a project:

ValueError: Found arrays with inconsistent numbers of samples: [878049 884262]

I get it when I run my KNN classifier at the bottom. From what I've read, it happens because my X and y don't have the same number of samples: the shape of X is (878049, 2) and the shape of y is (884262,). How can I fix this error so that they match?

```
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# train and test are DataFrames loaded earlier (not shown here)

# drop features that we won't be using
#train.head()
df = train.drop(['Descript', 'Resolution', 'Address'], axis=1)
df2 = test.drop(['Address'], axis=1)

# trying to see the times during a day a particular crime occurs, for example
# rapes occur more from 12am-4am during the weekend. example below
dow = {
    'Monday': 0,
    'Tuesday': 1,
    'Wednesday': 2,
    'Thursday': 3,
    'Friday': 4,
    'Saturday': 5,
    'Sunday': 6
}
df['DOW'] = df.DayOfWeek.map(dow)

# Add column containing time of day
df['Hour'] = pd.to_datetime(df.Dates).dt.hour

# making my feature column
feature_cols = ['DOW', 'Hour']
X = df[feature_cols]     # built from train
df2['DOW'] = df2.DayOfWeek.map(dow)
y = df2['DOW']           # built from test

# number of rows in X and y don't match
print(X.shape)
print(y.shape)
print(y.head())
print(X.head())

# Knn classifier
k = 5
my_knn_for_cs4661 = KNeighborsClassifier(n_neighbors=k)
my_knn_for_cs4661.fit(X, y)

#KNN (with k=5), Decision Tree accuracy
y_predict = my_knn_for_cs4661.predict(X)
print('\n')
score = accuracy_score(y, y_predict)
print("K=", k, "Has ", score, "Accuracy")

results = pd.DataFrame()
results['actual'] = y
results['prediction'] = y_predict
print(results.head(10))
```

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-5a002c1fd668> in <module>()
      7 k = 5
      8 my_knn_for_cs4661 = KNeighborsClassifier(n_neighbors=k)
----> 9 my_knn_for_cs4661.fit(X, y)
     10 #KNN (with k=5), Decision Tree accuracy
     11 y_predict = my_knn_for_cs4661.predict(X)

C:\Users\Michael\Anaconda3\lib\site-packages\sklearn\neighbors\base.py in fit(self, X, y)
    776         """
    777         if not isinstance(X, (KDTree, BallTree)):
--> 778             X, y = check_X_y(X, y, "csr", multi_output=True)
    779
    780         if y.ndim == 1 or y.ndim == 2 and y.shape[1] == 1:

C:\Users\Michael\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    518         y = y.astype(np.float64)
    519
--> 520     check_consistent_length(X, y)
    521
    522     return X, y

C:\Users\Michael\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    174     if len(uniques) > 1:
    175         raise ValueError("Found arrays with inconsistent numbers of samples: "
--> 176                          "%s" % str(uniques))
    177
    178

ValueError: Found arrays with inconsistent numbers of samples: [878049 884262]
```

Answer

Check the shapes of X and y with X.shape and y.shape. The stack trace says that X and y contain different numbers of instances (samples), which is why the fit function raises the ValueError.

The documentation for fit states:

```
"""Fit the model using X as training data and y as target values

Parameters
----------
X : {array-like, sparse matrix, BallTree, KDTree}
    Training data. If array or matrix, shape [n_samples, n_features],
    or [n_samples, n_samples] if metric='precomputed'.

y : {array-like, sparse matrix}
    Target values, array of float values, shape = [n_samples]
    or [n_samples, n_outputs]
"""
```

In simple terms:

```
X is (878049, 2) -> n_samples = 878049 and n_features = 2
y is (884262,) -> Here, n_samples = 884262
```
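The comparison above can be reproduced before calling fit. The snippet below is a simplified stand-in for the length check named in the traceback (check_consistent_length), not scikit-learn's actual code; the helper name check_same_length and the small demo arrays are assumptions made for illustration.

```python
import numpy as np


def check_same_length(*arrays):
    """Raise ValueError if the arrays differ in their first dimension."""
    # The first entry of each shape is that array's number of samples.
    lengths = [np.shape(a)[0] for a in arrays]
    uniques = np.unique(lengths)
    if len(uniques) > 1:
        raise ValueError(
            "Found arrays with inconsistent numbers of samples: %s" % str(uniques)
        )


# Small stand-ins for the real data: 6 rows of 2 features.
X_demo = np.zeros((6, 2))

check_same_length(X_demo, np.zeros(6))      # 6 samples, 6 targets: passes

try:
    check_same_length(X_demo, np.zeros(8))  # 6 samples, 8 targets: raises
except ValueError as err:
    print(err)
```

Running a check like this on X and y right after building them points at the mismatch earlier than the traceback from inside fit does.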

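You are passing more target values than there are samples. In your code, X is built from df (the training data) while y is built from df2 (the test data), which is why the two counts differ. Since n_samples for X is 878049, y must contain exactly the same number of target values (878049).

To make the lengths match, you can try: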

```
my_knn_for_cs4661.fit(X, y[:878049])
```
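Slicing y makes the lengths agree, but because X comes from df and y from df2 in the question, the truncated labels would not necessarily describe the same rows as X. A minimal sketch of an alternative, taking the features and the target from the same DataFrame, might look like this. The tiny frames and the `Label` column are made up for illustration; the question does not say which column of train is the intended target.

```python
import pandas as pd

# Toy stand-ins for the asker's train/test frames (all values are made up).
train = pd.DataFrame({
    "DOW": [0, 1, 2, 3],
    "Hour": [23, 1, 14, 9],
    "Label": ["a", "b", "a", "b"],  # hypothetical target column
})
test = pd.DataFrame({
    "DOW": [4, 5, 6, 0, 1],
    "Hour": [3, 4, 5, 6, 7],
})

feature_cols = ["DOW", "Hour"]

# Features and target come from the SAME frame, so row i of X
# always belongs with entry i of y.
X = train[feature_cols]
y = train["Label"]

# The test frame is only used for prediction, so its length may differ.
X_test = test[feature_cols]

print(X.shape, y.shape, X_test.shape)
```

With X and y built this way, `my_knn_for_cs4661.fit(X, y)` receives matching lengths, and `my_knn_for_cs4661.predict(X_test)` can then be called on the differently sized test features.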