user5368737 - 1 year ago 106

Python Question

I'm trying to implement a k-nearest neighbour classifier in Python, and so I want to calculate the Euclidean distance. I have a dataset that I have converted into a big numpy array

`[[ 0. 0. 4. ..., 1. 0. 1.]`

[ 0. 0. 5. ..., 0. 0. 1.]

[ 0. 0. 14. ..., 16. 9. 1.]

...,

[ 0. 0. 3. ..., 2. 0. 3.]

[ 0. 1. 7. ..., 0. 0. 3.]

[ 0. 2. 10. ..., 0. 0. 3.]]

where the last element of each row indicates the class. So when calculating the Euclidean distance, I obviously don't want to include the last element. I thought I could do the following

`for row in dataset:`

distance = euclidean_distance(vector, row[:dataset.shape[1] - 1])

but that still includes the last element

`print row`

>>> [[ 0. 0. 4. ..., 1. 0. 1.]]

print row[:dataset.shape[1] - 1]

>>> [[ 0. 0. 4. ..., 1. 0. 1.]]

as you can see both are the same.

Answer Source

You can subset the data using numpy slicing. If you find yourself iterating over a numpy array, stop and try to find a method that takes advantage of the vectorized nature of numpy operations.

Assuming your array is called `arr`

:

```
data_points = arr[:,:-1]
classes = arr[:,-1]
```

To find the distance between a 1d array and all of the rows of a 2d array, you can use to following. It assumes the 1d array is `v`

and the 2d array is `arr`

.

```
dist = np.power(arr - v, 2).sum(axis=1)
```

`dist`

will be a 1d array of distances.

The following function takes a 2d array of numbers and returns the upper-diagonal matrix of pair-wise distances using the given L-x distance measurement (the Euclidean distance measure is the L=2 metric).

```
def pairwise_distance(arr, L=2):
d = arr.shape[0]
out = np.zeros(d)
for f in range(1, d):
out[:-f].ravel()[f::d+1] = np.power(arr[:-f]-arr[f:], L).sum(axis=1)
return np.power(out, 1.0/L)
```