Xpector - 9 months ago 50

R Question

I'm going to try kNN classification on a dataset that contains, among other features, one called "time of day". In the context of the application, Monday 23:58 is just as close to Tuesday 00:02 as it is to Friday 00:04: what matters is the angle of the hour hand on the clock face. Were it not for that one circular feature, Euclidean distance would do.

So far I'm aware of `class::knn()` and `caret::knn3()`.

A possible alternative would be an extra data-preparation step: either replace the circular feature with two linear ones (an angle θ becomes the point (cos θ, sin θ)), or replicate data points in the training set across the 00:00 boundary so that the boundary vanishes: http://stats.stackexchange.com/questions/51908/nearest-neighbor-algorithm-for-circular-dimensions However, I'd prefer to avoid both replacing one dimension with two and creating copies of data points, if at all possible.
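For reference, the (cos θ, sin θ) encoding could look roughly like the sketch below. It assumes that time of day is stored as minutes since midnight; the helper name `encode_time_of_day` and the column names are made up for illustration.

```r
# Hypothetical helper: map minutes since midnight onto the unit circle.
encode_time_of_day <- function(minutes, period = 24 * 60) {
  theta <- 2 * pi * (minutes %% period) / period
  cbind(tod_cos = cos(theta), tod_sin = sin(theta))
}

minutes <- c(23 * 60 + 58, 2, 12 * 60)  # 23:58, 00:02, 12:00
xy <- encode_time_of_day(minutes)

# Euclidean distance on (cos, sin) is the chord length on the unit circle,
# so 23:58 and 00:02 end up close together while 12:00 is far from both.
dist(xy)
```

The two new columns can then be fed to any Euclidean kNN, at the cost of the extra dimension mentioned above.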

Another way would be to calculate the distance matrix myself and then implement kNN. This sounds very much like reinventing the wheel.

Here is one more reason I'm looking for a way to plug in my own distance metric. While the distance between Tuesday 15:01 and Wednesday 15:02 is 1 minute, Sunday 23:00 UTC (the currency exchange market opening) is considered "far" from any other day's 23:00. Other special cases might appear, too.
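As an illustration of the kind of metric meant here, a circular distance on time of day could be sketched as follows. The function name `circ_dist` and the minutes-since-midnight encoding are assumptions; special cases such as the Sunday 23:00 opening are not handled and would need extra rules on top.

```r
# Hypothetical helper: circular distance between two times of day, in minutes
# (assumed encoding: minutes since midnight). The shorter way around the
# clock face is used.
circ_dist <- function(a, b, period = 24 * 60) {
  d <- abs(a - b) %% period
  pmin(d, period - d)
}

circ_dist(23 * 60 + 58, 2)            # 23:58 vs 00:02 -> 4
circ_dist(15 * 60 + 1, 15 * 60 + 2)   # 15:01 vs 15:02 -> 1
```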

Answer

AFAIK `knn` works a little differently. It is an instance-based method, meaning that the actual model consists of the instances themselves. For each set of test samples, the distance matrix is computed anew. Is computing a distance matrix the stage you are at?

You cannot define kNN by a distance matrix alone. At least I don't know of a way to compute the distances for a given test vector without having the corresponding set of training vectors.
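To make this concrete: what a prediction step needs is a rectangular matrix with one row per test vector and one column per training vector. Below is a minimal sketch, assuming a single circular feature in minutes since midnight plus ordinary numeric features. `cross_dist` and its argument names are hypothetical, and the way the two parts are combined (time scaled to [0, 1], then a Euclidean-style sum of squares) is only one possible choice.

```r
# Hypothetical helper: distances between every test row and every train row.
# x_*: numeric matrices of linear features; t_*: time of day in minutes.
cross_dist <- function(x_test, t_test, x_train, t_train, period = 24 * 60) {
  # squared Euclidean part, accumulated feature by feature
  lin2 <- matrix(0, nrow(x_test), nrow(x_train))
  for (j in seq_len(ncol(x_test))) {
    lin2 <- lin2 + outer(x_test[, j], x_train[, j], "-")^2
  }
  # circular part: shorter way around the clock, scaled to [0, 1]
  d <- abs(outer(t_test, t_train, "-")) %% period
  circ <- pmin(d, period - d) / (period / 2)
  sqrt(lin2 + circ^2)
}

x_train <- matrix(c(0, 0, 1, 1), ncol = 1)
t_train <- c(2, 720, 4, 700)   # 00:02, 12:00, 00:04, 11:40
x_test  <- matrix(0, ncol = 1)
t_test  <- 23 * 60 + 58        # 23:58

dm <- cross_dist(x_test, t_test, x_train, t_train)
dm  # 1 x 4 matrix; the first training instance (00:02) is nearest
```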

If, however, you already have a distance matrix, take a look at the following similar question: "Find K nearest neighbors, starting from a distance matrix".

But the documentation explicitly says:

Usage

`k.nearest.neighbors(i, distance_matrix, k = 5)`

Arguments

`i` is from the numeric class and is a row from the `distance_matrix`.

`distance_matrix` is an n x n matrix.

`k` is from the numeric class and represents the number of neighbours that the function will return.

This imho is similar to:

```
# dm: distance matrix (one row per query point); labels: classes of its columns
apply(dm, 1, function(d) names(which.max(table(labels[order(d)[1:k]]))))
```

Given that you have a distance matrix, you have already reinvented 80% of `knn`.
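The remaining 20% might look roughly like the sketch below. It assumes a test-by-train distance matrix computed with whatever metric you like; `knn_from_dist` is a hypothetical helper, not a function from `class` or `caret`, and unlike `class::knn()` it does not break ties at random.

```r
# Hypothetical helper: classify each row of a test-by-train distance matrix
# by majority vote among its k nearest training instances.
knn_from_dist <- function(dm, train_labels, k = 3) {
  train_labels <- as.factor(train_labels)
  pred <- apply(dm, 1, function(d) {
    nearest <- order(d)[seq_len(k)]
    votes <- table(train_labels[nearest])
    names(votes)[which.max(votes)]  # ties go to the first factor level
  })
  factor(pred, levels = levels(train_labels))
}

# Toy data: 2 test rows, 5 training instances
dm <- rbind(c(0.1, 0.2, 0.9, 0.8, 0.3),
            c(0.9, 0.8, 0.1, 0.2, 0.7))
train_labels <- c("a", "a", "b", "b", "a")

knn_from_dist(dm, train_labels, k = 3)  # predicts "a" for row 1, "b" for row 2
```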