Neil - 4 months ago 39

Python Question

I have a dataframe with latitude and longitude pairs.

Here is what my dataframe looks like.

`order_lat order_long`

0 19.111841 72.910729

1 19.111342 72.908387

2 19.111342 72.908387

3 19.137815 72.914085

4 19.119677 72.905081

5 19.119677 72.905081

6 19.119677 72.905081

7 19.120217 72.907121

8 19.120217 72.907121

9 19.119677 72.905081

10 19.119677 72.905081

11 19.119677 72.905081

12 19.111860 72.911346

13 19.111860 72.911346

14 19.119677 72.905081

15 19.119677 72.905081

16 19.119677 72.905081

17 19.137815 72.914085

18 19.115380 72.909144

19 19.115380 72.909144

20 19.116168 72.909573

21 19.119677 72.905081

22 19.137815 72.914085

23 19.137815 72.914085

24 19.112955 72.910102

25 19.112955 72.910102

26 19.112955 72.910102

27 19.119677 72.905081

28 19.119677 72.905081

29 19.115380 72.909144

30 19.119677 72.905081

31 19.119677 72.905081

32 19.119677 72.905081

33 19.119677 72.905081

34 19.119677 72.905081

35 19.111860 72.911346

36 19.111841 72.910729

37 19.131674 72.918510

38 19.119677 72.905081

39 19.111860 72.911346

40 19.111860 72.911346

41 19.111841 72.910729

42 19.111841 72.910729

43 19.111841 72.910729

44 19.115380 72.909144

45 19.116625 72.909185

46 19.115671 72.908985

47 19.119677 72.905081

48 19.119677 72.905081

49 19.119677 72.905081

50 19.116183 72.909646

51 19.113827 72.893833

52 19.119677 72.905081

53 19.114100 72.894985

54 19.107491 72.901760

55 19.119677 72.905081

I want to cluster the points that are close to each other (within 200 meters). Here is my distance matrix:

`from scipy.spatial.distance import pdist, squareform`

`distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))`

array([[ 0. , 0.2522482 , 0.2522482 , ..., 1.67313071,

1.05925366, 1.05420922],

[ 0.2522482 , 0. , 0. , ..., 1.44111548,

0.81742536, 0.98978355],

[ 0.2522482 , 0. , 0. , ..., 1.44111548,

0.81742536, 0.98978355],

...,

[ 1.67313071, 1.44111548, 1.44111548, ..., 0. ,

1.02310118, 1.22871515],

[ 1.05925366, 0.81742536, 0.81742536, ..., 1.02310118,

0. , 1.39923529],

[ 1.05420922, 0.98978355, 0.98978355, ..., 1.22871515,

1.39923529, 0. ]])

Then I apply the DBSCAN clustering algorithm to the distance matrix.

`from sklearn.cluster import DBSCAN`

`db = DBSCAN(eps=2,min_samples=5)`

`y_db = db.fit_predict(distance_matrix)`

I don't know how to choose the `eps` and `min_samples` values. It puts points that are much too far apart (approx. 2 km) into one cluster. Is it because it calculates Euclidean distance while clustering? Please help.

Answer

DBSCAN is *meant* to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java); scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see `sklearn.neighbors.NearestNeighbors`). But apparently, you can afford to precompute pairwise distances, so this is not (yet) an issue.

However, *you did not read the documentation carefully enough*, and your assumption that DBSCAN treats its input as a distance matrix by default is wrong:

```
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)
```

uses **Euclidean distance on the distance matrix rows**, which obviously does not make any sense.
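To make the difference concrete, here is a minimal sketch on made-up data (the 1-D `positions`, `eps=0.15` and `min_samples=2` are illustrative assumptions, not values from the question). It runs DBSCAN on the same distance matrix twice, once with the default metric and once with `metric="precomputed"`:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1-D positions in km: two tight groups about 5 km apart.
positions = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])

# True pairwise distances between the positions.
distance_matrix = np.abs(positions[:, None] - positions[None, :])

# Default metric: each ROW of the matrix is treated as a 6-dimensional
# feature vector, and Euclidean distances between rows are computed.
labels_default = DBSCAN(eps=0.15, min_samples=2).fit_predict(distance_matrix)

# metric="precomputed": the matrix entries are used directly as distances.
labels_precomputed = DBSCAN(
    eps=0.15, min_samples=2, metric="precomputed"
).fit_predict(distance_matrix)

print("default:    ", labels_default)
print("precomputed:", labels_precomputed)
```

On this toy input the two calls should disagree: only the precomputed variant compares `eps` against the actual distances between points.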

See the documentation of `DBSCAN` (emphasis added):

`class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)`

metric: string, or callable

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter.

**If metric is “precomputed”, X is assumed to be a distance matrix and must be square.** X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

Similarly for `fit_predict`:

X: array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

In other words, you need to do

```
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
```
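Putting the pieces together, a sketch of the whole pipeline might look like the following. Several things here are assumptions rather than part of the question: the helper `haversine_matrix_km` is a hypothetical NumPy stand-in for the `haversine` function (whose import is not shown), the Earth radius constant is an approximation, and `min_samples=2` is chosen only because the sample is tiny. Note that the distance matrix in the question appears to be in kilometers (the first entry is about 0.25), so a 200 meter radius would correspond to `eps=0.2`, not `eps=2`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumption)


def haversine_matrix_km(coords_deg):
    """Pairwise great-circle distances in km for rows of (lat, lon) in degrees."""
    lat = np.radians(coords_deg[:, 0])[:, None]
    lon = np.radians(coords_deg[:, 1])[:, None]
    dlat = lat - lat.T
    dlon = lon - lon.T
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point excursions outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# A handful of (order_lat, order_long) pairs copied from the question.
coords = np.array([
    [19.111841, 72.910729],
    [19.111342, 72.908387],
    [19.111860, 72.911346],
    [19.137815, 72.914085],
    [19.137815, 72.914085],
    [19.119677, 72.905081],
    [19.120217, 72.907121],
])

distance_matrix = haversine_matrix_km(coords)

# eps is in the same unit as the matrix (km), so 0.2 means 200 meters.
db = DBSCAN(eps=0.2, min_samples=2, metric="precomputed")
labels = db.fit_predict(distance_matrix)

print(labels)  # -1 marks points that DBSCAN treats as noise
```

For the full dataframe, `min_samples` would need to be tuned to how many orders should count as a dense spot; the value above is only meant to make the small sample produce visible clusters.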