rubito - 2 months ago
Python Question

How to assign unique identifier to DataFrame row

I have a .csv file that is created from an ndarray after the input data is processed by sklearn.cluster.DBSCAN(). I would like to be able to "tie" every point in a cluster to a unique identifier given by a column in my input file.

This is how I'm reading my input_data:

# Read the input data
import pandas as pd

col_1 = "RL15_LONGITUDE"
col_2 = "RL15_LATITUDE"
data = pd.read_csv("2004_Charley_data.csv")
data = data[[col_1, col_2]].dropna()                    # drop rows with missing coordinates first
coords = data.to_numpy().astype('float16', copy=False)  # .as_matrix() is deprecated; use .to_numpy()


And this is what it looks like:

RecordID Storm RL15_LATITUDE RL15_LONGITUDE
2004_Charley95104-257448 2004_Charley 25.81774 -80.25079
2004_Charley93724-254950 2004_Charley 26.116338 -81.74986
2004_Charley93724-254949 2004_Charley 26.116338 -81.74986
2004_Charley75496-215198 2004_Charley 26.11817 -81.75756


With some help I was able to take the output of DBSCAN and save it as a .csv file like this:

clusters = (pd.concat([pd.DataFrame(c, columns=[col_2, col_1]).assign(cluster=i)
                       for i, c in enumerate(clusters)])
              .reset_index()
              .rename(columns={'index': 'point'})
              .set_index(['cluster', 'point']))
clusters.to_csv('output.csv')


My output now has a multi-index, but is there a way to change the point column to hold the RecordID instead of just a running number?

cluster point RL15_LATITUDE RL15_LONGITUDE
0 0 -81.0625 29.234375
0 1 -81.0625 29.171875
0 2 -81.0625 29.359375
1 0 -81.0625 29.25
1 1 -81.0625 29.21875
1 2 -81.0625 29.25
1 3 -81.0625 29.21875
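One way to keep the identifier through the pipeline is to carry the RecordID column alongside the coordinates rather than converting to a bare array, then attach the labels and index by it. A minimal sketch with made-up RecordID values and pretend DBSCAN labels:

```python
import pandas as pd

# Toy stand-in for the question's data; RecordID values are made up.
df = pd.DataFrame({
    'RecordID': ['A-1', 'A-2', 'B-1', 'B-2'],
    'RL15_LATITUDE': [29.23, 29.17, 28.30, 28.31],
    'RL15_LONGITUDE': [-81.06, -81.06, -81.42, -81.42],
})

labels = [0, 0, 1, 1]  # pretend DBSCAN labels, one per row

# Attach the labels and index by (cluster, RecordID) instead of a counter
out = df.assign(cluster=labels).set_index(['cluster', 'RecordID'])
print(out)
```

Because labels_ from DBSCAN is aligned row-for-row with the input, this works as long as the rows fed to the clusterer are the same rows (and order) as the DataFrame holding the IDs.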

Answer

UPDATE:

Code:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)

cols = ['RL15_LONGITUDE', 'RL15_LATITUDE']
eps_ = 4            # neighborhood radius in kilometers
min_samples_ = 13

# the haversine metric works in radians, so convert the coordinates and
# express eps in radians by dividing by Earth's radius (~6371 km)
db = DBSCAN(eps=eps_ / 6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

df['cluster'] = labels

# DBSCAN labels noise points -1, so keep only clustered rows
res = df[df.cluster >= 0]

print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())

Output:

--------------------------------
                     RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
5    2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
13   2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
18   2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
19   2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
22   2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
..                        ...           ...            ...             ...      ...
812  2004_Charley94314-256039  2004_Charley      28.287827      -81.353285        1
813  2004_Charley93913-255344  2004_Charley      26.532980      -82.194400        7
814  2004_Charley93913-255346  2004_Charley      27.210467      -81.863720        5
815  2004_Charley93913-255357  2004_Charley      26.935550      -82.054447        4
816  2004_Charley93913-255354  2004_Charley      26.935550      -82.054447        4

[688 rows x 5 columns]
--------------------------------
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64
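The eps=4/6371. trick works because the haversine metric returns great-circle distances in radians, so dividing the desired radius in kilometers by Earth's radius (~6371 km) expresses it in the same unit. A quick sanity check of the distance between two points from the output above, using the haversine formula directly (numpy only, no sklearn needed):

```python
import numpy as np

# Two sample points from the output above (degrees), converted to radians.
lat1, lon1 = np.radians([29.228560, -81.034440])
lat2, lon2 = np.radians([29.442692, -81.109528])

# Haversine formula: great-circle distance in radians, the same quantity
# the 'haversine' metric inside DBSCAN computes.
a = (np.sin((lat2 - lat1) / 2) ** 2
     + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
d_rad = 2 * np.arcsin(np.sqrt(a))

d_km = d_rad * 6371.0   # scale by Earth's radius to get kilometers
print(round(d_km, 1))   # roughly 25 km, i.e. outside the 4 km eps
```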

Old answer:

If I understood your code correctly, you can do it this way:

# read the CSV (the file provided is space-delimited, with the RecordID
# split between an unnamed index column and a RecordID column, so rebuild
# an index similar to the one in the question)
fn = r'D:\temp\.data\2004_Charley_data.csv'

df = pd.read_csv(fn, sep=r'\s+', index_col=0)
df.index = df.index.values + df.RecordID.map(str)   # element-wise string concat
df.index.name = 'RecordID'
del df['RecordID']
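The index rebuild relies on element-wise string concatenation between the index values and the stringified RecordID column; a toy illustration with made-up values:

```python
import pandas as pd

# Made-up example: the ID prefix sits in the index, the numeric part in
# the RecordID column, and element-wise string addition glues them together.
df = pd.DataFrame({'RecordID': [196725, 211790]},
                  index=['2004_Charley67146-', '2004_Charley73944-'])
df.index = df.index.values + df.RecordID.map(str)
print(df.index.tolist())
```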

first 10 rows:

In [148]: df.head(10)
Out[148]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE
RecordID
2004_Charley67146-196725  2004_Charley      33.807550      -78.701172
2004_Charley73944-211790  2004_Charley      33.618435      -78.993407
2004_Charley73944-211793  2004_Charley      28.609200      -80.818880
2004_Charley73944-211789  2004_Charley      29.383210      -81.160100
2004_Charley73944-211786  2004_Charley      33.691235      -78.895129
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440
2004_Charley73944-211795  2004_Charley      28.357253      -80.701632
2004_Charley73944-211792  2004_Charley      34.204490      -77.924700
2004_Charley66636-195501  2004_Charley      33.436717      -79.132074
2004_Charley66631-195496  2004_Charley      33.646292      -78.977968

clustering:

cols = ['RL15_LONGITUDE', 'RL15_LATITUDE']

eps_ = 4            # kilometers
min_samples_ = 13

db = DBSCAN(eps=eps_ / 6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

setting the cluster info on our DF: since labels has the same length as our DF, we can simply assign it:

df['cluster'] = labels

filter: keep only those rows where cluster >= 0 (DBSCAN labels noise points as -1):

res = df[df.cluster >= 0]
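A toy check of that filter, with made-up labels, showing that cluster >= 0 keeps exactly the clustered rows:

```python
import pandas as pd

# DBSCAN labels noise points -1, so `cluster >= 0` drops only the noise
# (the labels below are made up for illustration).
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                     'cluster': [0, -1, 1, 0]})
kept = demo[demo.cluster >= 0]
print(len(kept))   # 3 of the 4 rows survive
```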

result:

In [152]: res.head(10)
Out[152]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
RecordID
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1

stats:

In [151]: res.cluster.value_counts()
Out[151]:
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64

If you don't want to have RecordID as the index:

In [153]: res = res.reset_index()

In [154]: res.head(10)
Out[154]:
                   RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
0  2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
1  2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2  2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
3  2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
4  2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
5  2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
6  2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
7  2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
8  2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
9  2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1
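With RecordID available as a column, a CSV like the one in the question can then be written with (cluster, RecordID) as the index. A sketch using a tiny made-up frame in place of res:

```python
import pandas as pd

# Tiny stand-in for `res` after reset_index(); all values are made up.
res = pd.DataFrame({
    'RecordID': ['2004_CharleyA-1', '2004_CharleyB-2'],
    'Storm': ['2004_Charley', '2004_Charley'],
    'RL15_LATITUDE': [29.23, 28.31],
    'RL15_LONGITUDE': [-81.03, -81.42],
    'cluster': [0, 1],
})

# Index by (cluster, RecordID) and keep just the coordinate columns,
# mirroring the layout of the question's output.csv
out = res.set_index(['cluster', 'RecordID'])[['RL15_LATITUDE', 'RL15_LONGITUDE']]
out.to_csv('output.csv')
print(out)
```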