user3939059 user3939059 - 3 months ago 15
Python Question

Convert 160 bit Hash to unique integer ids for machine learning input

I am preparing some data for k-means clustering. At the moment I have the id in 160 bit hash format (this is the format for bitcoin addresses).

d = {'Hash' : pd.Series(['1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6', '3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj', '1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6']),
'X1' : pd.Series([111, 222, 333]),
'X2' : pd.Series([111, 222, 333]),
'X3' : pd.Series([111, 222, 333])
}

df1 = (pd.DataFrame(d))
print(df1)

Hash X1 X2 X3
0 1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6 111 111 111
1 3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj 222 222 222
2 1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6 333 333 333


In order to parse this data into the sklearn.cluster.KMeansĀ¶ algorithm I need to covert the data to np.float or np.array (i think).

Therefore I want to convert the hashes to an integer value, maintaining the relationship across all rows.

This is my attempt:

#REPLACE HASH WITH INT
look_up = {}
count = 0
for index, row in df1.iterrows():
count +=1
if row['Hash'] not in look_up:
look_up[row['Hash']] = count
else:
continue
print(look_up)

{'3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj': 2, '1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6': 1}


At this point I run through each of the dictionary and try to replace the hash value with the new integer value.

for index, row in df1.iterrows():
for address, id_int in look_up.iteritems():
if address == row['Hash']:
df1.set_value(index, row['Hash'], id_int)
print(df1)


Output:

Hash X1 X2 X3 \
0 1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6 111 111 111
1 3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj 222 222 222
2 1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6 333 333 333

1HYKGGzRHDskth2ecKZ2HYvxSvQ1p87m6 3DndG5HuyP8Ep8p3V1i394AUxG4gtgsvoj
0 1.0 NaN
1 NaN 2.0
2 1.0 NaN


The output does not replace the hashed address with the integer value. How can I get the following output:

Expected output:

d = {'ID' : pd.Series([1, 2, 1]),
'X1' : pd.Series([111, 222, 333]),
'X2' : pd.Series([111, 222, 333]),
'X3' : pd.Series([111, 222, 333])
}

df3 = (pd.DataFrame(d))
print(df3)

ID X1 X2 X3
0 1 111 111 111
1 2 222 222 222
2 1 333 333 333


As the hash is the same in row
0
and
2
the same integer id should replace the hash.

Is there a more efficient way of generating these unique ids? At the moment this code take a long time to run.

Answer

You can use sklearn.preprocessing.LabelEncoder:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df1['Hash'])
df1['Hash'] = le.transform(df1['Hash'])

Resulting Output:

   Hash   X1   X2   X3
0     0  111  111  111
1     1  222  222  222
2     0  333  333  333

Also, note that this gives you an easy way to revert back to the original hash by using inverse_transform:

df1['Hash'] = le.inverse_transform(df1['Hash'])
Comments