aidsj aidsj - 3 months ago 8
Python Question

Determine column value based on 2 other columns

There are 2 columns, Label1 and Label2. Both are cluster labels produced by different methods.

   Label1  Label2
0       0    1024
1       1    1024
2       2    1025
3       3    1026
4       3    1027
5       4    1028


I want to get a final cluster label based on these 2 columns. Comparing rows, as long as either of the two labels is the same, the rows are in the same cluster.

For example: row 0 and row 1 have Label2 in common, and row 3 and row 4 have Label1 in common, so row 0 and row 1 are in the same group and row 3 and row 4 are in the same group. The result I'd like to have is:

   Label1  Label2  Cluster ID
0       0    1024           0
1       1    1024           0
2       2    1025           1
3       3    1026           2
4       3    1027           2
5       4    1028           3


What's the best way to do this?
Any help would be appreciated.

Edited: I think I didn't give a good example. Actually, the labels are not necessarily in any order:

   Label1  Label2
0       0    1024
1       1    1023
2       2    1025
3       3    1024
4       3    1027
5       4    1022

BPL BPL
Answer

Not sure I've understood your question correctly, but here's a possible way to identify clusters:

import pandas as pd
import collections

df = pd.DataFrame(
    {'Label1': [0, 1, 2, 3, 3, 4], 'Label2': [1024, 1024, 1025, 1026, 1027, 1028]})
df['Cluster ID'] = [0] * 6

# Labels that occur more than once in each column. Kept as lists so that
# .index() works further down (dict.iteritems() and an indexable keys()
# exist only in Python 2).
counter1 = [k for k, v in collections.Counter(df['Label1']).items() if v > 1]
counter2 = [k for k, v in collections.Counter(df['Label2']).items() if v > 1]

len1 = len(counter1)
len2 = len(counter2)
index_cluster = len1 + len2

for index, row in df.iterrows():
    if row['Label2'] in counter2:
        df.loc[index, 'Cluster ID'] = counter2.index(row['Label2'])
    elif row['Label1'] in counter1:
        df.loc[index, 'Cluster ID'] = counter1.index(row['Label1']) + len2
    else:
        df.loc[index, 'Cluster ID'] = index_cluster
        index_cluster += 1

print(df)
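
The loop above checks Label2 first and only then Label1, so it does not merge groups that are linked through a chain of rows, as in the edited example where row 0 and row 3 share Label2 and row 3 and row 4 share Label1. A minimal sketch of a more general approach is a union-find over the row positions. This is a different technique from the counters above, and `cluster_ids` is a hypothetical helper name. It assumes the two columns have the same length:

```python
def cluster_ids(label1, label2):
    """Return one cluster ID per row. Rows that share a Label1 or a Label2
    value, directly or through a chain of other rows, get the same ID."""
    label1, label2 = list(label1), list(label2)
    parent = list(range(len(label1)))  # union-find forest over row positions

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Link every row to the first row seen with the same label value.
    for labels in (label1, label2):
        first_seen = {}
        for row, value in enumerate(labels):
            if value in first_seen:
                union(first_seen[value], row)
            else:
                first_seen[value] = row

    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(row), len(ids)) for row in range(len(parent))]


# Edited example from the question: rows 0, 3 and 4 end up together.
print(cluster_ids([0, 1, 2, 3, 3, 4], [1024, 1023, 1025, 1024, 1027, 1022]))
# → [0, 1, 2, 0, 0, 3]
```

With the DataFrame from above, this could be used as df['Cluster ID'] = cluster_ids(df['Label1'], df['Label2']).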