alvas - 3 months ago
Python Question

How can I cluster a list of lists of (tag, probability) tuples? - python

I have a collection of text documents that have been classified into categories; each document is tagged 0, 1 or 2, with a probability for each tag.

[ "this is a foo bar",
"bar bar black sheep",
"sheep is an animal",
"foo foo bar bar",
"bar bar sheep sheep" ]


The previous tool in the pipeline returns a list of lists of tuples like the one below, where each element of the outer list corresponds to a document. All I have to work with is that each document is tagged 0, 1 or 2, together with the probability of each tag:

[ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]


I need to find the most probable tag for each list of tuples and group the lists accordingly, to get:

[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] ,
[[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
[[(0,0.4), (1,0.4), (2,0.5)]] ]


As another example:

[in]:

[ [(0,0.7), (1,0.2), (2,0.4)],
[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.3), (1,0.8), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)],
[(0,0.1), (1,0.7), (2,0.5)] ]


[out]:

[[[(0,0.7), (1,0.2), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)]] ,

[[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.1), (1,0.7), (2,0.5)],
[(0,0.3), (1,0.8), (2,0.4)]] ,

[]]


NOTE: I do NOT have access to the original text when the data comes to my part of the pipeline.

How can I cluster a list of lists of (tag, probability) tuples? Is there something in numpy, scipy, sklearn, NLTK or any other Python ML suite that does this?

Assume the number of clusters is fixed but the cluster sizes are not.

I've only tried finding the maximum value for each tag (the "centroid"), but that only gives me the top element of each cluster:

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]

# Find centroid.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]

c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]

print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c3_centroid])


[out] (top element in each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.4), (1, 0.4), (2, 0.5)]

Answer

If I understood correctly, this is what you wanted.

import numpy as np

N_TYPES = 3

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]
instream = np.array(instream)

# drop the tags; only the probabilities matter here
# (list comprehensions instead of map() so this also runs on Python 3)
values = [[prob for _, prob in doc] for doc in instream]

# assign each document to the tag with the maximum probability
belongs_to = np.array([np.argmax(probs) for probs in values])

# construct clusters of indices to your instream
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]

# apply the indices to obtain full output
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]   

The resulting out:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],

 [[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]],
  [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]],
  [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],

 [[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
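
One side effect of converting instream to a numpy array is visible above: the tuples come back as lists and the integer tags as floats. If that matters, a plain-Python sketch of the same "highest probability wins" grouping could look like this. The function name group_by_best_tag and its n_tags parameter are made up for illustration, and it assumes every tag is an integer in range(n_tags). Ties go to whichever tag appears first in the document, which is also how np.argmax behaves.

```python
def group_by_best_tag(docs, n_tags):
    """Group documents by the tag that has the highest probability."""
    clusters = [[] for _ in range(n_tags)]
    for doc in docs:
        # max() returns the first (tag, probability) pair with the top probability
        best_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[best_tag].append(doc)
    return clusters


instream = [[(0, 0.3), (1, 0.5), (2, 0.1)],
            [(0, 0.5), (1, 0.3), (2, 0.3)],
            [(0, 0.4), (1, 0.4), (2, 0.5)],
            [(0, 0.3), (1, 0.7), (2, 0.2)],
            [(0, 0.2), (1, 0.6), (2, 0.1)]]

out = group_by_best_tag(instream, 3)
for tag, docs in enumerate(out):
    print(tag, docs)
```

A tag that never wins simply keeps its empty list, so the number of clusters stays fixed while their sizes vary.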

I used numpy arrays because they make searching and indexing convenient. For example, the expression (belongs_to == 1).nonzero()[0] returns the indices of belongs_to at which the value is 1. An example of indexing with such an index array is instream[cluster_indices[2]].
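
To make the mask-and-index step concrete, here is a small standalone sketch. The belongs_to values are the labels the code above produces for the sample instream; the docs array of placeholder names is only there for illustration and is not part of the original data.

```python
import numpy as np

# Cluster label of each document for the sample instream
belongs_to = np.array([1, 0, 2, 1, 1])

mask = (belongs_to == 1)      # boolean array, True where the label is 1
indices = mask.nonzero()[0]   # positions of the True entries
print(mask)
print(indices)

# Hypothetical stand-ins for the documents, to show indexing with an index array
docs = np.array(["doc0", "doc1", "doc2", "doc3", "doc4"])
print(docs[indices])          # selects the documents at those positions
```

nonzero() returns a tuple with one index array per dimension, which is why the [0] is needed for a one-dimensional array.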