alvas - 6 months ago 20

Python Question

I have a bunch of texts that are classified into categories: each document is tagged 0, 1 or 2, with a probability for each tag.

```
[ "this is a foo bar",
  "bar bar black sheep",
  "sheep is an animal",
  "foo foo bar bar",
  "bar bar sheep sheep" ]
```

The previous tool in the pipeline returns a list of lists of tuples, where each element of the outer list corresponds to one document. All I can rely on is that each document is tagged 0, 1 or 2, with probabilities like this:

```
[ [(0,0.3), (1,0.5), (2,0.1)],
  [(0,0.5), (1,0.3), (2,0.3)],
  [(0,0.4), (1,0.4), (2,0.5)],
  [(0,0.3), (1,0.7), (2,0.2)],
  [(0,0.2), (1,0.6), (2,0.1)] ]
```

I need to find which tag is most probable for each list of tuples and group the documents accordingly, to get:

```
[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] ,
  [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
  [[(0,0.4), (1,0.4), (2,0.5)]] ]
```

As another example:

`[in]`

```
[ [(0,0.7), (1,0.2), (2,0.4)],
  [(0,0.5), (1,0.9), (2,0.3)],
  [(0,0.3), (1,0.8), (2,0.4)],
  [(0,0.8), (1,0.2), (2,0.2)],
  [(0,0.1), (1,0.7), (2,0.5)] ]
```

`[out]`

```
[[[(0,0.7), (1,0.2), (2,0.4)],
  [(0,0.8), (1,0.2), (2,0.2)]] ,
 [[(0,0.5), (1,0.9), (2,0.3)],
  [(0,0.1), (1,0.7), (2,0.5)],
  [(0,0.3), (1,0.8), (2,0.4)]] ,
 []]
```

How can I cluster a list of lists of tuples with tags and probabilities? Is there something in `numpy`, `scipy`, `sklearn` or `NLTK` that does this?

Let's assume that the number of clusters is fixed but the cluster sizes are not.

I've only tried finding the maximum value for each centroid, but that only gives me the first value in each cluster:

```
instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]

# Find centroid: the largest (tag, probability) tuple for each tag.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]

# Index of the first document that holds each of those tuples.
c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]

print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c3_centroid])
```

`[out]`

```
[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.4), (1, 0.4), (2, 0.5)]
```

Answer

If I understood correctly, this is what you wanted.

```
import numpy as np

N_TYPES = 3
instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]
instream = np.array(instream)  # shape: (n_docs, N_TYPES, 2)
# this removes document tags because we only consider probabilities here
values = instream[:, :, 1]
# determine the cluster of each document by using maximum probability
belongs_to = values.argmax(axis=1)
# construct clusters of indices to your instream
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]
# apply the indices to obtain full output
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]
```

Output `out`:

```
[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],
[[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]],
[[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]],
[[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],
[[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
```

I used `numpy` arrays because they enable nice searching and indexing. For example, the expression `(belongs_to == 1).nonzero()[0]` returns the array of indices into the array `belongs_to` where the value is `1`. An example of indexing is `instream[cluster_indices[2]]`.
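To make the mask-and-index step concrete, here is a minimal sketch run on the second example from the question. The names `tag1_indices` and `tag2_indices` are made up for illustration, and it assumes every document lists its tags in the order 0, 1, 2, so that the position of the largest probability equals the tag.

```python
import numpy as np

# Second example from the question: one row of (tag, probability) pairs per document.
instream = np.array([
    [(0, 0.7), (1, 0.2), (2, 0.4)],
    [(0, 0.5), (1, 0.9), (2, 0.3)],
    [(0, 0.3), (1, 0.8), (2, 0.4)],
    [(0, 0.8), (1, 0.2), (2, 0.2)],
    [(0, 0.1), (1, 0.7), (2, 0.5)],
])

# Most probable tag per document (index 1 of each pair is the probability).
# Assumption: tags appear in order 0, 1, 2, so the argmax position is the tag.
belongs_to = instream[:, :, 1].argmax(axis=1)
print(belongs_to.tolist())  # [0, 1, 1, 0, 1]

# Boolean mask -> positions of the documents whose most probable tag is 1.
tag1_indices = (belongs_to == 1).nonzero()[0]
print(tag1_indices.tolist())  # [1, 2, 4]

# No document has tag 2 as its most probable tag, so the index array is empty
# and indexing with it yields an empty cluster.
tag2_indices = (belongs_to == 2).nonzero()[0]
print(instream[tag2_indices].shape)  # (0, 3, 2)
```

Note that the documents in each cluster come back in their original input order, and that an empty cluster becomes `[]` after `.tolist()`, which matches the empty third cluster in the question's expected output.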