user99889 - 1 year ago 597

Python Question

I am using the Gensim HDP module on a set of documents.

    >>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
    >>> topics = hdp.print_topics(topics=-1, topn=20)
    >>> len(topics)
    150
    >>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
    >>> topics = hdp.print_topics(topics=-1, topn=20)
    >>> len(topics)
    150
    >>> len(corpusA)
    1113
    >>> len(corpusB)
    17

Why is the number of topics independent of corpus length?

Answer

@user3907335 is exactly correct here: HDP will calculate as many topics as the assigned truncation level (the `T` parameter of gensim's `HdpModel`, which defaults to 150), regardless of corpus size. *However*, many of these topics may have essentially zero probability of occurring. To help with this in my own work, I wrote a small function that gives a rough estimate of the probability weight associated with each topic by summing the weights of its top `topn` words. Note that this is a rough metric only: *it does not account for the probability associated with each word*. Even so, it gives a pretty good indication of which topics are meaningful and which aren't:

```
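import numpy as np
import pandas as pd


def topic_prob_extractor(hdp=None, topn=None):
    # Written against the older gensim API used in the question, where
    # show_topics() returns strings such as 'topic 0: 0.004*word + ...'.
    # Newer gensim releases appear to name these arguments num_topics/num_words.
    topic_list = hdp.show_topics(topics=-1, topn=topn)
    # The topic id is the integer between 'topic ' and ':'.
    topics = [int(x.split(':')[0].split(' ')[1]) for x in topic_list]
    split_list = [x.split(' ') for x in topic_list]
    weights = []
    for lst in split_list:
        # Collect the numeric weight in front of each 'weight*word' term.
        sub_list = []
        for entry in lst:
            if '*' in entry:
                sub_list.append(float(entry.split('*')[0]))
        weights.append(np.asarray(sub_list))
    # Sum the per-word weights to get one rough weight per topic.
    sums = [np.sum(x) for x in weights]
    return pd.DataFrame({'topic_id': topics, 'weight': sums})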
```
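
To make the parsing step concrete, here is a minimal sketch that applies the same string splitting to a single hand-written topic string, without needing a fitted model. The helper name `sum_topic_weights` and the sample string are made up for illustration; both assume the older `topic N: weight*word + ...` output format that the function above relies on.

```python
def sum_topic_weights(topic_string):
    """Return (topic_id, summed_weight) for one formatted topic string.

    Hypothetical helper: assumes the 'topic N: w*word + w*word' layout.
    """
    # 'topic 3: ...' -> 'topic 3' -> '3' -> 3
    topic_id = int(topic_string.split(':')[0].split(' ')[1])
    # Keep only the 'weight*word' tokens and take the number before '*'.
    weights = [float(entry.split('*')[0])
               for entry in topic_string.split(' ')
               if '*' in entry]
    return topic_id, sum(weights)


# Made-up example string, not real model output.
sample = 'topic 3: 0.25*apple + 0.125*pear + 0.0625*plum'
topic_id, weight = sum_topic_weights(sample)
print(topic_id, weight)
```

If your gensim version returns a different structure from `show_topics` (for example tuples instead of plain strings), the splitting logic would need to be adapted accordingly.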

I assume that you already know how to fit an HDP model. Once you have an HDP model fitted by gensim, call the function as follows:

```
topic_weights = topic_prob_extractor(hdp, 500)
```
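
Once you have the resulting DataFrame, one possible way to separate meaningful topics from near-empty ones is to sort by weight and look at each topic's share of the total. The sketch below uses a hand-made DataFrame with the same `topic_id` and `weight` columns; the values and the 10% cut-off (`min_share`) are arbitrary assumptions for illustration, not recommendations.

```python
import pandas as pd

# Stand-in for the output of topic_prob_extractor(); values are made up.
topic_weights = pd.DataFrame({
    'topic_id': [0, 1, 2, 3],
    'weight': [0.50, 0.30, 0.15, 0.05],
})

# Rank topics from heaviest to lightest and compute each one's share.
ranked = topic_weights.sort_values('weight', ascending=False).reset_index(drop=True)
ranked['share'] = ranked['weight'] / ranked['weight'].sum()

# Keep topics whose share exceeds an arbitrary, hypothetical threshold.
min_share = 0.10
meaningful = ranked[ranked['share'] > min_share]
print(meaningful)
```

A sensible threshold depends on your corpus and on the `topn` you passed, so it is worth inspecting the sorted weights before settling on one.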