Bhimasen Bhimasen - 9 months ago 65
Python Question

Python: memory issue in document clustering using sklearn

I am using sklearn's TfidfVectorizer for document clustering. I have 20 million texts for which I want to compute clusters, but calculating the TF-IDF matrix takes too much time and the system gets stuck.

Is there any technique to deal with this problem? Is there an alternative method for this in any Python module?

Answer Source

Well, a corpus of 20 million texts is very large. Without meticulous, comprehensive preprocessing or good computing instances (i.e. a lot of memory and good CPUs), the TF-IDF calculation may take a lot of time.

What you can do:

  • Limit your text corpus to a few hundred thousand samples (let's say 200,000 texts). Having far more texts might not introduce more variance than a much smaller (but reasonable) dataset.

  • Preprocess your texts as much as you can. A basic approach would be: tokenize your texts, remove stop words, apply word stemming, and use n-grams carefully. Once you've done all these steps, see how much you've reduced the size of your vocabulary. It should be much smaller than the original one.

If your dataset is not too big, these steps might help you compute the TF-IDF much faster.
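
The two suggestions above can be sketched roughly as follows. This is only an illustration under some assumptions: the helper names reservoir_sample and build_tfidf are made up for this example, the texts are assumed to be English, and the thresholds (min_df, max_df, float32 values) are arbitrary starting points rather than recommended settings. Stemming is left out because it needs an extra library such as NLTK. Reservoir sampling is used here so the 20 million texts can be streamed from disk instead of being loaded at once; that is a choice made for this sketch, not something the answer prescribes.

```python
import random

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def reservoir_sample(texts, k, seed=0):
    """Draw a uniform random sample of up to k texts from any iterable.

    Only k texts are held in memory at a time, so `texts` can be a
    generator that streams documents from disk.
    """
    rng = random.Random(seed)
    sample = []
    for i, text in enumerate(texts):
        if i < k:
            sample.append(text)
        else:
            # Keep the new text with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = text
    return sample


def build_tfidf(texts, ngram_range=(1, 1), min_df=2, max_df=0.5,
                max_features=None):
    """Fit a TF-IDF matrix with a deliberately restricted vocabulary."""
    vectorizer = TfidfVectorizer(
        lowercase=True,
        stop_words="english",       # assumption: the corpus is English
        ngram_range=ngram_range,    # unigrams only unless widened
        min_df=min_df,              # drop terms seen in fewer than min_df docs
        max_df=max_df,              # drop terms seen in more than max_df of docs
        max_features=max_features,  # optional hard cap on vocabulary size
        dtype=np.float32,           # half the memory of the float64 default
    )
    matrix = vectorizer.fit_transform(texts)
    return vectorizer, matrix


if __name__ == "__main__":
    # Tiny stand-in corpus; replace with a generator over the real texts.
    corpus = (f"document number {i} about topic {i % 7}" for i in range(10_000))
    subset = reservoir_sample(corpus, k=2_000)
    vectorizer, matrix = build_tfidf(subset)
    print("documents:", matrix.shape[0])
    print("vocabulary size:", len(vectorizer.vocabulary_))
```

Comparing the vocabulary size with and without the stop-word and document-frequency filters gives a quick idea of how much the preprocessing actually shrinks the matrix before scaling up the sample.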