bhaskar jupudi bhaskar jupudi - 1 year ago 79
Python Question

Threadpool in python is not as fast as expected

I'm beginner to python and machine learning. I'm trying to reproduce the code for

using multi-threading. I'm working with yelp dataset to do sentiment analysis using
. This is what I've written so far:

Code snippet:

from multiprocessing.dummy import Pool as ThreadPool
from threading import Thread, current_thread
from functools import partial
data = df['text']
rev = df['stars']

y = []
def product_helper(args):
return featureExtraction(*args)

def featureExtraction(p,t):
temp = [0] * len(bag_of_words)
for word in p.split():
if word in bag_of_words:
temp[bag_of_words.index(word)] += 1

return temp

# function to be mapped over
def calculateParallel(threads):
pool = ThreadPool(threads)
job_args = [(item_a, rev[i]) for i, item_a in enumerate(data)]
l =,job_args)
return l

temp_X = calculateParallel(12)

Here this is just part of code.


has all the reviews and
has the ratings (1 through 5). I'm trying to find the word count vector
using multi-threading.
is a list of some frequent words of choice.


Without multi-threading , I was able to compute the
in around 24 minutes and the above code took 33 mins for a dataset of size 100k reviews. My machine has 128GB of DRAM and 12 cores (6 physical cores with hyperthreading i.e., threads per core=2).

What am I doing wrong here?

vks vks
Answer Source

Your whole code seems CPU Bound rather than IO Bound.You are just using threads which are under GIL so effectively running just one thread plus overheads.It runs only on one core.To run on multiple cores use


import multiprocessing
pool = multiprocessing.Pool()
l = pool.map_async(product_helper,job_args)

from multiprocessing.dummy import Pool as ThreadPool is just a wrapper over thread module.It utilises just one core and not more than that.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download