bhaskar jupudi bhaskar jupudi - 7 days ago 9
Python Question

Threadpool in python is not as fast as expected

I'm beginner to python and machine learning. I'm trying to reproduce the code for

countvectorizer()
using multi-threading. I'm working with yelp dataset to do sentiment analysis using
LogisticRegression
. This is what I've written so far:

Code snippet:

from multiprocessing.dummy import Pool as ThreadPool
from threading import Thread, current_thread
from functools import partial
data = df['text']
rev = df['stars']


y = []
def product_helper(args):
return featureExtraction(*args)


def featureExtraction(p,t):
temp = [0] * len(bag_of_words)
for word in p.split():
if word in bag_of_words:
temp[bag_of_words.index(word)] += 1

return temp


# function to be mapped over
def calculateParallel(threads):
pool = ThreadPool(threads)
job_args = [(item_a, rev[i]) for i, item_a in enumerate(data)]
l = pool.map(product_helper,job_args)
pool.close()
pool.join()
return l

temp_X = calculateParallel(12)


Here this is just part of code.

Explanation:

df['text']
has all the reviews and
df['stars']
has the ratings (1 through 5). I'm trying to find the word count vector
temp_X
using multi-threading.
bag_of_words
is a list of some frequent words of choice.

Question:

Without multi-threading , I was able to compute the
temp_X
in around 24 minutes and the above code took 33 mins for a dataset of size 100k reviews. My machine has 128GB of DRAM and 12 cores (6 physical cores with hyperthreading i.e., threads per core=2).

What am I doing wrong here?

vks vks
Answer

Your whole code seems CPU Bound rather than IO Bound.You are just using threads which are under GIL so effectively running just one thread plus overheads.It runs only on one core.To run on multiple cores use

Use

import multiprocessing
pool = multiprocessing.Pool()
l = pool.map_async(product_helper,job_args)

from multiprocessing.dummy import Pool as ThreadPool is just a wrapper over thread module.It utilises just one core and not more than that.

Comments