Karan Goel - 3 months ago
Python Question

Learning Python and threading. I think my code runs infinitely. Help me find bugs?

So I've started learning Python, and I'm absolutely in love with it.

I'm building a small-scale Facebook data scraper. Basically, it uses the Graph API to scrape the first names of a specified number of users. It works fine in a single thread (or with no threads, I guess).

I used online tutorials to come up with the following multithreaded version (updated code):

import requests
import json
import time
import threading
import Queue

GraphURL = 'http://graph.facebook.com/'
first_names = {} # will store first names and their counts
queue = Queue.Queue()

def getOneUser(url):
    http_response = requests.get(url) # open the request URL
    if http_response.status_code == 200:
        data = http_response.text.encode('utf-8', 'ignore') # Get the text of response, and encode it
        json_obj = json.loads(data) # load it as a json object
        # name = json_obj['name']
        return json_obj['first_name']
        # last = json_obj['last_name']
    return None

class ThreadGet(threading.Thread):
    """ Threaded name scraper """
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            #print 'thread started\n'
            url = GraphURL + str(self.queue.get())
            first = getOneUser(url) # get one user's first name
            if first is not None:
                if first_names.has_key(first): # if name has been encountered before
                    first_names[first] = first_names[first] + 1 # increment the count
                else:
                    first_names[first] = 1 # add the new name
            self.queue.task_done()
            #print 'thread ended\n'

def main():
    start = time.time()
    for i in range(6):
        t = ThreadGet(queue)
        t.setDaemon(True)
        t.start()

    for i in range(100):
        queue.put(i)

    queue.join()

    for name in first_names.keys():
        print name + ': ' + str(first_names[name])

    print '----------------------------------------------------------------'
    print '================================================================'
    # Print top first names
    for key in first_names.keys():
        if first_names[key] > 2:
            print key + ': ' + str(first_names[key])

    print 'It took ' + str(time.time()-start) + 's'

main()


To be honest, I don't understand some parts of the code, but I get the main idea. There is no output: the shell shows nothing, so I believe the program keeps on running.

So what I am doing is filling queue with integers that are the user IDs on FB. Then each ID is used to build the API call URL. getOneUser returns the name of one user at a time. That task (ID) is marked as 'done' and it moves on.

What is wrong with the code above?

Answer

Your original run function processed only one item from the queue and then returned. In all, you removed only 5 items from the queue.

Usually, run functions look like this:

def run(self):
    while True:
        doUsefulWork()

i.e. they have a loop which causes the recurring work to be done.

[Edit] OP edited code to include this change.
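
The difference between a one-shot run and a looping one can be seen without any network access. The following is only a minimal sketch, not the OP's code: the names count_leftovers, one_shot and looping are made up for illustration, it uses Python 3 syntax (the module is called Queue in Python 2), and the looping worker exits when the queue is empty instead of looping forever, purely so the demo threads can be joined.

```python
import queue      # named Queue in Python 2
import threading

def count_leftovers(worker_body, n_threads=5, n_items=100):
    """Run n_threads workers over n_items and return how many items stay queued."""
    q = queue.Queue()
    for i in range(n_items):
        q.put(i)
    threads = [threading.Thread(target=worker_body, args=(q,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()          # wait for the threads themselves, not for the queue
    return q.qsize()

def one_shot(q):
    # Handles a single item and returns, like a run() without a loop.
    q.get()
    q.task_done()

def looping(q):
    # Keeps pulling items until the queue is empty.
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            return
        q.task_done()

print(count_leftovers(one_shot))   # 95
print(count_leftovers(looping))    # 0
```

With the one-shot body, 95 of the 100 items are never marked done, which is why a queue.join() in that situation would wait forever.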

Some other useful things to try:

  • Add a print statement to the run function: you'll find that it is only called 5 times.
  • Remove the queue.join() call. This is what causes the module to block; without it you will be able to probe the state of the queue.
  • Put the entire body of run into a function. Verify that you can use that function in a single-threaded manner to get the desired results.
  • Then try it with just a single worker thread.
  • Finally, go for multiple worker threads.
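
The last three steps might look roughly like the sketch below. It rests on several assumptions: fake_get_one_user is a hypothetical stand-in for getOneUser so that nothing touches the network, and process_one, worker, run_threaded and counts_lock are names invented here. It is written in Python 3 syntax (Queue is queue there). Two things go beyond the original code and are my own additions: a lock around the shared dict, and a try/except/finally so that an exception in one item is reported and task_done() is still called. Without that, an exception would end the thread before task_done() runs and queue.join() would never return.

```python
import queue      # named Queue in Python 2
import threading

first_names = {}                  # first name -> count, shared by all workers
counts_lock = threading.Lock()    # assumption: guard the shared dict

def fake_get_one_user(user_id):
    # Hypothetical stand-in for getOneUser: no network, some IDs have no name.
    if user_id % 10 == 0:
        return None
    return ['Ann', 'Bob', 'Cy'][user_id % 3]

def process_one(user_id, fetch=fake_get_one_user):
    """The whole body of run() for one ID, callable without any threads."""
    first = fetch(user_id)
    if first is not None:
        with counts_lock:
            first_names[first] = first_names.get(first, 0) + 1

def worker(q):
    while True:
        user_id = q.get()
        try:
            process_one(user_id)
        except Exception as exc:
            # Report the failure and keep the thread alive.
            print('failed on %r: %s' % (user_id, exc))
        finally:
            q.task_done()         # always runs, so q.join() can return

def run_threaded(n_workers, n_items=100):
    first_names.clear()
    q = queue.Queue()
    for _ in range(n_workers):
        t = threading.Thread(target=worker, args=(q,))
        t.daemon = True           # do not keep the process alive on exit
        t.start()
    for i in range(n_items):
        q.put(i)
    q.join()                      # returns once every item is marked done
    return dict(first_names)

# Step 1: single-threaded, no queue at all.
first_names.clear()
for i in range(100):
    process_one(i)
expected = dict(first_names)

# Step 2: one worker thread. Step 3: several worker threads.
assert run_threaded(1) == expected
assert run_threaded(6) == expected
print(expected)
```

To probe the queue instead of blocking, q.join() can be swapped temporarily for a short time.sleep() followed by print(q.qsize()), which shows how many items are still waiting.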