I'm writing a simple crawler in Python using the threading and Queue modules. I fetch a page, check links and put them into a queue, and when a certain thread has finished processing a page, it grabs the next one from the queue. I'm using an array for the pages I've already visited to filter the links I add to the queue, but if there is more than one thread and they find the same links on different pages, they put duplicate links into the queue. So how can I find out whether some URL is already in the queue, to avoid putting it there again?
If you don't care about the order in which items are processed, I'd try a subclass of `Queue` that backs the queue with a set:

```python
from queue import Queue  # "from Queue import Queue" on Python 2

class SetQueue(Queue):
    def _init(self, maxsize):
        self.maxsize = maxsize
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()
```
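For illustration, here is how the set-backed queue behaves when the same URL is enqueued twice; the class is repeated so the snippet runs on its own, and the URLs are made up:

```python
from queue import Queue

class SetQueue(Queue):
    def _init(self, maxsize):
        self.maxsize = maxsize
        self.queue = set()  # the set silently absorbs duplicates

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()  # note: pops an arbitrary item, no FIFO order

q = SetQueue()
q.put("http://example.com/a")
q.put("http://example.com/a")  # duplicate, ignored by the set
q.put("http://example.com/b")
print(q.qsize())  # 2
```

Because `_get` is `set.pop`, retrieval order is arbitrary, which is why this variant is only suitable when processing order doesn't matter.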
As Paul McGuire pointed out, this would allow adding a duplicate item after it's been removed from the "to-be-processed" set but not yet added to the "processed" set. To solve that, you can store both sets in the `Queue` instance, but since you are then using the larger set to check whether an item has been seen, you can just as well go back to a regular `queue`, which will also order requests properly:
```python
from queue import Queue  # "from Queue import Queue" on Python 2

class SetQueue(Queue):
    def _init(self, maxsize):
        Queue._init(self, maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)
```
The advantage of this, as opposed to keeping a separate set, is that `Queue`'s methods are thread-safe, so you don't need any additional locking to check the set.
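To see the second variant in action, here is a small standalone sketch that enqueues the same links from two "pages" and drains the queue with a few worker threads; the worker function, thread count, and URLs are illustrative, not from the original post:

```python
import threading
from queue import Queue, Empty

class SetQueue(Queue):
    def _init(self, maxsize):
        Queue._init(self, maxsize)
        self.all_items = set()  # every item ever enqueued

    def _put(self, item):
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)

q = SetQueue()
seen = []                      # URLs actually processed
seen_lock = threading.Lock()   # protects the seen list, not the queue

def worker():
    # drain the queue; give up once it stays empty briefly
    while True:
        try:
            url = q.get(timeout=0.1)
        except Empty:
            return
        with seen_lock:
            seen.append(url)

# two "pages" that both link to the same two URLs
for _page in range(2):
    q.put("http://example.com/a")
    q.put("http://example.com/b")

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(seen))  # each URL appears exactly once
```

Even though each URL was put twice, the `all_items` check inside `_put` runs under the queue's own lock, so the workers between them process each URL exactly once.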