Thavivelball Thavivelball - 2 months ago 6
Python Question

Speed up a list operation using threads in python

I have a huge list of proxies (70k) and I have this script:

entries = open("proxy.txt").readlines()
proxiesp = [x.strip().split(":") for x in entries]
proxies = []
for x in proxiesp:
x = tuple(x)
proxies.append(x)
set(proxies)


And the operation of set(proxies) so the duplicates removing, is really slow. Is there a way to speed up this using threads?

Answer

No, threading won't speed this up, for three reasons:

  1. The Python GIL prevents Python code from being executed in parallel; threads executing Python code can only be run concurrently. For the same amount of CPU work, the same amount of time or more is required.

  2. To be able to add to the same datastructure from multiple threads, you'd have to add locking, slowing down threading more.

  3. Your code is slow because it is wasting cycles, because you are recreating the set object each iteration and then discarding it again. This is sucking up all the time as proxies continues to grow, so in the end you created sets for each size of proxies, from length 1 all the way up to length 70k, approaching 5 million steps to throw away 70k sets.

You should produce the set once. You can do so in a set comprehension:

with open('proxy.txt') as f:
    proxies = {tuple(line.strip().split(':')) for line in f}