
Unknown reason for dead multiprocessing in Python 3.5

I have a data frame df containing information on two hundred thousand items. Now I want to measure their pair-wise similarity and pick the top n pairs. That means nearly twenty billion similarity calculations (C(200000, 2) ≈ 2.0 × 10^10). Given the huge computation cost, I chose to run it via multiprocessing. Here is the code I have written so far:

import multiprocessing

user_list = list(range(1, 200001))

def gen_pair():
    for u1 in reversed(user_list):
        for u2 in reversed(list(range(1, u1))):
            yield (u1, u2)

def cal_sim(u_pair):
    u1, u2 = u_pair
    # do sth quite complex...
    sim = sim_f(df[u1], df[u2])
    if sim < 0.5:
        return None
    else:
        return (u1, u2, sim)

with multiprocessing.Pool(processes=3) as pool:
    vals = pool.map(cal_sim, gen_pair())
    for v in vals:
        if v is not None:
            with open('result.txt', 'a') as f:
                f.write('{0}\t{1}\t{2}\n'.format(v[0], v[1], v[2]))


When I take just the first 1000 users, it works quite well. But when I take all of them, the program seems to hang and not a single word appears in result.txt. Adding more processes leaves it just as dead. I wonder what the reason is and how I can fix it? Thanks in advance.

EDIT:

Here is my code for sim_f:

def sim_f(t1, t2):
    def intersec_f(l1, l2):
        return set(l1)&set(l2)

    def union_f(l1, l2):
        return set(l1)|set(l2)

    a_arr1, a_arr2 = t1[0], t1[1]
    b_arr1, b_arr2 = t2[0], t2[1]
    sim = float(len(union_f(intersec_f(a_arr1, a_arr2), intersec_f(b_arr1, b_arr2)))) \
        / float(len(union_f(union_f(a_arr1, a_arr2), union_f(b_arr1, b_arr2))))
    return sim

Answer

There's little to go on, but try:

vals = pool.imap(cal_sim, gen_pair())
            ^

instead: note that I changed "map" to "imap". As documented, map() blocks until the entire computation is complete, so you never reach your for loop until all the work is finished. imap() returns an iterator "at once". Note too that map() first materializes the entire iterable as a list, so with tens of billions of pairs it can exhaust memory before a single result is computed.

And if you don't care in which order results are delivered, use imap_unordered() instead.
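
For concreteness, here's a tiny self-contained sketch (slow_square is a made-up stand-in for an expensive task) showing the difference in behavior:

import multiprocessing
import time

def slow_square(x):
    time.sleep(0.1)          # pretend this is expensive
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(processes=3) as pool:
        # map() would block here until all 20 results exist;
        # imap_unordered() yields each result as a worker finishes it
        for r in pool.imap_unordered(slow_square, range(20)):
            print(r)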

OPENING THE FILE JUST ONCE

With respect to an issue raised in comments:

with open('result.txt', 'w') as f:
    for v in vals:
        if v is not None:
            f.write('{0}\t{1}\t{2}\n'.format(v[0], v[1], v[2]))

is the obvious way to open the file just once. But I'd be surprised if it helped you - all evidence to date suggests it's just that cal_sim() is plain expensive.
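
Putting the pieces together, a revised driver might look like this - an untested sketch, where chunksize (a documented argument of imap()/imap_unordered()) batches tasks to cut inter-process overhead, which matters when you have billions of tiny tasks:

with multiprocessing.Pool(processes=3) as pool:
    with open('result.txt', 'w') as f:
        # results stream in as workers finish; nothing is buffered
        # into a giant list first
        for v in pool.imap_unordered(cal_sim, gen_pair(), chunksize=1000):
            if v is not None:
                f.write('{0}\t{1}\t{2}\n'.format(v[0], v[1], v[2]))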

SPEEDING SIM_F

There's lots of redundant work being done in:

def sim_f(t1, t2):
    def intersec_f(l1, l2):
        return set(l1)&set(l2)

    def union_f(l1, l2):
        return set(l1)|set(l2)

    a_arr1, a_arr2 = t1[0], t1[1]
    b_arr1, b_arr2 = t2[0], t2[1]
    sim =  float(len(union_f(intersec_f(a_arr1, a_arr2), intersec_f(b_arr1, b_arr2))))\
    / float(len(union_f(union_f(a_arr1, a_arr2), union_f(b_arr1, b_arr2))))
    return sim

Wholly untested, here's an obvious ;-) rewrite:

def sim_f(t1, t2):
    a1, a2 = set(t1[0]), set(t1[1])
    b1, b2 = set(t2[0]), set(t2[1])
    sim = float(len((a1 & a2) | (b1 & b2))) \
        / len((a1 | a2) | (b1 | b2))
    return sim

This is faster because:

  • Converting to sets is done only once for each input.
  • No useless (but time-consuming) conversions of sets to sets.
  • No internal function-call overhead for unions and intersections.
  • One needless call to float() is eliminated (or, in Python 3, the remaining float() call could also be eliminated).
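
And a quick worked example (made-up sample data) to sanity-check the rewrite:

# each argument is a pair of item lists, as in the original
t1 = ([1, 2, 3], [2, 3, 4])
t2 = ([3, 4, 5], [4, 5, 6])

# intersections: {2, 3} and {4, 5}; their union has 4 elements.
# union of all four lists: {1, 2, 3, 4, 5, 6}, 6 elements.
print(sim_f(t1, t2))    # 4/6 = 0.666...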