user5779223 - 1 year ago 69

Python Question

I have a data frame

`df`

`n`

    user_list = list(range(1, 200001))

    def gen_pair():
        for u1 in reversed(user_list):
            for u2 in reversed(list(range(1, u1))):
                yield (u1, u2)

    def cal_sim(u_pair):
        u1, u2 = u_pair
        # do something quite complex...
        sim = sim_f(df[u1], df[u2])
        if sim < 0.5:
            return None
        else:
            return (u1, u2, sim)

    with multiprocessing.Pool(processes=3) as pool:
        vals = pool.map(cal_sim, gen_pair())
        for v in vals:
            if v is not None:
                with open('result.txt', 'a') as f:
                    f.write('{0}\t{1}\t{2}\n'.format(v[0], v[1], v[2]))

When I take only the first 1000 users, it works quite well. But when I take all of them, it seems to hang, and nothing at all is written to `result.txt`.

EDIT:

Here is my code for `sim_f`:

    def sim_f(t1, t2):
        def intersec_f(l1, l2):
            return set(l1)&set(l2)
        def union_f(l1, l2):
            return set(l1)|set(l2)
        a_arr1, a_arr2 = t1[0], t1[1]
        b_arr1, b_arr2 = t2[0], t2[1]
        sim = float(len(union_f(intersec_f(a_arr1, a_arr2), intersec_f(b_arr1, b_arr2))))\
            / float(len(union_f(union_f(a_arr1, a_arr2), union_f(b_arr1, b_arr2))))
        return sim

Answer Source

There's little to go on, but try:

```
vals = pool.imap(cal_sim, gen_pair())
            ^
```

instead: note that I changed "map" to "imap". As documented, `map()` blocks until the *entire* computation is complete, so you never get to your `for` loop until all the work is finished. `imap` returns "at once".

And if you don't care in which order results are delivered, use `imap_unordered()` instead.
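
As a rough sketch of how these pieces could fit together, the version below streams results from `imap_unordered()` straight into the output file, so matches are written while the workers are still busy. It also sidesteps a second problem with `map()`: that method turns the generator into a full list before dispatching any work, which is not practical for this many pairs. The names `write_matches` and `n_users`, the `chunksize=1000` default, and the toy body of `cal_sim` are assumptions made for illustration; the real `cal_sim` would call `sim_f` on `df`.

```python
import multiprocessing


def gen_pair(n_users):
    """Lazily yield every pair (u1, u2) with u1 > u2, as in the question."""
    for u1 in range(n_users, 0, -1):
        for u2 in range(u1 - 1, 0, -1):
            yield (u1, u2)


def cal_sim(u_pair):
    """Toy stand-in (assumption): the real version would call sim_f on df."""
    u1, u2 = u_pair
    sim = 1.0 / (u1 - u2)  # hypothetical similarity, only so the demo runs
    return None if sim < 0.5 else (u1, u2, sim)


def write_matches(pool, n_users, path, chunksize=1000):
    """Write matches to `path` as workers finish them; return how many."""
    written = 0
    with open(path, "w") as f:  # open the file once, not once per result
        results = pool.imap_unordered(
            cal_sim, gen_pair(n_users), chunksize=chunksize)
        for v in results:  # results arrive as soon as a chunk is done
            if v is not None:
                f.write("{0}\t{1}\t{2}\n".format(*v))
                written += 1
    return written


if __name__ == "__main__":
    # The loop over results must run while the pool is still open.
    with multiprocessing.Pool(processes=3) as pool:
        print(write_matches(pool, 200, "result.txt"))
```

A `chunksize` well above the default of 1 cuts down on inter-process traffic when each task is small; the right value depends on how expensive `cal_sim` really is, so 1000 is only a starting guess.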

With respect to an issue raised in comments:

```
with open('result.txt', 'w') as f:
    for v in vals:
        if v is not None:
            f.write('{0}\t{1}\t{2}\n'.format(v[0], v[1], v[2]))
```

is the obvious way to open the file just once. But I'd be surprised if it helped you - all evidence to date suggests it's just that `cal_sim()` is plain expensive.
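
A back-of-the-envelope count gives a feel for the scale. The sketch below counts the pairs that `gen_pair()` produces for the two user counts mentioned in the question; the throughput figure `pairs_per_second` is purely an assumed number for illustration, not a measurement.

```python
def n_pairs(n_users):
    """Number of pairs (u1, u2) with u1 > u2 among n_users users."""
    return n_users * (n_users - 1) // 2


small = n_pairs(1000)    # 499,500 pairs
full = n_pairs(200000)   # 19,999,900,000 pairs

# Hypothetical throughput summed over all workers (assumption, not measured).
pairs_per_second = 100000

print("1,000 users:   {:,} pairs".format(small))
print("200,000 users: {:,} pairs".format(full))
print("work ratio:    about {:,.0f}x".format(full / small))
print("estimated:     about {:.1f} days at {:,} pairs/s".format(
    full / pairs_per_second / 86400, pairs_per_second))
```

Going from 1,000 to 200,000 users multiplies the number of pairs by roughly 40,000, so a run that looks instant on the sample can plausibly take days on the full set.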

There's lots of redundant work being done in:

```
def sim_f(t1, t2):
    def intersec_f(l1, l2):
        return set(l1)&set(l2)
    def union_f(l1, l2):
        return set(l1)|set(l2)
    a_arr1, a_arr2 = t1[0], t1[1]
    b_arr1, b_arr2 = t2[0], t2[1]
    sim = float(len(union_f(intersec_f(a_arr1, a_arr2), intersec_f(b_arr1, b_arr2))))\
        / float(len(union_f(union_f(a_arr1, a_arr2), union_f(b_arr1, b_arr2))))
    return sim
```

Wholly untested, here's an obvious ;-) rewrite:

```
def sim_f(t1, t2):
    a1, a2 = set(t1[0]), set(t1[1])
    b1, b2 = set(t2[0]), set(t2[1])
    sim = float(len((a1 & a2) | (b1 & b2))) \
        / len((a1 | a2) | (b1 | b2))
    return sim
```
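
Since the rewrite is untested, a quick randomized comparison against the original is cheap insurance. The sketch below assumes each input is a pair of non-empty lists of hashable items; `sim_f_old`, `sim_f_new`, `random_input`, and `versions_agree` are names made up for this check.

```python
import random


def sim_f_old(t1, t2):
    """The original version from the question."""
    def intersec_f(l1, l2):
        return set(l1) & set(l2)

    def union_f(l1, l2):
        return set(l1) | set(l2)

    a_arr1, a_arr2 = t1[0], t1[1]
    b_arr1, b_arr2 = t2[0], t2[1]
    return float(len(union_f(intersec_f(a_arr1, a_arr2),
                             intersec_f(b_arr1, b_arr2)))) \
        / float(len(union_f(union_f(a_arr1, a_arr2),
                            union_f(b_arr1, b_arr2))))


def sim_f_new(t1, t2):
    """The rewrite: each input is converted to a set exactly once."""
    a1, a2 = set(t1[0]), set(t1[1])
    b1, b2 = set(t2[0]), set(t2[1])
    return float(len((a1 & a2) | (b1 & b2))) / len((a1 | a2) | (b1 | b2))


def random_input(rng):
    """Hypothetical test data: a pair of non-empty lists of small ints."""
    return ([rng.randrange(20) for _ in range(rng.randint(1, 15))],
            [rng.randrange(20) for _ in range(rng.randint(1, 15))])


def versions_agree(trials=1000, seed=0):
    """Return True if both versions match on every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        t1, t2 = random_input(rng), random_input(rng)
        if sim_f_old(t1, t2) != sim_f_new(t1, t2):
            return False
    return True


print(versions_agree())
```

Both versions divide by the size of the combined union, so either one raises `ZeroDivisionError` if all four input lists are empty; the random inputs here avoid that case on purpose.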

This is faster because:

- Converting to sets is done only once for each input.
- No useless (but time-consuming) conversions of sets *to* sets.
- No internal function-call overhead for unions and intersections.
- One needless call to `float()` is eliminated (or, in Python 3, the remaining `float()` call could also be eliminated).
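
One further step in the same direction, which goes beyond what the answer above proposes: every user takes part in many pairs, so the per-user sets could be built once up front instead of once per call. The sketch below assumes Python 3 and that the per-user data can be read as a mapping from user id to a pair of lists; `data`, `precompute`, and `sim_from_cache` are hypothetical names, with `data` standing in for `df`.

```python
def precompute(data):
    """Build each user's intersection and union sets exactly once.

    `data` maps a user id to a pair of lists (a stand-in for df[u]).
    """
    inter, union = {}, {}
    for user, (arr1, arr2) in data.items():
        s1, s2 = set(arr1), set(arr2)
        inter[user] = s1 & s2
        union[user] = s1 | s2
    return inter, union


def sim_from_cache(inter, union, u1, u2):
    """Same value as sim_f(data[u1], data[u2]), without rebuilding sets."""
    # Python 3 true division, so no float() call is needed.
    return len(inter[u1] | inter[u2]) / len(union[u1] | union[u2])


# Tiny made-up data set, for illustration only.
data = {
    1: ([1, 2, 3], [2, 3, 4]),
    2: ([3, 4], [4, 5]),
    3: ([7], [8]),
}
inter, union = precompute(data)
print(sim_from_cache(inter, union, 1, 2))
```

How the cached sets reach the worker processes is left open here; that depends on how `df` is shared with the pool in the first place.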