Wedoso - 1 year ago
Python Question

How to write multiprocessing Python code with a dictionary and DataFrames

I spent a couple of hours on multiprocessing in Python. After reading the examples in the documentation, I wrote the code below. My plan is to add the values in two global DataFrames together and assign the result to a dictionary.

from multiprocessing import Process, Manager
import pandas as pd
import numpy as np
import time

def f(d):
    for i in C:
        d[i] = A.loc[i].sum() + B.loc[i].sum()

C = [10,20,30]
A = pd.DataFrame(np.matrix('1,2;3,4;5,6'), index = C, columns = ['A','B'])
B = pd.DataFrame(np.matrix('3,4;5,4;5,2'), index = C, columns = ['A','B'])

if __name__ == '__main__':
    manager = Manager()
    d = manager.dict()
    d = dict([(c, 0) for c in C])
    t0 = time.clock()
    p = Process(target=f, args=(d,))
    print time.clock()-t0, 'seconds processing time'
    print d

    d = dict([(c, 0) for c in C])
    t0 = time.clock()
    print time.clock()-t0, 'seconds processing time'
    print d

The result on my Linux server is shown below, which is not what I expected:

0.0 seconds processing time

{10: 0, 20: 0, 30: 0}

0.0 seconds processing time

{10: 10, 20: 16, 30: 18}

It seems the multiprocessing part didn't add the two DataFrames' values together. Could you give me some hints?

Thanks in advance.

Answer Source

Here is an example that works and that you could adapt:

You have to use a manager object to share memory between processes.

In your example you create a dictionary using the manager, but you overwrite it with a normal dictionary on the next line:

manager = Manager()
d = manager.dict()   # correct
d = dict([(c, 0) for c in C])  # d is not a manager.dict: no shared memory

Instead, do this (tested, compiles):

d = manager.dict([(c, 0) for c in C])
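For reference, here is a minimal sketch of how the whole script might look with that fix applied. It makes a few assumptions that are not in the question: it is written for Python 3 (so print is a function and time.perf_counter stands in for time.clock), it builds the frames with np.array instead of np.matrix, and the helper name run_in_child is made up for illustration. The code as shown in the question also never calls p.start() or p.join(), so the sketch adds both; without them the child process would not run at all.

```python
from multiprocessing import Manager, Process
import time

import numpy as np
import pandas as pd

C = [10, 20, 30]
A = pd.DataFrame(np.array([[1, 2], [3, 4], [5, 6]]), index=C, columns=['A', 'B'])
B = pd.DataFrame(np.array([[3, 4], [5, 4], [5, 2]]), index=C, columns=['A', 'B'])


def f(d):
    # Runs in the child process; every write goes through the manager proxy.
    for i in C:
        d[i] = int(A.loc[i].sum() + B.loc[i].sum())


def run_in_child():
    # Hypothetical helper (not from the question): fill a shared dict in a
    # child process and return a plain copy of it.
    with Manager() as manager:
        # Create and initialise the shared dict in one step, so the name d
        # is never rebound to an ordinary dict.
        d = manager.dict([(c, 0) for c in C])
        p = Process(target=f, args=(d,))
        p.start()  # actually launch the child process
        p.join()   # wait until it has finished writing
        return dict(d)  # copy the values out before the manager shuts down


if __name__ == '__main__':
    t0 = time.perf_counter()
    result = run_in_child()
    print(time.perf_counter() - t0, 'seconds processing time')
    print(result)
```

With the question's data this should print the same totals as the sequential run, {10: 10, 20: 16, 30: 18}. For a job this small, the cost of starting the manager and the child process will likely be larger than the computation itself.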