Wedoso Wedoso - 4 months ago 38
Python Question

How to write multiprocessing Python code with a dictionary and DataFrames

I spent a couple of hours on multiprocessing in Python. After reading the code in the documentation, I wrote the code below. My plan is to add the values in two global DataFrames together and assign the result to a dictionary.

from multiprocessing import Process, Manager
import pandas as pd
import numpy as np
import time

def f(d):
    for i in C:
        d[i] = A.loc[i].sum() + B.loc[i].sum()

C = [10,20,30]
A = pd.DataFrame(np.matrix('1,2;3,4;5,6'), index = C, columns = ['A','B'])
B = pd.DataFrame(np.matrix('3,4;5,4;5,2'), index = C, columns = ['A','B'])

if __name__ == '__main__':
    manager = Manager()
    d = manager.dict()
    d = dict([(c, 0) for c in C])
    t0 = time.clock()
    p = Process(target=f, args=(d,))
    p.start()
    p.join()
    print time.clock()-t0, 'seconds processing time'
    print d

    d = dict([(c, 0) for c in C])
    t0 = time.clock()
    f(d)
    print time.clock()-t0, 'seconds processing time'
    print d


The result on my Linux server is shown below, which is not what I expected:


0.0 seconds processing time

{10: 0, 20: 0, 30: 0}

0.0 seconds processing time

{10: 10, 20: 16, 30: 18}


It seems the multiprocessing part didn't add the two DataFrames' values together. Could you give me some hints?

Thanks in advance.

Answer

There is a working example here that you could adapt:

https://docs.python.org/2/library/multiprocessing.html

You have to use a manager object to be able to share memory between processes.

In your example you create a dictionary using the manager, but you overwrite it with a normal dictionary on the next line:

manager = Manager()
d = manager.dict()   # correct
d = dict([(c, 0) for c in C])  # d is not a manager.dict: no shared memory

Instead, do this (tested, works):

d = manager.dict([(c, 0) for c in C])
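
To see the difference side by side, here is a minimal sketch in Python 3 syntax. Several things in it are assumptions and not part of the question: the plain dicts A_ROWS and B_ROWS stand in for the DataFrames A and B so the sketch runs without pandas, the helper names run_with_manager_dict and run_with_plain_dict are made up for illustration, and the "fork" start method is pinned on the assumption of a Linux host like the one in the question.

```python
import multiprocessing as mp

# Plain-dict stand-ins (assumption) for the question's DataFrames A and B:
# row label -> row values.
C = [10, 20, 30]
A_ROWS = {10: [1, 2], 20: [3, 4], 30: [5, 6]}
B_ROWS = {10: [3, 4], 20: [5, 4], 30: [5, 2]}

# Assumption: a POSIX host, so the "fork" start method is available.
CTX = mp.get_context("fork")


def f(d):
    # Runs in the child process and writes one sum per row label into d.
    for i in C:
        d[i] = sum(A_ROWS[i]) + sum(B_ROWS[i])


def run_with_manager_dict():
    # d is a proxy to a dict living in the manager process, so the
    # child's writes are visible to the parent after join().
    with CTX.Manager() as manager:
        d = manager.dict([(c, 0) for c in C])
        p = CTX.Process(target=f, args=(d,))
        p.start()
        p.join()
        return dict(d)  # copy into a plain dict before the manager shuts down


def run_with_plain_dict():
    # d is an ordinary dict: the child only modifies its own copy,
    # so the parent's dict keeps its initial zeros.
    d = dict([(c, 0) for c in C])
    p = CTX.Process(target=f, args=(d,))
    p.start()
    p.join()
    return d


if __name__ == "__main__":
    print(run_with_manager_dict())  # {10: 10, 20: 16, 30: 18}
    print(run_with_plain_dict())    # {10: 0, 20: 0, 30: 0}
```

The only line that matters for the fix is the one that builds d: pass the initial items to manager.dict(...) so that d stays a manager proxy, and do not rebind the name to a plain dict afterwards.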