
Why is writing to a file faster than multiprocessing.Pipe?

I am testing the fastest way to communicate between two processes: one writes data, the other receives it. My script shows that writing to and reading from a file is faster than using a pipe. How can this happen? Isn't memory faster than disk?

Write and read from a file:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from mutiprocesscomunicate import gen_data

data_size = 128 * 1024  # number of 1 KB chunks, i.e. 128 MB in total


def send_data_task(file_name):
    with open(file_name, 'wb+') as fd:
        for i in range(data_size):
            fd.write(gen_data(1))
            fd.write('\n'.encode('ascii'))
        # end-of-data marker
        fd.write('EOF'.encode('ascii'))
    print('send done.')


def get_data_task(file_name):
    offset = 0
    fd = open(file_name, 'r+')
    while True:
        data = fd.read(1024)
        offset += len(data)
        if 'EOF' in data:
            fd.truncate()
            break
        if not data:
            # the writer has not caught up yet; reopen and seek back
            # to where we left off, then try again
            fd.close()
            fd = open(file_name, 'r+')
            fd.seek(offset)
            continue
    print("recv done.")


if __name__ == '__main__':
    import multiprocessing
    import os
    import time

    pipe_out = pipe_in = 'throught_file'
    p = multiprocessing.Process(target=send_data_task, args=(pipe_out,))
    p1 = multiprocessing.Process(target=get_data_task, args=(pipe_in,))

    p.daemon = True
    p1.daemon = True

    start_time = time.time()
    p1.start()
    time.sleep(0.5)  # let the reader start first
    p.start()
    p.join()
    p1.join()
    os.sync()
    print('through file', data_size / (time.time() - start_time), 'KB/s')
    open(pipe_in, 'w+').truncate()  # clean up the temporary file


Use a pipe:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import multiprocessing
from mutiprocesscomunicate import gen_data

data_size = 128 * 1024  # number of 1 KB chunks, i.e. 128 MB in total


def send_data_task(pipe_out):
    for i in range(data_size):
        pipe_out.send(gen_data(1))
    # an empty message marks the end of the stream
    pipe_out.send("")
    print('send done.')


def get_data_task(pipe_in):
    while True:
        data = pipe_in.recv()
        if not data:
            break
    print("recv done.")


if __name__ == '__main__':
    import time

    pipe_out, pipe_in = multiprocessing.Pipe()
    p = multiprocessing.Process(target=send_data_task, args=(pipe_out,))
    p1 = multiprocessing.Process(target=get_data_task, args=(pipe_in,))

    p.daemon = True
    p1.daemon = True

    start_time = time.time()
    p1.start()
    p.start()
    p.join()
    p1.join()
    print('through pipe', data_size / (time.time() - start_time), 'KB/s')


The data-generation function:

def gen_data(size):
    onekb = "a" * 1024
    return (onekb * size).encode('ascii')


Result:

through file 110403.02025891568 KB/s

through pipe 75354.71358973449 KB/s

I am using macOS with Python 3.

Update

If the data is just 1 KB, the pipe is about 100 times faster than the file. But if the data is big, like 128 MB, the result is as above.

Answer

A pipe has a limited capacity, in order to match the speeds of producer and consumer (via back-pressure flow control) rather than consume an unlimited amount of memory. The particular limit on OS X, according to this Unix Stack Exchange answer, is 16 KiB. Since you're moving 128 MiB in 1 KiB messages, the writer can get at most 16 KiB ahead of the reader before it blocks, so the transfer requires at least 8192 writer/reader hand-offs, with the attendant system calls and context switches. When working with files, the size is limited only by your disk space or quota, and without an fdatasync or similar the data never needs to reach the disk at all: the writer can run ahead unhindered, and the reader reads it straight back out of the page cache. When the data is small, on the other hand, the time to find a place to put the file dominates, leaving the pipe far faster.
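The capacity effect is easy to measure: keep the total data fixed and vary the message size. Below is a minimal sketch of my own (not from the question; consume and bench are made-up names) that pushes 128 MiB through a Pipe using Connection.send_bytes/recv_bytes, which skip pickling overhead. If the small kernel buffer and the resulting hand-offs are the bottleneck, the 64 KiB case should be markedly faster than the 1 KiB case.

#!/usr/bin/env python
# Sketch: same total data, different message sizes over a Pipe.
import multiprocessing
import time

total = 128 * 1024 * 1024  # 128 MiB, as in the question


def consume(conn):
    # Drain the pipe until the expected number of bytes has arrived.
    received = 0
    while received < total:
        received += len(conn.recv_bytes())
    conn.close()


def bench(chunk_size):
    out_conn, in_conn = multiprocessing.Pipe()
    reader = multiprocessing.Process(target=consume, args=(in_conn,))
    reader.start()
    payload = b'a' * chunk_size
    start = time.time()
    for _ in range(total // chunk_size):
        out_conn.send_bytes(payload)  # raw bytes, no pickle step
    reader.join()
    print('%3d KiB chunks: %8.0f KB/s'
          % (chunk_size // 1024, (total / 1024) / (time.time() - start)))


if __name__ == '__main__':
    bench(1024)       # many small messages: many writer/reader hand-offs
    bench(64 * 1024)  # same data, fewer messages: far fewer hand-offs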

When you do use fdatasync, or simply exceed the memory available for disk caching, writing to a file slows down to match the actual disk transfer speed.
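To see how much the page cache contributes on the file side, force each write out to disk and compare. This is a minimal sketch under stated assumptions (write_throughput and the sync_test file name are made up); os.fdatasync is not available on macOS, so it uses os.fsync, and even that may not reach the platter there without fcntl's F_FULLFSYNC.

#!/usr/bin/env python
# Sketch: cached writes vs. writes forced to disk after every chunk.
import os
import time

chunk = b'a' * 1024
n_chunks = 4 * 1024  # only 4 MiB: an fsync per write is very slow


def write_throughput(path, sync_each_write):
    start = time.time()
    with open(path, 'wb') as fd:
        for _ in range(n_chunks):
            fd.write(chunk)
            if sync_each_write:
                fd.flush()
                os.fsync(fd.fileno())  # push the data out of the page cache
    return n_chunks / (time.time() - start)  # KB/s, since chunks are 1 KiB


if __name__ == '__main__':
    print('cached writes: %.0f KB/s' % write_throughput('sync_test', False))
    print('synced writes: %.0f KB/s' % write_throughput('sync_test', True))
    os.remove('sync_test')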