kskp - 2 months ago
Python Question

Collect data in chunks from stdin: Python

I have the following Python code where I collect data from standard input into a list and run SyntaxNet on it. The data is in the form of JSON objects, from which I will extract the text field and feed it to SyntaxNet.

import sys

data = []
for line in sys.stdin:
    data.append(line)
run_syntaxnet(data)  # run_syntaxnet is my own function


I am doing this because I do not want SyntaxNet to run for every single tweet, since invoking it per tweet takes a very long time and hurts performance.

Also, when I run this code on very large input, I do not want to keep collecting data forever and run out of memory. So I want to collect the data in chunks, maybe 10,000 tweets at a time, and run SyntaxNet on each chunk. Can someone help me do this?

Also, I want to understand the maximum length the list `data` can reach before I run out of memory.

Answer

I would gather the data into chunks and process those chunks when they get "large":

LARGE_DATA = 10  # chunk size; raise this to e.g. 10000 for real runs

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) >= LARGE_DATA:   # >= so each chunk is exactly LARGE_DATA lines
        run_syntaxnet(data)
        data = []
if data:                          # process any leftover lines at EOF
    run_syntaxnet(data)
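
If you prefer to keep the batching logic out of the reading loop, the same idea can be sketched as a reusable generator built on `itertools.islice`. This is only a sketch: `run_syntaxnet` is the question's own function, and the chunk size of 10000 is an assumed value you would tune to your memory budget.

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:          # iterator exhausted
            return
        yield chunk

# With stdin it would be used as:
#   import sys
#   for batch in chunks(sys.stdin, 10000):
#       run_syntaxnet(batch)
```

Because `islice` pulls items lazily, only one chunk is ever held in memory at a time, which also answers the memory concern: peak list size is bounded by the chunk size rather than the input size.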