
Python: use disk space instead of RAM when there is not enough RAM

Consider the following Python code:

import csv
import sys

with open(sys.argv[2], 'r') as fin, \
     open(sys.argv[3], 'w') as fout:
    reader = csv.DictReader(fin, delimiter='%s' % sys.argv[4])
    writer = csv.DictWriter(fout, reader.fieldnames, dialect='excel')
    writer.writeheader()
    writer.writerows(reader)


Let's assume we have a big input file of about 2 GB and our system has only 512 MB of RAM; this may lead to a memory error (see the memory usage screenshot).


Is there a way to let my code use disk space instead of RAM, even if that makes it slower? Or is this an OS issue, and should I, for example, add more swap?

Update

The code above is only an example. Consider this example:

import csv, io, json, sys

with io.open(sys.argv[2], 'r', encoding='utf8', errors='ignore') as fin, \
     io.open(sys.argv[3], 'w', encoding='utf8', errors='ignore') as fout:
    rows = csv.DictReader(fin, delimiter='%s' % sys.argv[4])
    fout.write(json.dumps(list(rows), indent=4))


When using json.dumps you always need to write the data at once, and if you want to append to the file, you must read the whole file, append the new data, and write everything back, something like this:

with open(jsonfile) as f:
    data = json.load(f)       # read the whole document into memory
data.append(newentry)
with open(jsonfile, 'w') as f:
    json.dump(data, f)        # write everything back at once
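
One common alternative, which neither snippet above uses, is to write one JSON object per line (the JSON Lines format): new entries can then be appended without rereading the whole file. A minimal sketch, with a hypothetical output path and entry:

import json

logfile = 'rows.jsonl'                     # hypothetical path, one JSON object per line

def append_entry(entry):
    with open(logfile, 'a') as f:          # append mode: no need to read the existing data
        f.write(json.dumps(entry) + '\n')

append_entry({'name': 'example', 'value': 1})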


Update 2: using a generator (lazy evaluation)



I came up with this idea, but I'm not sure whether it makes a difference:

import csv
import json
import sys

def gen(iterable):
    for e in iterable:
        yield e

with open(sys.argv[2], 'r') as fin, \
     open(sys.argv[3], 'w') as fout:
    reader = csv.DictReader(fin, delimiter='%s' % sys.argv[4])
    writer = csv.DictWriter(fout, reader.fieldnames, dialect='excel')
    writer.writeheader()
    writer.writerows(gen(reader))

with open(sys.argv[2], 'r') as fin, \
     open(sys.argv[3], 'w') as fout:
    rows = csv.DictReader(fin, delimiter='%s' % sys.argv[4])
    # fout.write(json.dumps(gen(rows), indent=4)) -> causes error: <generator object gen at 0x025BDDA0> is not JSON serializable
    fout.write(json.dumps(list(gen(rows)), indent=4))
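
One way to check whether the generator version actually makes a difference is to measure the process's peak memory after the conversion. A minimal sketch using the standard-library resource module (on Linux, ru_maxrss is reported in kilobytes); convert() is a hypothetical stand-in for either variant above:

import resource

def convert():
    # Hypothetical placeholder for one of the conversion variants above.
    pass

convert()
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak memory usage: %d KB' % peak_kb)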

Answer

when using json.dumps you always need to write the data at once

Not really. For large data, you should adopt a streaming approach. In this case, something like:

fout.write('[')
for ii, row in enumerate(rows):
    if ii != 0:
        fout.write(',\n')
    json.dump(row, fout, indent=4)
fout.write(']')

This way you write one row at a time, and you also avoid the overhead of collecting all the rows into a list, which you don't need.
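
For completeness, here is a sketch of that streaming approach dropped into the script from the question, reusing the argv-based paths and delimiter:

import csv
import json
import sys

with open(sys.argv[2], 'r') as fin, \
     open(sys.argv[3], 'w') as fout:
    rows = csv.DictReader(fin, delimiter=sys.argv[4])
    fout.write('[')
    for ii, row in enumerate(rows):
        if ii != 0:
            fout.write(',\n')
        # Serialize one row at a time instead of building a full list in memory.
        json.dump(row, fout, indent=4)
    fout.write(']')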