
Concatenate every n-th line from multiple large files in Python

Consider the following files of different size:

file1.txt

sad
mad
rad
cad
saf


file2.txt

er
ar
ir
lr
gr
cf


file3.txt

1
2
3
4
5
6
7
8
9


I am looking for a way to interleave every second line (i.e. lines 1, 3, 5, ...) from all the files, so the desired output file would be:

sad
er
1
rad
ir
3
saf
gr
5
7
9


I managed to do this for my test files using the following script:

import os

globalList = list()

for file in os.listdir('.'):
    if file.endswith('txt'):
        with open(file, 'r') as inf:
            l = list()
            n = 0
            for i, line in enumerate(inf):
                if i == n:
                    nline = line.strip()
                    l.append(nline)
                    n += 2

        globalList.append(l)

ouf = open('final.txt', 'w')

for i in range(len(max(globalList, key=len))):
    for x in globalList:
        if i < len(x):
            ouf.write(x[i])
            ouf.write('\n')

ouf.close()


The above script works fine with small test files. However, when I try it on my actual files (hundreds of files with millions of lines), my computer quickly runs out of memory and the script crashes. Is there a way to overcome this problem, i.e. to avoid storing so much information in RAM and instead write the lines directly to an output file? Thanks!

Answer

Try this code in Python 3:

script.py

from itertools import zip_longest
import glob


every_xth_line = 2
# Skip output.txt so a re-run does not pick up its own output.
files = [open(filename) for filename in glob.glob("*.txt")
         if filename != 'output.txt']

with open('output.txt', 'w') as f:
    trigger = 0
    for lines in zip_longest(*files, fillvalue=''):
        # Write one line from every file, but only on each x-th iteration.
        if not trigger:
            for line in lines:
                f.write(line)
        trigger = (trigger + 1) % every_xth_line

for file in files:
    file.close()

output.txt

sad
er
1
rad
ir
3
saf
gr
5
7
9

The file object returned by open can be iterated over directly, so no file is ever read into memory as a whole. zip_longest makes sure that the script runs until the longest file has been exhausted, and the fill values are simply inserted as empty strings (writing an empty string adds nothing to the output).
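To see what zip_longest does with inputs of different lengths, here is a tiny sketch with made-up lists standing in for the file iterators:

from itertools import zip_longest

# Stand-ins for two file iterators of different lengths.
a = ['sad\n', 'mad\n']
b = ['1\n', '2\n', '3\n']

print(list(zip_longest(a, b, fillvalue='')))
# [('sad\n', '1\n'), ('mad\n', '2\n'), ('', '3\n')]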

A trigger must be used to separate the wanted lines from the rest; the simple modulo operation makes the solution general, so every_xth_line can be set to something other than 2.
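For comparison, here is a sketch of the same streaming idea built on itertools.islice, which takes every x-th line of each file lazily instead of counting with a trigger (it assumes the same *.txt inputs and output.txt name as above):

from itertools import islice, zip_longest
import glob

every_xth_line = 2
files = [open(name) for name in glob.glob("*.txt")
         if name != 'output.txt']
# islice(f, 0, None, step) lazily yields lines 0, step, 2*step, ...
sliced = [islice(f, 0, None, every_xth_line) for f in files]

with open('output.txt', 'w') as out:
    for lines in zip_longest(*sliced, fillvalue=''):
        out.writelines(lines)

for f in files:
    f.close()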

As for scalability:

I tried to generate large-ish files:

cat /usr/share/dict/words > file1.txt
cat /usr/share/dict/words > file2.txt
cat /usr/share/dict/words > file3.txt

After some copy-pasting to grow them:

68M Nov  1 13:45 file1.txt
68M Nov  1 13:45 file2.txt
68M Nov  1 13:45 file3.txt

Running it:

time python3 script.py
4.31user 0.14system 0:04.46elapsed 99%CPU (0avgtext+0avgdata 9828maxresident)k
0inputs+206312outputs (0major+1146minor)pagefaults 0swaps

The result:

101M Nov  1 13:46 output.txt
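The numbers check out: keeping every second line of three 68M files should give roughly 3 × 34M ≈ 102M, which matches the 101M output, and the ~10M maxresident figure above shows that memory use stays flat no matter how large the inputs are.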