Nick Slavsky - 1 year ago
Python Question

Build 2 lists in one go while reading from file, pythonically

I'm reading a big file with hundreds of thousands of number pairs representing the edges of a graph. I want to build 2 lists as I go: one with the forward edges and one with the reversed.

Currently I'm doing an explicit for loop, because I need to do some pre-processing on the lines I read. However, I'm wondering if there is a more Pythonic approach to building those lists, such as list comprehensions.

But, as I have 2 lists, I don't see a way to populate them using comprehensions without reading the file twice.

My code right now is:

with open('SCC.txt') as data:
    for line in data:
        line = line.rstrip()
        if line:
            edge_list.append((int(line.rstrip().split()[0]), int(line.rstrip().split()[1])))
            reversed_edge_list.append((int(line.rstrip().split()[1]), int(line.rstrip().split()[0])))

Answer

I would keep your logic as it is, since an explicit loop is the Pythonic approach here; just don't split/rstrip the same line multiple times:

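edge_list, reversed_edge_list = [], []

with open('SCC.txt') as data:
    for line in data:
        spl = line.split()  # split once; also drops surrounding whitespace
        if spl:  # an empty list means the line was blank
            i, j = map(int, spl)
            edge_list.append((i, j))
            reversed_edge_list.append((j, i))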

Calling rstrip on a line you have already stripped is redundant, and even more so when you then call split, because split with no arguments already discards the leading and trailing whitespace. Splitting just once saves a lot of unnecessary work.
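To illustrate why the extra rstrip calls are unnecessary, here is a small sketch of how str.split() behaves when called without arguments. The sample strings are made up for the demonstration and are not taken from SCC.txt.

```python
# str.split() with no arguments splits on runs of whitespace and
# ignores leading/trailing whitespace, including the newline.
samples = ["1 2\n", "  3   4  \n", "\n", "   \n"]

parsed = [s.split() for s in samples]
for raw, fields in zip(samples, parsed):
    print(repr(raw), "->", fields)

# Blank or whitespace-only lines produce an empty list, which is falsy,
# so `if spl:` skips them without a separate rstrip() call.
non_empty = [fields for fields in parsed if fields]
```

Because an empty list is falsy, the single split doubles as the blank-line check.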

You can also use csv.reader to read the data and filter out empty rows, provided the values are delimited by exactly one space:

from csv import reader

with open('SCC.txt') as data:
    edge_list, reversed_edge_list = [], []
    for i, j in filter(None, reader(data, delimiter=" ")):
        i, j = int(i), int(j)
        edge_list.append((i, j))
        reversed_edge_list.append((j, i))

Or, if the values may be separated by more than one whitespace character, you can use map(str.split, data):

    for i, j in filter(None, map(str.split, data)):
        i, j = int(i), int(j)
        edge_list.append((i, j))
        reversed_edge_list.append((j, i))
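The difference between the two variants can be seen on a small in-memory sample. This is only a sketch: the sample text is hypothetical, and io.StringIO stands in for the open file.

```python
from csv import reader
from io import StringIO

# Hypothetical sample: the second line uses three spaces between the numbers.
sample = "1 2\n3   4\n\n5 6\n"

# csv.reader treats every single space as a field separator, so runs of
# spaces yield empty-string fields; blank lines come back as empty lists.
csv_rows = list(reader(StringIO(sample), delimiter=" "))

# str.split() with no arguments collapses runs of whitespace.
split_rows = list(filter(None, map(str.split, StringIO(sample))))

print(csv_rows)
print(split_rows)
```

With the csv.reader variant, the row containing extra spaces has more than two fields, so unpacking it into i, j would raise a ValueError; the str.split variant is the safer choice when the spacing is not guaranteed.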

Whichever you choose will be faster than going over the data twice or splitting the same lines multiple times.
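For reference, the single-pass loop can be wrapped in a small function so it is easy to try out. This is a minimal sketch: the name build_edge_lists is hypothetical, and an in-memory io.StringIO stands in for SCC.txt.

```python
from io import StringIO

def build_edge_lists(lines):
    """Return (edge_list, reversed_edge_list) from an iterable of text lines."""
    edge_list, reversed_edge_list = [], []
    for line in lines:
        spl = line.split()
        if spl:  # skip blank lines
            i, j = map(int, spl)
            edge_list.append((i, j))
            reversed_edge_list.append((j, i))
    return edge_list, reversed_edge_list

# In-memory stand-in for SCC.txt; with a real file, pass the open file object.
sample = StringIO("1 2\n2 3\n\n3 1\n")
edges, reversed_edges = build_edge_lists(sample)
print(edges)
print(reversed_edges)
```

With a real file, the call would be made inside the with open('SCC.txt') as data: block, passing data as the argument, so the file is still read only once.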