e1v1s e1v1s - 5 months ago 8
Python Question

How to split file into smaller by first number in second column

So my data looks like:

1 3456542 5 may 2014
2 1245678 4 may 2014
3 4256876 2 may 2014
4 5643156 6 may 2014
.....


The goal is to sort it by the 2nd column then separate the rows based on the first number in the 2nd column (i.e. 3456542 goes to subs_3.txt, 1245678 goes to subs_1.txt...). The output is totally wrong and gives me 6 files with what appears to be random rows in them. Any suggestions?

import csv
from operator import itemgetter

file_lines = []
with open("subs.txt", "r") as csv_file:
reader = csv.reader(csv_file, delimiter=" ")
for row in reader:
file_lines.append(row)

file_lines.sort(key=itemgetter(1))

with open("sorted_subs.txt", "w") as csv_file:
writer = csv.writer(csv_file, delimiter=" ")
for row in file_lines:
writer.writerow(row)

for row in file_lines:
file_num = row[1[1]
with open("file_{0}.txt".format(file_num), "w") as f:
writer = csv.writer(f, delimiter=" ")
writer.writerow(row)

Answer

You could use itertools.groupby to group the lines that go to same file together and then just loop over the groups in order to write the files:

from itertools import groupby

for k, g in groupby(file_lines, key=lambda x: x[1][0]):
    with open("file_{0}.txt".format(k), "w") as f:
        csv.writer(f, delimiter=" ").writerows(g)

Update: groupby will group the lines based on the first number in second column. It will return the key used for grouping and iterator containing the grouped items. Since file_lines is already sorted we know that all items belonging to same group will be returned within one group. Here's a short example how it works, note that test data is different than in original question in order to demonstrate grouping:

from itertools import groupby

lst = [
    ['2', '1245678', '', '4', 'may', '2014'], 
    ['1', '3456542', '', '5', 'may', '2014'], 
    ['3', '3256876', '', '2', 'may', '2014'], 
    ['4', '5643156', '', '6', 'may', '2014']
]

for k, g in groupby(lst, key=lambda x: x[1][0]):
    print('key: {0}, items: {1}'.format(k, list(g)))

Output:

key: 1, items: [['2', '1245678', '', '4', 'may', '2014']]
key: 3, items: [['1', '3456542', '', '5', 'may', '2014'], ['3', '3256876', '', '2', 'may', '2014']]
key: 5, items: [['4', '5643156', '', '6', 'may', '2014']]