shamirs888 shamirs888 - 1 month ago 18
Python Question

How to divide csv file with condition?

I have this csv file:

89,Network activity,ip-dst,,,1,20160929
89,Payload delivery,md5,4ad2924ced722ab65ff978f83a40448e,,1,20160929
89,Network activity,domain,,,1,20160929
90,Payload delivery,md5,197c018922237828683783654d3c632a,,1,20160929
90,Network activity,domain,,,1,20160929
90,Network activity,ip-dst,,,1,20160929
90,Payload delivery,filename,Airline.xls,,1,20160929
91,Payload delivery,md5,23a9bbf8d64ae893db17777bedccdc05,,1,20160929
91,Payload delivery,md5,07e47f06c5ed05a062e674f8d11b01d8,,1,20160929
91,Payload delivery,md5,bd75af219f417413a4e0fae8cd89febd,,1,20160929
91,Payload delivery,md5,9f4023f2aefc8c4c261bfdd4bd911952,,1,20160929
91,Network activity,domain,,,1,20160929
91,Payload delivery,md5,1e4653631feebf507faeb9406664792f,,1,20160929
92,Payload delivery,md5,6fa869f17b703a1282b8f386d0d87bd4,,1,20160929
92,Payload delivery,md5,24befa319fd96dea587f82eb945f5d2a,,1,20160929

I need to divide this csv file into 4 csv files, where the condition is the event number at the beginning of each row. So far I have created a set that includes all the event numbers {89, 90, 91, 92}, and I know that I need a nested loop to copy each row to its dedicated csv file.


It would be best not to hardcode the event numbers in your code, so it doesn't depend on the values in the data. I also prefer to use the csv module, which is optimized for reading and writing .csv files.

Here's a way to do that:

import csv

prefix = 'events'  # prefix of the output csv file names
data = {}

with open('conditions.csv', newline='') as conditions:  # text mode with newline='' (Python 3)
    reader = csv.reader(conditions)
    for row in reader:
        data.setdefault(row[0], []).append(row)  # group rows by event number

for event in sorted(data):
    csv_filename = '{}_{}.csv'.format(prefix, event)
    with open(csv_filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(data[event])  # write all of this event's rows at once
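Since the sample data is already sorted by event number, the same grouping can also be done with itertools.groupby instead of a dictionary. Here's a minimal sketch of that alternative, with a hypothetical subset of the question's data inlined as a string so it runs standalone:

```python
import csv
import io
import itertools

prefix = 'events'

# a hypothetical subset of the question's data, already sorted by event number
sample = (
    "89,Network activity,ip-dst,,,1,20160929\n"
    "90,Payload delivery,md5,197c018922237828683783654d3c632a,,1,20160929\n"
    "90,Network activity,domain,,,1,20160929\n"
)

reader = csv.reader(io.StringIO(sample))
# groupby emits one group per run of equal keys, so this only works
# because the rows are already sorted by event number
for event, rows in itertools.groupby(reader, key=lambda row: row[0]):
    with open('{}_{}.csv'.format(prefix, event), 'w', newline='') as csvfile:
        csv.writer(csvfile).writerows(rows)
```

This variant never holds more than one event's rows in memory, but it silently merges nothing across runs: if the input were not sorted, the same event number would produce multiple groups.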


The approach implemented above first reads the entire csv file into memory, and then writes all the rows associated with each event value into a separate output file, one event at a time.

A more memory-efficient approach is to open multiple output files simultaneously and write each row to its proper destination file immediately after it has been read. Doing this requires keeping track of which files are already open, and the file-managing code also needs to make sure all the files are closed when processing is complete.

In the code below, all of this is accomplished by defining and using a Python context manager class that centralizes the handling of all the csv output files that might be created, depending on how many different event values there are in the input file.

import csv
import sys
PY3 = sys.version_info.major > 2

class MultiCSVOutputFileManager(object):
    """Context manager to open and close multiple csv files and csv writers."""
    def __enter__(self):
        self.files = {}
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for file, csv_writer in self.files.values():
            print('closing file: {}'.format(file.name))
            file.close()
        return None

    def get_csv_writer(self, filename):
        if filename not in self.files:  # new file?
            open_kwargs = dict(mode='w', newline='') if PY3 else dict(mode='wb')
            print('opening file: {}'.format(filename))
            file = open(filename, **open_kwargs)
            self.files[filename] = file, csv.writer(file)

        return self.files[filename][1]  # return associated csv.writer object

And here's how to use it:

prefix = 'events'  # prefix of each csv output file's name

read_kwargs = dict(mode='r', newline='') if PY3 else dict(mode='rb')
with open('conditions.csv', **read_kwargs) as conditions:
    reader = csv.reader(conditions)
    with MultiCSVOutputFileManager() as file_manager:
        for row in reader:
            csv_filename = '{}_{}.csv'.format(prefix, row[0])  # row[0] is the event number
            writer = file_manager.get_csv_writer(csv_filename)
            writer.writerow(row)  # write the row to its event's file
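On Python 3.3+, the standard library's contextlib.ExitStack can play the same role as the hand-rolled class above, closing every registered file when the with-block exits. A minimal sketch, with a hypothetical subset of the question's data inlined as a string so it runs standalone:

```python
import contextlib
import csv
import io

prefix = 'events'

# a hypothetical subset of the question's data
sample = (
    "92,Payload delivery,md5,6fa869f17b703a1282b8f386d0d87bd4,,1,20160929\n"
    "92,Payload delivery,md5,24befa319fd96dea587f82eb945f5d2a,,1,20160929\n"
)

writers = {}  # maps filename -> csv.writer
with contextlib.ExitStack() as stack:
    for row in csv.reader(io.StringIO(sample)):
        filename = '{}_{}.csv'.format(prefix, row[0])
        if filename not in writers:  # first row for this event: open its file
            outfile = stack.enter_context(open(filename, 'w', newline=''))
            writers[filename] = csv.writer(outfile)
        writers[filename].writerow(row)
# leaving the with-block closes every opened output file, even on error
```

ExitStack.enter_context registers each opened file for cleanup, so no explicit __exit__ bookkeeping is needed.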