bumble bumble - 2 months ago 10
Python Question

Python CSV: How to ignore writing similar rows given one row meets a condition?

I'm currently keeping track of the large scale digitization of video tapes and need help pulling data from multiple CSVs. Most tapes have multiple copies, but we only digitize one tape from the set. I would like to create a new CSV containing only tapes of shows that have yet to be digitized. Here's a mockup of my original CSV:

Date Digitized | Series | Episode Number | Title | Format
---------------|----------|----------------|-------|--------
01-01-2016 | Series A | 101 | | VHS
| Series A | 101 | | Beta
| Series A | 101 | | U-Matic
| Series B | 101 | | VHS


From here, I'd like to ignore all fields containing "Series A" AND "101", as this show has a value in the "Date Digitized" cell. I attempted isolating these conditions but can't seem to get a complete list of undigitized content. Here's my code:

import csv, glob

names = glob.glob("*.csv")
names = [os.path.splitext(each)[0] for each in names]

for name in names:
with open("%s_.csv" % name, "rb") as source:
reader = csv.reader( source )
with open("%s_edit.csv" % name,"wb") as result:
writer = csv.writer( result )
for row in reader:
if row[0]:
series = row[1]
epnum = row[2]
if row[1] != series and row[2] != epnum:
writer.writerow(row)


I'll add that this is my first question and I'm very new to Python, so any advice would be much appreciated!

Answer

The simplest approach is to make two reads of the set of CSV files: one to build a list of all digitized tapes, the second to build a unique list of all tapes not on the digitized list:

# build list of digitized tapes
digitized = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        next(reader) # skip header
        for row in reader:
            if row[0] and ((row[1], row[2]) not in digitized):
                digitized.append((row[1], row[2]))

# build list of non-digitized tapes
digitize_me = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        header = next(reader)[1:3] # skip / save header
        for row in reader:
            if not row[0] and ((row[1], row[2]) not in digitized + digitize_me):
                digitize_me.append((row[1], row[2]))

# write non-digitized tapes to 'digitize.csv`
with open("digitize.csv","wb") as result:
    writer = csv.writer(result)
    writer.writerow(header)
    for tape in digitize_me:
        writer.writerow(tape)

input file 1:

Date Digitized,Series,Episode Number,Title,Format
01-01-2016,Series A,101,,VHS
,Series A,101,,Beta
,Series C,101,,Beta
,Series D,102,,VHS
,Series B,101,,U-Matic

input file 2:

Date Digitized,Series,Episode Number,Title,Format
,Series B,101,,VHS
,Series D,101,,Beta
01-01-2016,Series C,101,,VHS

Output:

Series,Episode Number
Series D,102
Series B,101
Series D,101

As per OP comment, the line

header = next(reader)[1:3] # skip / save header

serves two purposes:

  1. Assuming each csv file starts with a header, we do not want to read that header row as if it contained data about our tapes, so we need to "skip" the header row in that sense
  2. But we also want to save the relevant parts of the header for when we write the output csv file. We want that file to have a header as well. Since we are only writing the series and episode number, which are row fields 1 and 2, we assign just that slice, i.e. [1:3], of the header row to the header variable

It's not really standard to have a line of code serve two pretty unrelated purposes like that, which is why I commented it. It also assigns to header multiple times (assuming multiple input files) when header only needs to be assigned once. Perhaps a cleaner way to write that section would be:

# build list of non-digitized tapes
digitize_me = []
header = None
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        if header:
            next(reader) # skip header
        else:
            header = next(reader)[1:3] # read header
        for row in reader:
            ...

It's a question of which form is more readable. Either way is close but I thought combining 5 lines into one keeps the focus on the more salient parts of the code. I would probably do it the other way next time.

Comments