CFraley CFraley - 6 months ago 29
Bash Question

Python script to clean .csv file based on array values

I am new to Python, so please forgive me. I have pieced this together through things i've found online, however, it's still not working exactly as it should.

I'm wanting a python script that will look in a given spreadsheet (list.csv), parse it for any "key_words", then export a file of only the rows that DO NOT contain any "key_words" called "cleaned.csv". I would like for it to only look in the first column, [0]. If possible, I would like for it to also export me a second spreadsheet of the ones that DO contain keywords, just to verify what it is scraping out.

This current code looks at the entire csv file and I see it not putting some of the rows in "cleaned.csv", when technically, it should be, unless there is a problem with my array.

Here is my current code...

key_words = [ 'Dog', 'Cat', 'Bird', 'Cow', ]

with open('list.csv') as oldfile, open('cleaned.csv', 'w') as newfile:
for line in oldfile:
if not any(key_word in line for key_word in key_words):
newfile.write(line)


First couple rows of data are...

Dog,Walks,Land,4legs,
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,


Cleaned.csv should show:

Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,


Other.csv (bad, matching array) should show:

Dog,Walks,Land,4legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,

Answer

Well code looks fine and worked for me, so there is no problem with it per se.

If you only want to check in the first row you have to split the line by ",":

key_words = ['Dog', 'Cat', 'Bird', 'Cow', ]

with open('list.csv') as oldfile, open('cleaned.csv', 'w') as cleaned, open("matched.csv", "w") as matched:
    for line in oldfile:
        if not any(key_word in line.split(",", 1)[0] for key_word in key_words):
            cleaned.write(line)
        else:
            matched.write(line)

If first column is always a "word" and not a "sentence" (like Dog is out) then you could improve the test like this:

if not line.split(",", 1)[0] in key_words:

NOTE: with strings test be careful with case sensitivity.

Note that providing a maxsplit=1 here line.split(",", 1) will improve string parsing performance especially if you have longer lines because it will stop parsing after finding first , and return a list of 2 items. First item will be your first column. Read more here:

https://docs.python.org/2/library/stdtypes.html#str.split

TEST RESULT:

mac: cat list.csv
Dog,Walks,Land,4legs,
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,

mac: cat cleaned.csv
Fish,Swims,Water,fins,
Kangaroo,Hops,Land,2legs,

mac: cat matched.csv
Dog,Walks,Land,4legs,
Cow,Walks,Land,4legs,
Bird,Flies,Air,2legs,
Comments