yatri yatri - 1 year ago 44
Python Question

python - extract lines in CSV file which don't have elements in a list

I have a list with substrings which I need to compare with a column in CSV file if any of the substring present in the list is present in that column of a CSV file. I would like to write those lines which don't have those substrings in that string column.
There are many columns in this file and I am looking only in one column.

Example my_string column has values { "This is just comparison of likely tokens","what a tough thing?"}

de = ["just","not","really ", "hat"]

I would like to write only the row which has "What a tough thing?"

This works fine if there is only the word in the list in the column. For instance if the my_string column has "really" it will not write to new file. But, it can't pass if item in list comes with other strings.

with open(infile, 'rb') as inFile, open(outfile, 'wb') as outfile:
reader = csv.reader(inFile, delimiter=',')
writer = csv.writer(outfile, delimiter=',')

for row[1] in reader:

if any(d in row[1] for d in de):

Answer Source

It sounds like you want to search for words instead of just substrings so that, for instance, "hat" won't match "What". Word searches can get complicated when want to match plurals, different cases, hyphenated strings and so on. But if you don't mind ignoring those complications, you can use a regex to break the column into a list of words, lower case them and then use set operations for the check.

import re
import csv

# TEST: write a sample csv file. using col0 to indicate what should be
# in the outfile
open('infile.csv', 'w').write(
"""exclude,This is just a comparison of likely tokens,col02,col03
include,what a tough thing?,col12,col13""")

# the words to find
de = ["just","not","really", "hat"]

# the files
infile = 'infile.csv'
outfile = 'outfile.csv'

# a "normalized set" of words to search
de = set(word.lower() for word in de)

def normalize_text(text):
    """Return a set of all the words in lowercased text"""
    return set(re.findall('\w+', text.lower()))

with open(infile, 'r') as inFile, open(outfile, 'w') as outFile:
    reader = csv.reader(inFile, delimiter=',')
    writer = csv.writer(outFile, delimiter=',')
    for row in reader:
        mycol = normalize_text(row[1])
        if not mycol & de:

print("---- output file ----")