Harry Lens - 7 months ago 16
Python Question

Checking condition for each line in CSV

Say I have a sample csv file like this:

phonemes,graphemes
W IY K D EY,w ee k d ay
T EH K S T,t e x _ t
Y UW,ewe _
SH UW T,chu te
SH UW T,chu te
SH UW T,chu te !
SX AH K,s u ck


I want to check a specific condition for each line. While iterating over the lines, as soon as one element of a line meets the condition, I want to increment my counter by 1 and move on to the next line instead of checking the remaining elements in that line.

I believe this is similar to lazy evaluation, but I cannot figure out how to do it.

My code for evaluating:

for p, g in reader:
    phonemes = p.split()
    graphemes = g.split()
    if (len(phonemes) == len(graphemes) and
            all(p in valid_pset for p in phonemes) and
            all(g in valid_gset for g in graphemes)):
        valid_row += 1
        p_count += len(phonemes)
        g_count += len(graphemes)
    else:
        invalid_row += 1


So with this code it will evaluate each element in a single line, and every time it meets the requirement my valid_row or invalid_row will increment by 1.

That is not what I intend to do.
I would like to know whether there is a way to simply evaluate, increment, and go on to the next line, repeating this until the end of the file.

Edit: For a line to be valid, all of its elements must meet the requirement. What would be a concise way to do that (that is, check that all the characters in a line are valid, then increment the valid counter by 1)?

Edit: I suppose that when I hit an invalid character I can increment the counter, break out of the inner loop, and move on to the next line. Or is there a quicker way?

edit:


AA
AE
AH
AO
AW
AY
B
CH
D
DH
EH
ER
EY
F
G
HH
IH
IY
JH
K
L
M
N
NG
OW
OY
P
R
S
SH
T
TH
UH
UW
V
W
Y
Z
ZH


This is a text file containing all the valid phonemes (which I have already added to valid_pset).

And these are the valid graphemes (added to valid_gset):

valid_graphemes = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
                   'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '_'}


So for the sample file, for example, the number of invalid rows should be 4, but my code fails to produce that.

EDIT: It seems I may have found a way to do this. One last thing is keeping me from the correct answer: how do I check that every character of an element in a line is in the valid set? More specifically:

For ee, I want to check that both "e" characters in this "word" are in the valid_set. As long as each single "e" is in the set, ee should be valid. Any help on that?

Answer

EDIT: I modified the code to match the changes you made to the question.

I ran this code, and it seems to work. It gave me three valid rows, with an explanation for each invalid row:

import csv

valid_pset = set("""
    AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY
    JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH
    """.strip().split())
valid_gset = set("abcdefghijklmnopqrstuvwxyz_")

valid_row = 0
invalid_row = 0
p_count = 0
g_count = 0

with open('test.csv','r') as f:
    reader = csv.reader(f)
    # Skip headers
    next(reader)
    try:
        line = 1
        for p,g in reader:
            phonemes = p.split()
            graphemes = g.split()
            line += 1

            valid = True
            if len(phonemes) != len(graphemes):
                print("Line {}: Number of phonemes and graphemes differ.".format(line))
                valid = False

            bad_p = [p for p in phonemes if p not in valid_pset]
            if bad_p:
                print("Line {}: Invalid phonemes {}".format(line, bad_p))
                valid = False

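            # Flatten the graphemes into single characters so that "ee" is
            # checked as 'e', 'e' (g_count below therefore counts characters).
            graphemes = list(''.join(graphemes))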
            bad_g = [g for g in graphemes if g not in valid_gset]
            if bad_g:
                print("Line {}: Invalid graphemes {}".format(line, bad_g))
                valid = False

            if valid:
                valid_row += 1
                p_count += len(phonemes)
                g_count += len(graphemes)
            else:
                invalid_row += 1
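    except ValueError:
        # A row without exactly two columns cannot be unpacked into p, g;
        # reading stops quietly at that row.
        pass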

print("Valid rows: {}, Invalid rows: {}, p_count: {}, g_count: {}".format(
    valid_row, invalid_row, p_count, g_count))

Here's the output I got:

$ python test.py
Line 5: Number of phonemes and graphemes differ.
Line 6: Number of phonemes and graphemes differ.
Line 7: Invalid graphemes ['!']
Line 8: Invalid phonemes ['SX']
Valid rows: 3, Invalid rows: 4, p_count: 12, g_count: 16
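
If only the counts are needed and not the per-line explanations, a more compact variant is possible. The following is a minimal sketch rather than the code shown above: `is_valid_row` and `count_rows` are hypothetical helper names, the sample data is inlined so the snippet runs on its own, and treating rows without exactly two columns as invalid is an assumption. Because `all()` stops at the first failing element and `and` short-circuits, each line is abandoned as soon as one check fails, which is the "lazy" behaviour the question asks about.

```python
import csv
import io

# Assumed to mirror the valid sets described in the question.
VALID_PSET = set("""
    AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY
    JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH
    """.split())
VALID_GSET = set("abcdefghijklmnopqrstuvwxyz_")


def is_valid_row(p, g):
    """Hypothetical helper: True if the row passes all three checks."""
    phonemes = p.split()
    graphemes = g.split()
    return (
        len(phonemes) == len(graphemes)
        # all() stops at the first invalid phoneme.
        and all(ph in VALID_PSET for ph in phonemes)
        # Check every character of every grapheme, so "ee" passes
        # whenever "e" is in the set.
        and all(ch in VALID_GSET for gr in graphemes for ch in gr)
    )


def count_rows(lines):
    """Hypothetical helper: return (valid, invalid) counts, skipping the header."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    valid = invalid = 0
    for row in reader:
        # Assumption: a row without exactly two columns counts as invalid.
        if len(row) == 2 and is_valid_row(*row):
            valid += 1
        else:
            invalid += 1
    return valid, invalid


# Sample data copied from the question, inlined instead of read from test.csv.
SAMPLE = """phonemes,graphemes
W IY K D EY,w ee k d ay
T EH K S T,t e x _ t
Y UW,ewe _
SH UW T,chu te
SH UW T,chu te
SH UW T,chu te !
SX AH K,s u ck
"""

print(count_rows(io.StringIO(SAMPLE)))
```

With the sample data from the question this prints (3, 4), matching the valid and invalid counts in the output above.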