Mona Jalal - 1 month ago
Python Question

csv write doesn't work correctly

Any idea why this always writes the same line to the output CSV?

21     files = glob.glob(path)
22     csv_file_complete = open("graph_complete_reddit.csv", "wb")
23     stat_csv_file = open("test_stat.csv", "r")
24     csv_reader = csv.reader(stat_csv_file)
25     lemmatizer = WordNetLemmatizer()
26     for file1, file2 in itertools.combinations(files, 2):
27         with open(file1) as f1:
28             print(file1)
29             f1_text = f1.read()
30             f1_words = re.sub("[^a-zA-Z]", ' ', f1_text).lower().split()
31             f1_words = [str(lemmatizer.lemmatize(w, wordnet.VERB)) for w in f1_words if w not in stopwords]
32             print(f1_words)
33             f1.close()
34         with open(file2) as f2:
35             print(file2)
36             f2_text = f2.read()
37             f2_words = re.sub("[^a-zA-Z]", ' ', f2_text).lower().split()
38             f2_words = [str(lemmatizer.lemmatize(w, wordnet.VERB)) for w in f2_words if w not in stopwords]
39             print(f2_words)
40             f2.close()
41
42         a_complete = csv.writer(csv_file_complete, delimiter=',')
43         print("*****")
44         print(file1)
45         print(file2)
46         print("************************************")
47
48         f1_head, f1_tail = os.path.split(file1)
49         print("************")
50         print(f1_tail)
51         print("**************")
52         f2_head, f2_tail = os.path.split(file2)
53         print(f2_tail)
54         print("********************************")
55         for row in csv_reader:
56             if f1_tail in row:
57                 file1_file_number = row[0]
58                 file1_category_number = row[2]
59             if f2_tail in row:
60                 file2_file_number = row[0]
61                 file2_category_number = row[2]
62
63         row_complete = [file1_file_number, file2_file_number, file1_category_number, file2_category_number ]
64         a_complete.writerow(row_complete)
65
66     csv_file_complete.close()


Those prints show different filenames!

This is the test_stat.csv file that the code uses as input:

1,1bmmoc.txt,1
2,2b3u1a.txt,1
3,2mf64u.txt,2
4,4x74k3.txt,5
5,lsspe.txt,3
6,qbimg.txt,4
7,w95fm.txt,2


And here's what the code outputs:

1 7,4,2,5
2 7,4,2,5
3 7,4,2,5
4 7,4,2,5
5 7,4,2,5
6 7,4,2,5
7 7,4,2,5
8 7,4,2,5
9 7,4,2,5
10 7,4,2,5
11 7,4,2,5
12 7,4,2,5
13 7,4,2,5
14 7,4,2,5
15 7,4,2,5
16 7,4,2,5
17 7,4,2,5
18 7,4,2,5
19 7,4,2,5
20 7,4,2,5
21 7,4,2,5


Please comment or suggest fixes.

Answer

You're never rewinding stat_csv_file, so after the first pass of the outer loop, your loop over csv_reader (which is a wrapper around stat_csv_file) isn't looping at all, and you write whatever you found on that first pass. Basically, the logic is:

  1. On the first loop, you look through all of csv_reader and find the hits (though you keep looking even after you find them, exhausting the file), then write them
  2. On all subsequent loops, the file is exhausted, so the inner search loop doesn't even execute, and you end up writing the same values as last time
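
You can see the exhaustion for yourself with a minimal sketch. It uses an in-memory io.StringIO as a stand-in for test_stat.csv (an assumption made only so the snippet is self-contained) and iterates the same reader twice:

```python
import csv
import io

# In-memory stand-in for test_stat.csv (first three rows of the sample input)
stat_file = io.StringIO("1,1bmmoc.txt,1\n2,2b3u1a.txt,1\n3,2mf64u.txt,2\n")
reader = csv.reader(stat_file)

first_pass = [row for row in reader]   # consumes every row in the file
second_pass = [row for row in reader]  # the file is at EOF, so nothing is read

print(len(first_pass))   # 3
print(len(second_pass))  # 0
```

The second pass yields nothing, which is why the variables set inside your inner loop keep their values from the first pass.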

The slow, but direct way to fix this is to add stat_csv_file.seek(0) before you search it:

 53         print(f2_tail)
 54         print("********************************")
            stat_csv_file.seek(0)  # Rewind to rescan input from beginning
 55         for row in csv_reader:
 56             if f1_tail in row:
 57                 file1_file_number = row[0]
 58                 file1_category_number = row[2]
 59             if f2_tail in row:
 60                 file2_file_number = row[0]
 61                 file2_category_number = row[2]
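
As a self-contained illustration of the rewind (again with io.StringIO standing in for the real file, and a hypothetical helper named find_number that is not part of your code), something like this should behave the same way:

```python
import csv
import io

# In-memory stand-in for test_stat.csv
stat_file = io.StringIO("1,1bmmoc.txt,1\n2,2b3u1a.txt,1\n3,2mf64u.txt,2\n")
reader = csv.reader(stat_file)

def find_number(tail):
    """Hypothetical helper: return the first field of the row containing tail."""
    stat_file.seek(0)  # rewind so the reader starts from the first row again
    for row in reader:
        if tail in row:
            return row[0]  # stop at the first hit instead of scanning to EOF
    return None  # tail does not appear in any row

first = find_number("2b3u1a.txt")
second = find_number("1bmmoc.txt")  # found only because of the seek(0)
print(first, second)  # 2 1
```

Returning on the first hit also avoids reading the rest of the file once the row is found, but every lookup is still a linear scan.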

A likely better approach would be to load the input CSV into a dict once, then perform lookups there as needed, avoiding repeated (slow) I/O in favor of fast dict lookup. The cost is higher memory use. If the input CSV is small enough, that's not an issue; if it's huge, you may need to use a proper database to get rapid lookup without blowing memory.
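
For the huge-input case, here is a rough sketch of what the database route could look like with the standard library's sqlite3 module. The table name stats and the column names are assumptions for illustration, and an in-memory database is used only to keep the snippet self-contained; a file path would keep the data on disk instead:

```python
import csv
import io
import sqlite3

# In-memory stand-in for test_stat.csv
stat_file = io.StringIO("1,1bmmoc.txt,1\n2,2b3u1a.txt,1\n3,2mf64u.txt,2\n")

# ":memory:" is for the demo only; pass a filename to store the table on disk
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats ("
    "file_number TEXT, file_tail TEXT PRIMARY KEY, category_number TEXT)"
)
# Each CSV row has three fields, matching the three placeholders
conn.executemany("INSERT INTO stats VALUES (?, ?, ?)", csv.reader(stat_file))
conn.commit()

def lookup(tail):
    """Hypothetical helper: return (file_number, category_number) or None."""
    cur = conn.execute(
        "SELECT file_number, category_number FROM stats WHERE file_tail = ?",
        (tail,),
    )
    return cur.fetchone()

print(lookup("2mf64u.txt"))  # ('3', '2')
```

The PRIMARY KEY on file_tail gives an indexed lookup, so each query avoids rescanning the whole table.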

It's a little unclear what the logic should be here, since your inputs and outputs don't align (your output should start with a repeated digit, but it doesn't for some reason?). But if the intent is that the input contains file_number, file_tail, category_number, then you could begin your code (above the top-level loop) with:

# Create mapping from second field to associated first and third fields
tail_to_numbers = {ftail: (fnum, cnum) for fnum, ftail, cnum in csv_reader}

Then replace:

    for row in csv_reader:
        if f1_tail in row:
            file1_file_number = row[0]
            file1_category_number = row[2]
        if f2_tail in row:
            file2_file_number = row[0]
            file2_category_number = row[2]

    row_complete = [file1_file_number, file2_file_number, file1_category_number, file2_category_number ]
    a_complete.writerow(row_complete)

with the simpler and much faster:

try:
    file1_file_number, file1_category_number = tail_to_numbers[f1_tail]
    file2_file_number, file2_category_number = tail_to_numbers[f2_tail]
except KeyError:
    # One of the tails wasn't found in the lookup dict, so don't output
    # (variables would be stale or unset); optionally emit some error to stderr
    continue
else:
    # Found both tails, output associated values
    row_complete = [file1_file_number, file2_file_number, file1_category_number, file2_category_number]
    a_complete.writerow(row_complete)
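
Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes Python 3, uses io.StringIO objects in place of test_stat.csv and graph_complete_reddit.csv, and uses a hard-coded list of hypothetical paths in place of the glob results, so treat it as an outline and not a drop-in replacement:

```python
import csv
import io
import itertools
import os

# Stand-ins for test_stat.csv and the glob results (paths are hypothetical)
stat_file = io.StringIO("1,1bmmoc.txt,1\n2,2b3u1a.txt,1\n3,2mf64u.txt,2\n")
files = ["data/1bmmoc.txt", "data/2b3u1a.txt", "data/2mf64u.txt"]

# Build the lookup once, before the pair loop
tail_to_numbers = {
    ftail: (fnum, cnum) for fnum, ftail, cnum in csv.reader(stat_file)
}

output = io.StringIO()  # stands in for graph_complete_reddit.csv
writer = csv.writer(output, lineterminator="\n")

for file1, file2 in itertools.combinations(files, 2):
    f1_tail = os.path.basename(file1)
    f2_tail = os.path.basename(file2)
    try:
        f1_num, f1_cat = tail_to_numbers[f1_tail]
        f2_num, f2_cat = tail_to_numbers[f2_tail]
    except KeyError:
        continue  # skip pairs where a file is missing from the lookup
    writer.writerow([f1_num, f2_num, f1_cat, f2_cat])

print(output.getvalue())
# 1,2,1,1
# 1,3,1,2
# 2,3,1,2
```

Each pair now gets its own row, because the values come from a fresh dict lookup instead of variables left over from an earlier pass.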