Robert Simoes Robert Simoes - 3 months ago 12
Python Question

Data Loss reading from one file, processing and writing to another? Python

I wrote a script to transform a large 4MB textfile with 40k+ lines of unordered data to a specifically formatted and easier to deal with CSV file.

Problem:

Analyzing my file sizes, it appears i've lost over 1MB of data (20K Lines | edit: original file was 7MB so lost ~4MB of data), and when I attempt to search specific data points present in CommaOnly.txt in sorted_CSV.csv I cannot find them.

I found this really weird so.

What I tried:

I searched for and replaced all unicode chars present in the CommaOnly.txt that might be causing a problem.. No luck!

Example: \u0b99 replaced with " "

Here's an example of some data loss

A line from: CommaOnly.txt

name,SJ Photography,category,Professional Services,
state,none,city,none,country,none,about,
Capturing intimate & milestone moment from pregnancy and family portraits to weddings


Sorted_CSV.csv

Not present.


What could be causing this?

Code:

import re
import csv
import time

# Final Sorted Order for all data:
#['name', 'data',
# 'category','data',
# 'about', 'data',
# 'country', 'data',
# 'state', 'data',
# 'city', 'data']


## Recieves String Item, Splits on "," Delimitter Returns the split List
def split_values(string):
string = string.strip('\n')
split_string = re.split(',', string)
return split_string


## Iterates through the list, reorganizes terms in the desired order at the desired indices
## Adds the field if it does not initially
def reformo_sort(list_to_sort):
processed_values=[""]*12
for i in range(11):
try:
## Terrible code I know, but trying to be explicit for the question
if(i==0):
for j in range(len(list_to_sort)):
if(list_to_sort[j]=="name"):
processed_values[0]=(list_to_sort[j])
processed_values[1]=(list_to_sort[j+1])
## append its neighbour

## if after iterating, name does not appear, add it.
if(processed_values[0] != "name"):
processed_values[0]="name"
processed_values[1]="None"

elif(i==2):
for j in range(len(list_to_sort)):
if(list_to_sort[j]=="category"):
processed_values[2]=(list_to_sort[j])
processed_values[3]=(list_to_sort[j+1])

if(processed_values[2] != "category"):
processed_values[2]="category"
processed_values[3]="None"

elif(i==4):
for j in range(len(list_to_sort)):
if(list_to_sort[j]=="about"):
processed_values[4]=(list_to_sort[j])
processed_values[5]=(list_to_sort[j+1])

if(processed_values[4] != "about"):
processed_values[4]="about"
processed_values[5]="None"

elif(i==6):
for j in range(len(list_to_sort)):
if(list_to_sort[j]=="country"):
processed_values[6]=(list_to_sort[j])
processed_values[7]=(list_to_sort[j+1])
if(processed_values[6]!= "country"):
processed_values[6]="country"
processed_values[7]="None"

elif(i==8):
for j in range(len(list_to_sort)):
if(list_to_sort[j]=="state"):
processed_values[8]=(list_to_sort[j])
processed_values[9]=(list_to_sort[j+1])

if(processed_values[8] != "state"):
processed_values[8]="state"
processed_values[9]="None"

elif(i==10):
for j in range(len(list_to_sort)):
if(list_to_sort[j]=="city"):
processed_values[10]=(list_to_sort[j])
processed_values[11]=(list_to_sort[j+1])

if(processed_values[10] != "city"):
processed_values[10]="city"
processed_values[11]="None"
except:
print("failed to append!")
return processed_values

# Converts desired data fields to a string delimitting values by ','
def to_CSV(values_to_convert):
CSV_ENTRY=str(values_to_convert[1])+','+str(values_to_convert[3])+','+str(values_to_convert[5])+','+str(values_to_convert[7])+','+str(values_to_convert[9])+','+str(values_to_convert[11])
return CSV_ENTRY


with open("CommaOnly.txt", 'r') as c:
print("Starting.. :)")

for line in c:
entry = c.readline()
to_sort = split_values(entry)
now_sorted = reformo_sort(to_sort)
CSV_ROW=to_CSV(now_sorted)
with open("sorted_CSV.csv", "a+") as file:
file.write(str(CSV_ROW)+"\n")

print("Finished! :)")
time.sleep(60)

Answer

I've rewritten the main loop that seems fishy to me, using csv package.

Your reformo_sort routine is incomplet and syntaxically incorrect, with empty elif blocks and missing processing, so I got incomplete lines, but that should work much better than your code. Note the usage of csv, the "binary" flag, the single open in write mode instead of open/close each line (much faster) and the 1-out-of-2 filtering of the now_sorted array.

with open("CommaOnly.txt", 'rb') as c:
    print("Starting.. :)")
    cr = csv.reader(c,delimiter=",",quotechar='"')
    with open("sorted_CSV.csv", "wb") as fw:
        cw = csv.writer(fw,delimiter=",",quotechar='"')

        for to_sort in cr:
            now_sorted = reformo_sort(to_sort)
            cw.writerow(now_sorted[1::2])