Leandro S. Matos - 28 days ago
Python Question

Python - CSV reader - Reading one row at a time

Okay, I have a CSV file with many lines (more than 40k currently). Because of the number of lines, I need to read them one at a time and run a sequence of operations on each. That is the first question. The second is: how do I read the CSV file as UTF-8, following the example in the csv documentation? Even when I use the class UTF8Recoder, the output of my print is \xe9 s\xf3. Could someone help me solve this problem?

import preprocessing
import pymongo
import csv,codecs,cStringIO
from pymongo import MongoClient
from unicodedata import normalize
from preprocessing import PreProcessing

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self

with open('data/MyCSV.csv','rb') as csvfile:
    reader = UnicodeReader(csvfile)
    #writer = UnicodeWriter(fout,quoting=csv.QUOTE_ALL)
    for row in reader:
        print row

def status_processing(corpus):

    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = corpus

    print "Starting..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"


Edit 1: S Ringne's solution works, but now I cannot run the operations inside my function. Here is the new code:

for csvfile in pd.read_csv('data/AracajuAgoraNoticias_facebook_statuses.csv',encoding='utf-8',sep=',', header='infer',engine='c', chunksize=2):

    def status_processing(csvfile):

        myCorpus = preprocessing.PreProcessing()
        myCorpus.text = csvfile

        print "Fazendo o processo inicial..."
        myCorpus.initial_processing()
        print "Feito."
        print "----------------------------"


And at the end of the script:

def main():
    status_processing(csvfile)

main()


This is the output when I use BeautifulSoup to remove links:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Answer

You can load your CSV into a pandas DataFrame and do further operations there, which would be quicker.

import pandas as pd
df = pd.read_csv('path_to_file.csv',encoding='utf-8',header = 'infer',engine = 'c')
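
If the goal is still to handle one row at a time, as asked in the question, here is a minimal sketch that uses only the standard csv module. It assumes Python 3, where open() decodes the file itself, so the UTF8Recoder and UnicodeReader wrappers are not needed. The names iter_rows and process_text and the column name status_message are hypothetical and only there for illustration. Each row comes back as a dict of plain strings, so a function that expects text receives a string and not a DataFrame.

```python
import csv


def iter_rows(path, encoding="utf-8-sig"):
    """Yield one CSV row at a time as a dict of str values.

    The file is streamed, so only the current row is held in memory.
    "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    """
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, "r", encoding=encoding, newline="") as f:
        for row in csv.DictReader(f):
            yield row


def process_text(text):
    # Hypothetical stand-in for the per-status processing in the question.
    return text.strip().lower()


# Usage sketch (the column name "status_message" is an assumption):
#
# for row in iter_rows("data/MyCSV.csv"):
#     print(process_text(row["status_message"]))
```

In Python 3, printing a single field shows the accented characters directly. Escapes such as \xe9 usually come from printing the repr of a whole list of Python 2 unicode strings, not from a decoding failure.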