Leandro S. Matos - 10 months ago
Python Question

Python - CSV reader - Reading one row at a time

I have a CSV file with many lines (more than 40k currently). Because of the number of lines, I need to read them one at a time and run a sequence of operations on each. That is the first question. The second is: how do I read the CSV file and encode it as UTF-8, following the example in the csv documentation? Even when I use the UTF8Recoder class, the output of my print is \xe9 s\xf3. Could someone help me solve this problem?

import preprocessing
import pymongo
import csv,codecs,cStringIO
from pymongo import MongoClient
from unicodedata import normalize
from preprocessing import PreProcessing

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self

with open('data/MyCSV.csv','rb') as csvfile:
    reader = UnicodeReader(csvfile)
    #writer = UnicodeWriter(fout,quoting=csv.QUOTE_ALL)
    for row in reader:
        print row

def status_processing(corpus):

    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = corpus

    print "Starting..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"


Edit 1: The solution from Mr. S Ringne works, but now I cannot run the operations inside my function (def). Here's the new code:

for csvfile in pd.read_csv('data/AracajuAgoraNoticias_facebook_statuses.csv',encoding='utf-8',sep=',', header='infer',engine='c', chunksize=2):

    def status_processing(csvfile):

        myCorpus = preprocessing.PreProcessing()
        myCorpus.text = csvfile

        print "Doing the initial processing..."
        myCorpus.initial_processing()
        print "Done."
        print "----------------------------"


And at the end of the script:

def main():
    status_processing(csvfile)

main()


When I use BeautifulSoup to remove links, the output is:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
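
This error is raised whenever a DataFrame is used where a single true/false value is expected. With chunksize set, each item yielded by pd.read_csv is itself a DataFrame rather than a string, so the most likely cause is that a whole chunk reaches code that expects text. A minimal sketch of passing one string at a time instead, written for Python 3; the sample data, the column name status_message, and the body of status_processing are assumptions for illustration, not the asker's real file or preprocessing class:

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
SAMPLE_CSV = "status_id,status_message\n1,Olá Mundo\n2,É só um teste\n3,Fim\n"


def status_processing(text):
    """Placeholder for the real preprocessing step; receives ONE string."""
    return text.strip().lower()


def is_ambiguous(frame):
    """Return True if truth-testing `frame` raises the ValueError above."""
    try:
        bool(frame)
    except ValueError:
        return True
    return False


def process_in_chunks(source, text_column="status_message", chunksize=2):
    results = []
    # With chunksize set, read_csv yields DataFrames of up to `chunksize` rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Iterate over the cells of one column so each call gets a plain str.
        for text in chunk[text_column]:
            results.append(status_processing(text))
    return results


results = process_in_chunks(io.StringIO(SAMPLE_CSV))
print(results)
print(is_ambiguous(pd.DataFrame({"a": [1]})))
```

For a file on disk, the path and encoding='utf-8' would be passed to pd.read_csv as in the code above; the in-memory buffer is used here only to keep the sketch self-contained.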

Answer Source

You can load your CSV into a pandas DataFrame and do further operations on it, which would be quicker.

import pandas as pd
df = pd.read_csv('path_to_file.csv',encoding='utf-8',header = 'infer',engine = 'c')
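
If pandas is not required, reading one row at a time can also be sketched with the standard library alone. The sketch below assumes Python 3, where the csv module works on text directly and the UTF8Recoder and UnicodeReader helpers from the question are not needed; the sample bytes and the name iter_rows are hypothetical. It also illustrates why \xe9 s\xf3 showed up: printing a whole list shows the escaped representation of each element, while the data itself is decoded correctly.

```python
import csv
import io

# Hypothetical sample bytes; the leading BOM is what "utf-8-sig" strips.
SAMPLE_BYTES = "\ufeffid,message\n1,é só\n2,até logo\n".encode("utf-8")


def iter_rows(binary_file, encoding="utf-8-sig"):
    """Yield one decoded row (a list of str) at a time."""
    text_file = io.TextIOWrapper(binary_file, encoding=encoding, newline="")
    for row in csv.reader(text_file):
        yield row


rows = list(iter_rows(io.BytesIO(SAMPLE_BYTES)))
for row in rows:
    # Joining the fields prints the accented text itself.
    print(" | ".join(row))

# ascii() shows the escaped form of the same, correctly decoded, data.
print(ascii(rows[1]))
```

For a real file, open(path, encoding="utf-8-sig", newline="") can be handed to csv.reader directly; only one row is held in memory at a time when the loop consumes the reader instead of building a list.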