Abdel-Rahman Shoman - 9 days ago
Python Question

python gensim TypeError: coercing to Unicode: need string or buffer, list found

I believe that although this is a common issue with many similar questions (especially on Stack Overflow), the underlying cause differs from case to case.

In my case I have a method named readCorpus (code below). It reads a list of 21 files, extracts the docs from each file, then yields them.

The yield operation occurs at the end of reading each file

I have another method named uploadCorpus (code below). The main aim of this method is to upload that corpus.

Obviously the main reason behind using yield is that the corpus can be very large and I only need to read it once.

Once I run the method uploadCorpus I receive the error below

TypeError: coercing to Unicode: need string or buffer, list found


The error occurs at this line: self.readCorpus()])

Reading about similar problems, I came to understand that this happens when a list is passed where a string is expected. I tried updating the line in question to docs for docs in self.readCorpus()]) but I ended up with the same issue.

My code (uploadCorpus)

def uploadCorpus(self):
    #convert docs to corpus
    print "uploading"

    utils.upload_chunked(
        self.service,
        [{'id': 'doc_%i' % num, 'tokens': utils.simple_preprocess(doc)}
         for num, doc in enumerate([
             self.readCorpus()])
        ],
        chunksize=1000) # send 1k docs at a time


My code readCorpus()

def readCorpus(self):
    path = '../data/reuters'
    doc=''
    docs = []
    docStart=False

    fileCount=0

    print 'Reading Corpus'
    for name in glob.glob(os.path.join(path, '*.sgm')):
        print 'Reading File| ' + name
        docCount=0
        for line in open(name):
            if(len(re.findall(r'<BODY>', line)) > 0 ):
                docStart = True
                pattern = re.search(r'<BODY>.*', line)
                doc+= pattern.group()[6:]

            if(len(re.findall(r'</BODY>\w*', line)) > 0 ):
                docStart = False
                docs.append(doc)
                doc=''
                docCount+=1
                continue
                #break
            if(docStart):
                doc += line

        fileCount+=1
        print 'docuemnt[%d][%d]'%(fileCount,docCount)
        yield docs
        docs = []

Answer

The line below expects an iterable of documents, and readCorpus was meant to provide it as a generator using the yield keyword.

self.readCorpus()

However, readCorpus was not behaving the way that line needs, because yield was used at the wrong level.

The current implementation yields a whole list of documents once per file, so each item the caller receives is a list rather than a single document string, which is what the TypeError complains about ("list found"). The correct way is to yield the documents one by one.
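As a minimal sketch of that difference (not the actual corpus reader), the two toy generators below use a hypothetical in-memory stand-in for the files. The names read_in_batches, read_one_by_one and toy_files are made up for illustration; only the placement of yield mirrors the question.

```python
def read_in_batches(files):
    # Mirrors the original behaviour: one list of documents per file.
    for docs_in_file in files:
        yield list(docs_in_file)


def read_one_by_one(files):
    # Mirrors the fix: one document (a string) per iteration.
    for docs_in_file in files:
        for doc in docs_in_file:
            yield doc


# Hypothetical stand-in for two corpus files with a few documents each.
toy_files = [["first doc", "second doc"], ["third doc"]]

batched = list(read_in_batches(toy_files))
flat = list(read_one_by_one(toy_files))

# Each batched item is a list, which a string-only function would reject.
batched_types = [type(item).__name__ for item in batched]
# Each flat item is a plain string, ready for per-document preprocessing.
flat_types = [type(item).__name__ for item in flat]
```

Here batched holds two lists while flat holds three strings, which is the shape a per-document tokenizer expects.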

Hence readCorpus needs to be modified as follows:

def readCorpus(self):
        path = '../data/reuters'
        doc=''
        docStart=False

        fileCount=0

        print 'Reading Corpus'
        for name in glob.glob(os.path.join(path, '*.sgm')):
            print 'Reading File| ' + name
            docCount=0
            for line in open(name):
                if(len(re.findall(r'<BODY>', line)) > 0 ): 
                    docStart = True
                    pattern = re.search(r'<BODY>.*', line)
                    doc+= pattern.group()[6:]

                if(len(re.findall(r'</BODY>\w*', line)) > 0 ):
                    docStart = False
                    #docs.append(doc)
                    yield doc
                    doc=''
                    docCount+=1
                    continue
                    #break
                if(docStart):
                    doc += line

            fileCount+=1
            print 'docuemnt[%d][%d]'%(fileCount,docCount)
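
The call site probably needs a matching change too. In the question the generator is wrapped in a list before enumerate, which produces a single item (the generator object itself) instead of the documents. The sketch below assumes that is the intent. read_corpus, tokenize and build_records are hypothetical stand-ins (tokenize replaces utils.simple_preprocess so that the example runs without gensim), and the toy data is invented.

```python
def read_corpus(files):
    # Hypothetical stand-in for readCorpus: yields one document string at a time.
    for docs_in_file in files:
        for doc in docs_in_file:
            yield doc


def tokenize(doc):
    # Hypothetical stand-in for utils.simple_preprocess:
    # lowercases the text and splits it on whitespace.
    return doc.lower().split()


def build_records(doc_iter):
    # Pass the generator straight to enumerate. Wrapping it in [...] would
    # give enumerate a one-element list holding the generator object.
    return [{'id': 'doc_%i' % num, 'tokens': tokenize(doc)}
            for num, doc in enumerate(doc_iter)]


# Invented toy data standing in for two files.
toy_files = [["Oil prices rose", "Grain exports fell"], ["Rates unchanged"]]

records = build_records(read_corpus(toy_files))

# For comparison: the wrapped form sees exactly one item.
wrapped_count = len(list(enumerate([read_corpus(toy_files)])))
```

Note that the list comprehension still builds every record in memory before uploading, as in the original uploadCorpus. Whether utils.upload_chunked accepts a lazy iterable instead is not verified here.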