Daria Smirnova - 1 month ago
Python Question

Normalize all words in a document

I need to normalize all words in a huge corpus. Any ideas how to optimize this code? It's too slow...

texts = [ [ list(morph.normalize(word.upper()))[0] for word in document.split() ]
for document in documents ]


documents is a list of strings, where each string is a single book's text.

morph.normalize works only on upper-case input, so I apply .upper() to all words. Moreover, it returns a set with one element, which is the normalized word (a string).

Answer

The first and most obvious thing I'd do is cache the normalized words in a local dict, so as to avoid calling morph.normalize() more than once for a given word.

A second optimization is to alias methods to local variables - this avoids going through the whole attribute lookup + function descriptor invocation + method object instantiation on each turn of the loop.

Then, since it's a "huge" corpus, you probably want to avoid creating a full list of lists at once, which might eat all your RAM, make your computer start to swap (which is guaranteed to make it snail-slow) and finally crash with a memory error. I don't know what you're supposed to do with this list of lists, nor how huge each document is, but as an example I iterate over per-document results and write each one to stdout - what should really be done depends on the context and concrete use case.

NB: untested code, obviously, but it should at least get you started:

def iterdocs(documents, morph):
    # keep track of already normalized words
    # beware this dict might get too big if you
    # have a lot of different words. Depending on
    # your corpus, you may want to either use an LRU
    # cache instead and/or use a per-document cache
    # and/or any other appropriate caching strategy...
    cache = {} 

    # aliasing methods as local variables 
    # is faster for tight loops
    normalize = morph.normalize 

    def norm(word):
        upw = word.upper()
        if upw in cache:
            return cache[upw]
        nw = cache[upw] = normalize(upw).pop()
        return nw

    for doc in documents:
        # split() with no argument never yields empty strings
        words = [norm(word) for word in doc.split()]
        yield words

for text in iterdocs(documents, morph):
    # if you need all the texts for further use,
    # at least write them to disk or some other
    # persistent storage and re-read them when needed.
    # Here I just write them to sys.stdout as an example
    print(text)
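
The LRU option mentioned in the comments of iterdocs could look roughly like this, using functools.lru_cache from the standard library. This is only a sketch: FakeMorph is a hypothetical stand-in for the real morph object (assumed, as in the question, to return a one-element set for an upper-case word), make_norm is a name made up for the example, and the maxsize of 100000 is an arbitrary figure to tune for your corpus.

```python
from functools import lru_cache


class FakeMorph:
    """Hypothetical stand-in for the real analyzer, for illustration only."""

    def __init__(self):
        self.calls = 0

    def normalize(self, word):
        # Assumption: like the real morph.normalize, returns a one-element set.
        self.calls += 1
        return {word}


def make_norm(morph, maxsize=100_000):
    """Build a normalizer whose cache holds at most `maxsize` words."""
    normalize = morph.normalize  # local alias, as in iterdocs

    @lru_cache(maxsize=maxsize)
    def _norm_upper(upw):
        # next(iter(...)) reads the single element without mutating the set
        return next(iter(normalize(upw)))

    def norm(word):
        # upper-case first so "Cat" and "cat" share one cache entry
        return _norm_upper(word.upper())

    return norm


morph = FakeMorph()
norm = make_norm(morph)
words = [norm(w) for w in "the cat saw the other cat".split()]
print(words)
print(morph.calls)  # number of real normalize() calls, repeats are cached
```

Unlike a plain dict, the cache here stops growing once it holds maxsize entries and evicts the least recently used words first, which trades a few repeated normalize() calls for bounded memory.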

Also, I don't know where you get your documents from, but if they are text files, you may want to avoid loading them all into memory. Just read them one by one, and if they are themselves huge, don't even read a whole file at once (you can iterate over a file line by line - the most obvious choice for text).
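
Reading line by line might be sketched as follows. The function name iter_file_words and the UTF-8 default are assumptions made for the example, and str.upper merely stands in for whatever normalizing function you actually pass in.

```python
import os
import tempfile


def iter_file_words(path, norm, encoding="utf-8"):
    """Yield one list of normalized words per non-blank line of the file."""
    with open(path, encoding=encoding) as f:
        for line in f:  # the file object yields one line at a time
            words = [norm(word) for word in line.split()]
            if words:  # skip blank lines
                yield words


# Usage with a throwaway file; str.upper stands in for the real normalizer
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\n\nsecond line here\n")

lines = list(iter_file_words(tmp.name, str.upper))
os.remove(tmp.name)
print(lines)
```

Only one line of the file is held in memory at a time, so the size of a single book no longer matters.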

Finally, once you've made sure your code doesn't eat up too much memory for a single document, the next obvious optimisation is parallelisation - run one process per available core and split the corpus between processes (each writing its results to a given place). Then you just have to combine the results if you need them all at once...
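
A rough sketch of the one-process-per-core idea with the standard multiprocessing module is shown here. It assumes the normalizer can be created or imported inside each worker; fake_normalize is a hypothetical stand-in for morph.normalize, and the function names are made up for the example.

```python
from multiprocessing import Pool


def fake_normalize(upw):
    # Hypothetical stand-in for morph.normalize: returns a one-element set.
    return {upw}


_cache = {}  # module-level, so each worker process keeps its own copy


def normalize_document(document):
    """Normalize every word of one document; runs inside a worker process."""
    result = []
    for word in document.split():
        upw = word.upper()
        if upw not in _cache:
            _cache[upw] = next(iter(fake_normalize(upw)))
        result.append(_cache[upw])
    return result


def normalize_corpus(documents, processes=None):
    """Spread documents over `processes` workers (default: one per core)."""
    with Pool(processes) as pool:
        # imap yields results lazily and in input order
        for words in pool.imap(normalize_document, documents):
            yield words


if __name__ == "__main__":
    docs = ["the cat", "The Dog barks"]
    for words in normalize_corpus(docs, processes=2):
        print(words)
```

Here the results are sent back to the parent process, which is fine for moderate output. For really large documents it is probably cheaper to have each worker write its own output file and return only the file's path.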

Oh, and yes: if that's still not enough, you may want to distribute the work with some map-reduce framework - your problem looks like a perfect fit for map-reduce.