MrPyCharm MrPyCharm - 4 months ago 21
Python Question

fast way to find occurrences in a list in python

I have a set of unique words called

. I also have a 2D list of documents called
which has a structure like:

[ ['hello', 'world', 'i', 'am'],
['hello', 'stackoverflow', 'i', 'am'],
['hello', 'world', 'i', 'am', 'mr'],
['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]


('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

what I want is to find the occurrences of the unique words in the tokenized documents list.

So far I came up with this code but this seems to be VERY slow. Is there any efficient way to do this?

term_id = []
for term in h_unique:
print term
for doc_id, doc in enumerate(h_tokenized_doc):
term_id.append([doc_id for t in doc if t == term])

In my case I have a document list of 7000 documents, structured like:

[ [doc1], [doc2], [doc3], ..... ]


It'll be slow because you're running through your entire document list once for every unique word. Why not try storing the unique words in a dictionary and appending to it for each word found?

unique_dict = {term: [] for term in h_unique}
for doc_id, doc in enumerate(h_tokenized_doc):
    for term_id, term in enumerate(doc):
            # Not sure what structure you want to keep it in here...
            # This stores a tuple of the doc, and position in that doc
            unique_dict[term].append((doc_id, term_id))
        except KeyError:
            # If the term isn't in h_unique, don't do anything

This runs through all the document's only once.

From your above example, unique_dict would be:

{'pycharm': [], 'i': [(0, 2), (1, 2), (2, 2), (3, 2)], 'stackoverflow': [(1, 1), (3, 1)], 'am': [(0, 3), (1, 3), (2, 3), (3, 3)], 'mr': [(2, 4)], 'world': [(0, 1), (2, 1)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)]}

(Of course assuming the typo 'pycahrm' in your example was deliberate)