MrPyCharm MrPyCharm - 1 month ago 12
Python Question

fast way to find occurrences in a list in python

I have a set of unique words called

h_unique
. I also have a 2D list of documents called
h_tokenized_doc
which has a structure like:


[ ['hello', 'world', 'i', 'am'],
['hello', 'stackoverflow', 'i', 'am'],
['hello', 'world', 'i', 'am', 'mr'],
['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]


and
h_unique
as:

('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')


what I want is to find the occurrences of the unique words in the tokenized documents list.

So far I came up with this code but this seems to be VERY slow. Is there any efficient way to do this?

term_id = []
for term in h_unique:
print term
for doc_id, doc in enumerate(h_tokenized_doc):
term_id.append([doc_id for t in doc if t == term])


In my case I have a document list of 7000 documents, structured like:

[ [doc1], [doc2], [doc3], ..... ]

SCB SCB
Answer

It'll be slow because you're running through your entire document list once for every unique word. Why not try storing the unique words in a dictionary and appending to it for each word found?

unique_dict = {term: [] for term in h_unique}
for doc_id, doc in enumerate(h_tokenized_doc):
    for term_id, term in enumerate(doc):
        try:
            # Not sure what structure you want to keep it in here...
            # This stores a tuple of the doc, and position in that doc
            unique_dict[term].append((doc_id, term_id))
        except KeyError:
            # If the term isn't in h_unique, don't do anything
            pass

This runs through all the document's only once.

From your above example, unique_dict would be:

{'pycharm': [], 'i': [(0, 2), (1, 2), (2, 2), (3, 2)], 'stackoverflow': [(1, 1), (3, 1)], 'am': [(0, 3), (1, 3), (2, 3), (3, 3)], 'mr': [(2, 4)], 'world': [(0, 1), (2, 1)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)]}

(Of course assuming the typo 'pycahrm' in your example was deliberate)

Comments