Aaron Aaron - 6 months ago 12x
Python Question

How to efficiently tally co-occurrences in python list of lists

I have a relatively large (~3GB, 3+ million entries) list of sublists where each sublist contains a group of tags. Here's a very simple example:

from collections import Counter

tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]

unique_tags = ['dog', 'cat', 'fish']
co_occurences = {key: Counter() for key in unique_tags}

for tags in tag_corpus:
    tallies = Counter(tags)
    for key in tags:
        # add this sublist's tallies to the running count for each of its tags
        co_occurences[key] = co_occurences[key] + tallies

This works like a charm, sort of, but it's SUPER slow on the actual data set, which has very large sublists (~30K total unique tags). Any Python pros know how I can speed this thing up?


This might go faster, since it avoids building a new Counter with + on every pass through the inner loop. You'll have to measure.

from collections import Counter
from collections import defaultdict

tag_corpus = [['cat', 'fish'], ['cat'], ['fish', 'dog', 'cat']]

co_occurences = defaultdict(Counter)
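for tags in tag_corpus:
    for key in tags:
        # update() adds the counts in place instead of creating a new Counter
        co_occurences[key].update(tags)
unique_tags = sorted(co_occurences)

print(co_occurences)
print(unique_tags)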
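
If you want to do the measuring, here is a rough sketch of how the two approaches could be compared with timeit. The function names (tally_with_addition, tally_in_place, compare_timings) and the synthetic corpus sizes are made up for illustration; they are not the real data set, so treat the numbers you get as a hint only and rerun on a slice of your actual corpus.

```python
import random
import timeit
from collections import Counter, defaultdict


def tally_with_addition(tag_corpus, unique_tags):
    """Approach from the question: `+` returns a new Counter each time."""
    co_occurences = {key: Counter() for key in unique_tags}
    for tags in tag_corpus:
        tallies = Counter(tags)
        for key in tags:
            co_occurences[key] = co_occurences[key] + tallies
    return co_occurences


def tally_in_place(tag_corpus):
    """Approach from the answer: update each Counter in place."""
    co_occurences = defaultdict(Counter)
    for tags in tag_corpus:
        for key in tags:
            co_occurences[key].update(tags)
    return co_occurences


def compare_timings(tag_corpus, unique_tags, repeats=3):
    """Return (seconds_with_addition, seconds_in_place) over `repeats` runs."""
    t_add = timeit.timeit(
        lambda: tally_with_addition(tag_corpus, unique_tags), number=repeats)
    t_inplace = timeit.timeit(
        lambda: tally_in_place(tag_corpus), number=repeats)
    return t_add, t_inplace


if __name__ == "__main__":
    # Synthetic corpus; the sizes here are arbitrary assumptions.
    random.seed(0)
    vocabulary = ["tag%d" % i for i in range(200)]
    corpus = [random.sample(vocabulary, random.randint(1, 20))
              for _ in range(2000)]

    t_add, t_inplace = compare_timings(corpus, vocabulary)
    print("Counter + Counter: %.3f s" % t_add)
    print("Counter.update():  %.3f s" % t_inplace)
```

Both functions should give the same tallies for any tag that actually appears in the corpus, so whichever one is faster on your data can be swapped in without changing the result.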