NirIzr NirIzr - 2 months ago 8
Python Question

How can I find dict keys for matching values in two dicts?

I have two dictionaries mapping IDs to values. For simplicity, lets say those are the dictionaries:

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}


As named, the dictionaries are not symmetrical.
I would like to get a dictionary of keys from dictionaries
d_source
and
d_target
whose values match. The resulting dictionary would have
d_source
keys as its own keys, and
d_target
keys as that keys value (in either a
list
,
tuple
or
set
format).

This would be The expected returned value for the above example should be the following list:

{'a': ('1', 'A'),
'b': ('B',),
'c': ('C',),
'3': ('C',)}


There are two somewhat similar questions, but those solutions can't be easily applied to my question.

Some characteristics of the data:


  1. Source would usually be smaller than target. Having roughly few thousand sources (tops) and a magnitude more targets.

  2. Duplicates in the same dict (both
    d_source
    and
    d_target
    ) are not too likely on values.

  3. matches are expected to be found for (a rough estimate) not more than 50% than
    d_source
    items.

  4. All keys are integers.



What is the best (performance wise) solution to this problem?
Modeling data into other datatypes for improved performance is totally ok, even when using third party libraries (i'm thinking numpy)

Answer
from collections import defaultdict
from pprint import pprint

d_source  = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

d_result = defaultdict(list)
{d_result[a].append(b) for a in d_source for b in d_target if d_source[a] == d_target[b]}

pprint(d_result)

Output:

{'3': ['C'],
 'a': ['A', '1'],
 'b': ['B'],
 'c': ['C']}

Timing results:

from collections import defaultdict
from copy import deepcopy
from random import randint
from timeit import timeit


def Craig_match(source, target):
    result = defaultdict(list)
    {result[a].append(b) for a in source for b in target if source[a] == target[b]}
    return result

def Bharel_match(source_dict, *rest):
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

def modified_Bharel_match(source_dict, *rest):
    """Optimized for ~50% source match due to if statement addition.

    Also uses less memory.
    """
    unique_values = set(source_dict.values())
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            if v in unique_values:
                flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

# generate source, target such that:
# a) ~10% duplicate values in source and target
# b) 2000 unique source keys, 20000 unique target keys
# c) a little less than 50% matches source value to target value
# d) numeric keys and values
source = {}
for k in range(2000):
    source[k] = randint(0, 1800)
target = {}
for k in range(20000):
    if k < 1000:
        target[k] = randint(0, 2000)
    else:
        target[k] = randint(2000, 19000)

best_time = {}
approaches = ('Craig', 'Bharel', 'modified_Bharel')
for a in approaches:
    best_time[a] = None

for _ in range(3):
    for approach in approaches:
        test_source = deepcopy(source)
        test_target = deepcopy(target)

        statement = 'd=' + approach + '_match(test_source,test_target)'
        setup = 'from __main__ import test_source, test_target, ' + approach + '_match'
        t = timeit(stmt=statement, setup=setup, number=1)
        if not best_time[approach] or (t < best_time[approach]):
            best_time[approach] = t

for approach in approaches:
    print(approach, ':', '%0.5f' % best_time[approach])

Output:

Craig : 7.29259
Bharel : 0.01587
modified_Bharel : 0.00682