felih felih - 2 months ago 24
Python Question

Checksum for a list of numbers

I have a large number of lists of integers. I want to check if any of the lists are duplicates. I was thinking a good way of doing this would be to calculate a basic checksum, then only doing an element by element check if the checksums coincide. But I can't find a checksum algorithm with good properties, namely:


  • Verifies order effectively;

  • Quick to calculate;

  • Returns a small result, eg short integer;

  • Has a fairly uniform distribution, giving a low probability of different lists coinciding.



For example, a function check_sum which returned different numbers in the range [0,65536] for the following 5 calls would be ideal.

check_sum([1,2,3,4,5])
check_sum([1,2,3,5,4])
check_sum([5,4,3,2,1])
check_sum([1,2,3,4,4])


I looked at the IPv4 header checksum algorithm which returns a result of about the right size but doesn't check order so isn't what I'm looking for.

I'm going to implement it in python, but any format will do for algorithm, or pointer at a good reference material.

Answer

Calculate the checksums with hash():

checksums = \
    list(
        map(
            lambda l:
                hash(tuple(l)),
            list_of_lists
        )
    )

To know how many duplicates you have:

from collections import Counter

counts = Counter(checksums)

To compile a unique list:

unique_list = list(dict(zip(checksums, list_of_lists)).values())
Comments