Thomas Baruchel Thomas Baruchel - 13 days ago 5
Python Question

Removing duplicates in a Python list by id

I build large lists of high-level objects while parsing a tree. However, after this step, I have to remove duplicates from the list and I found this new step very slow in Python 2 (it was acceptable but still a little slow in Python 3). However I know that distinct objects actually have a distinct id. For this reason, I managed to get a much faster code by following these steps:


  • append all objects to a list while parsing;

  • sort the list with
    key=id
    option;

  • iterate over the sorted list and remove an element if the previous one has the same id.



Thus I have a working code which now runs smoothly, but I wonder whether I can achieve this task more directly in Python.

Example. Let's build two identical objects having the same value but a different id (for instance I will take a
fractions.Fraction
in order to rely on the standard library):

from fractions import Fraction
a = Fraction(1,3)
b = Fraction(1,3)


Now, if I try to achieve what I want to do by using the pythonical
list(set(...))
, I get the wrong result as
{a,b}
keeps only one of both values (which are identical but have a different id).

My question now is: what is the most pythonical, reliable, short and fast way for removing duplicates by id rather than duplicates by value? Order of the list doesn't matter if it has to be changed.

Answer

You should override the __eq__ method so that it depends on the objects id rather than its values. But note that your objects must be hashable as well, so you should define a proper __hash__ method too.

class My_obj:
    def __init__(self, val):
        self.val = val

    def __hash__(self):
        return hash(self.val)

    def __eq__(self, arg):
        return id(self) == id(arg)

    def __repr__(self):
        return str(self.val)

Demo:

a = My_obj(5)
b = My_obj(5)

print({a, b})
{5, 5}
Comments