fgarci03 fgarci03 - 3 months ago 9
Python Question

How to remove duplicates of huge lists of objects in Python

I have gigantic lists of objects with many duplicates (I'm talking thousands of lists with thousands of objects each, taking up to about 10million individual objects (already without duplicates).

I need to go through them and remove all the duplicates inside each list (no need to compare between lists, only inside each one).

I can, of course, go through the lists and compare with any dedupe algorithm that has been posted a lot of times, but I would guess this would take me forever.

I thought I could create an object with a crafted

__hash__
method and use a
list(set(obj))
to remove them but first: I don't know if this would work, second: I would still have to loop the lists to convert the elements to the new object.

I know Python is not the best solution for what I am trying to achieve, but in this case, it will have to be done in Python. I wonder what would be the best way to achieve this with the best performance possible.

Edit: for clarification: I have about 2k lists of objects, with about 5k objects inside each one (rough estimate). The duplicated objects are copies, and not references to the same memory location. The lists (dicts) are basically converted JSON arrays




Edit 2: I'm sorry for not being clear, I will rephrase.

This is for a django data migration, although my question only applies to the data 'formatting' and not the framework itself or database insertion.
I inserted a whole bunch of data as JSON to a table for later analysis. Now I need to normalize it and save it correctly. I created new tables and need to migrate the data.

So when I retrieve the data from the db I have about 2000 JSON arrays. Applying
json.loads(arr)
(by the documentation) I get 2000 lists of objects (dicts). Each dict has only strings, numbers and booleans as values to each key, no nested objects/arrays, so something like this:

[
{
a: 'aa',
b: 2,
c: False,
date: <date_as_long> // ex: 1471688210
},
{
a: 'bb',
b: 4,
c: True,
date: <date_as_long> // ex: 1471688210
}
]


What I need is to run through every list and remove duplicates. Something is considered duplicate if all the fields except the date match (this wasn't in the original question, as I had not predicted it) inside a list. If they match across different lists, they are not considered duplicates.

After better analysis of the contents, I found out I have close to 2 million individual records (not 10 million as said previously).
The performance problems I face are because each dict needs to suffer some sort of data formatting (converting dates, for example) and 'wrap' it in the model object for database insertion:
ModelName(a='aaa', b=2, c=True, date=1471688210)
.

The insertion on the database itself is done by
bulk_create
.

NOTE: I'm sorry for the lack of clarification on the original question. The more I dug into this the more I learned about what had to be done and how to handle the data.




I accepted @tuergeist 's answer because it pointed to what I needed even with bad details on my part.

Given dicts cannot be hashed, thus I can't add them to a set(), my solution was to create a
set()
of tuples for the duplicated data, and verify the duplicates with it. This prevented an extra iteration if the duplicates where in a list.

So it was something like this:

data = [lots of lists of dicts]
formatted_data = []
duplicates = set()

for my_list in data:
for element in my_list:
a = element['a']
b = convert_whatever(element['b'])
c = element['c']

d = (a, b, c) # Notice how only the elements that count for checking if it's a duplicate are here (not the date)

if d not in duplicates:
duplicates.add(d)
normalized_data = {
a: a,
b: b,
c: c,
date: element['date']
}
formatted_data.append(MyModel(**normalized_data)

duplicates.clear()


After this, for better memory management, I used generators:

data = [lots of lists of dicts]
formatted_data = []
duplicates = set()

def format_element(el):
a = el['a']
b = convert_whatever(el['b'])
c = el['c']

d = (a, b, c)

if d not in duplicates:
duplicates.add(d)
normalized_data = {
'a': a,
'b': b,
'c': c,
'date': el['date']
}
formatted_data.append(MyModel(**normalized_data))

def iter_list(l):
[format_element(x) for x in l]
duplicates.clear()

[iter_list(my_list) for my_list in data]


Working code here: http://codepad.org/frHJQaLu

NOTE: My finished code is a little different (and in a functional style) than this one. This serves only as an example of how I solved the problem.




Edit 3:
For the database insertion I used bulk_create. In the end it took 1 minute to format everything correctly (1.5 million unique entries, 225k duplicates) and 2 minutes to insert everything to the database.

Thank you all!

Answer

A fast, not order preserving solution for (hashable items) is

def unify(seq):
    # Not order preserving
    return list(set(seq))

Complete Edit

I assume, that you have dicts inside a list. And you have many lists. The solution to remove duplicates from a single list is:

def remove_dupes(mylist):
    newlist = [mylist[0]]
    for e in mylist:
        if e not in newlist:
            newlist.append(e)
    return newlist

A list here contains the following dicts. (But all random)

{"firstName":"John", "lastName":"Doe"}, 
{"firstName":"Anna", "lastName":"Smith"}, 
{"firstName":"Peter","lastName":"Jones"}

Running this, it took 8s for 2000 dicts on my MacBook (2,4GHz, i5)

Complete code: http://pastebin.com/NSKuuxUe

Comments