astabada astabada - 1 year ago 36
Python Question

Removing duplicates: Python results differ from sort -u

I have a very long text file (2 GB) and removed duplicates from it in two ways, first with sort -u and then in Python:

sort -u filename > outfile1


>>> data = open('filename', 'r').readlines()
>>> u = list(set(data))
>>> open('outfile2', 'w').writelines(u)

However, the two files outfile1 and outfile2 have different numbers of entries:

wc -l outfile?
185866729 filename
109608242 outfile1
109611085 outfile2

How is this possible?

Following up on the request to see the data, I have found that sort -u will remove as duplicates entries like:


Effectively the second character is ignored in sort -u, and only the first entry is kept. Python instead does a good job of distinguishing the three records.

Answer

Without seeing the actual output (or at least the 'extra' lines), we can only guess.
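
One way to get at those 'extra' lines is a set difference between the two outputs. The following is only a minimal sketch: the function name extra_lines is made up for illustration, and it assumes one of the files fits in memory as a set (which the set() approach in the question already requires). Lines are compared as raw bytes, so no decoding or locale rules get in the way.

```python
def extra_lines(path_a, path_b):
    """Return the lines of path_b that do not occur in path_a.

    Files are opened in binary mode, so two lines count as equal
    only if they are byte-for-byte identical.
    """
    with open(path_a, "rb") as fa:
        seen = set(fa)  # every distinct line of the first file
    with open(path_b, "rb") as fb:
        return [line for line in fb if line not in seen]
```

Calling extra_lines('outfile1', 'outfile2') and printing repr() of the first few results should make invisible differences such as trailing spaces or unusual bytes visible.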

But it will come down to how much preprocessing sort does before comparing lines, since it is finding more duplicates than set().

Possible causes include:

  • Trailing spaces on some lines. They might be removed by sort but not by set().
  • Different handling of Unicode characters. Perhaps sort maps some of them onto a smaller set of equivalents, producing more duplicates.
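
Both guesses can be checked by counting distinct lines under different notions of "equal". The sketch below is an illustration under stated assumptions, not a reproduction of what sort does internally: the function names are hypothetical, and the optional locale check assumes that lines with the same locale.strxfrm key are the ones a locale-aware comparison would treat as equal.

```python
import locale


def count_distinct(lines, key):
    """Count distinct values of key(line) over an iterable of lines."""
    return len({key(line) for line in lines})


def compare_notions_of_duplicate(lines, use_locale=False):
    """Count distinct lines under several definitions of 'duplicate'."""
    lines = list(lines)  # materialise once so it can be scanned repeatedly
    counts = {
        # What set() does: exact string equality.
        "exact": count_distinct(lines, lambda s: s),
        # Hypothesis 1: trailing whitespace is ignored.
        "trailing whitespace ignored": count_distinct(lines, str.rstrip),
    }
    if use_locale:
        # Hypothesis 2: comparison follows the collation rules of the
        # locale taken from the environment (assumption, see lead-in).
        locale.setlocale(locale.LC_COLLATE, "")
        counts["locale collation"] = count_distinct(lines, locale.strxfrm)
    return counts
```

Running this on a sample of the file shows which count drops below the exact one. If the locale count is the one that drops, rerunning the shell command as LC_ALL=C sort -u, which compares raw bytes, would be a natural next check.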