tiredsys - 5 months ago
Bash Question

Two files comparison

I have a really weird problem. I've got three files, each containing one column of numbers. I need to get ONLY the values from the first file that are not present in the second and third files.

I tried Python, like this:

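    def compare(firstfile, secondfile):
        # firstfile and secondfile are the lines read from each file
        resultfile = []
        for e in firstfile:
            if e not in secondfile:
                resultfile.append(e)
        return resultfile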


And same for third file.

I tried uniq, sort, diff, some awk scripts and comm in the Linux shell, as described here: Fast way of finding lines in one file that are not in another?

But every time, the result has THE SAME NUMBER OF LINES AS THE ORIGINAL FIRST FILE. I don't get it at all!

Maybe I've missed something? Maybe it's something to do with the format? However, I've checked it many times. Here are the files: http://dropmefiles.com/BaKGj

P.S. Later I thought there might be no unique lines at all, but I checked manually, and some numbers in the first file ARE unique.

Answer

What's wrong

"And same for third file"

If you are really doing the same for the third file, i.e. comparing the original contents of the first file with the third, the combined result will be wrong: items that are in neither the second nor the third file are added twice, and items that are in only one of those two files are still included. For example:

file 1:
1
2
3

file 2:
1

file 3:
2

After processing file 2, resultfile would contain 2 and 3. Then after processing file 3, resultfile would contain 2 and 3 (from the first run) plus 1 and 3, i.e. 2, 3, 1, 3. However, the result should just be 3.
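The example above can be reproduced with a short sketch. The three files are stood in for by in-memory lists, and `not_in` is a hypothetical helper that mirrors the loop from the question; neither is part of the original code.

```python
def not_in(first, other):
    """Return the items of `first` that do not appear in `other`."""
    return [e for e in first if e not in other]


# Stand-ins for the three example files (one value per line).
file1 = ["1", "2", "3"]
file2 = ["1"]
file3 = ["2"]

# Buggy approach: both passes read the ORIGINAL first file.
resultfile = []
resultfile.extend(not_in(file1, file2))  # adds 2 and 3
resultfile.extend(not_in(file1, file3))  # adds 1 and 3
print(resultfile)  # ['2', '3', '1', '3']
```

Only 3 is missing from both other files, yet the combined output has four entries.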

It's not clear from your code whether you are actually writing the output of each run to the file resultfile. If you are, use that output as the input for the second and subsequent runs instead of processing the first file again.
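A minimal sketch of that chaining, again using in-memory lists in place of the real files. The helper name `not_in` and the variable names are assumptions made for illustration.

```python
def not_in(first, other):
    """Return the items of `first` that do not appear in `other`."""
    other = set(other)  # set lookup avoids rescanning `other` for every item
    return [e for e in first if e not in other]


file1 = ["1", "2", "3"]
file2 = ["1"]
file3 = ["2"]

# Each pass filters the output of the previous pass, not the original file.
after_second = not_in(file1, file2)
after_third = not_in(after_second, file3)
print(after_third)  # ['3']
```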


A better way to fix it

If you do not need to preserve the order of lines from the first file you could use set.difference() like this:

with open('file1') as f1, open('file2') as f2, open('file3') as f3:
    unique_f1 = set(f1).difference(f2, f3)

Note that this will include any whitespace (including newline characters) present in the files. If you wanted to ignore leading and trailing whitespace from each line:

from itertools import chain

with open('file1') as f1, open('file2') as f2, open('file3') as f3:
    unique_f1 = set(map(str.strip, f1)).difference(map(str.strip, chain(f2, f3)))
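Stray whitespace is one plausible reason for getting back every line of the first file, although that has not been confirmed against the uploaded files. The sketch below uses made-up contents with a trailing space and a missing final newline, and `io.StringIO` objects in place of real files, to show the difference stripping makes.

```python
import io

# Hypothetical file contents: "1 " has a trailing space, "3" has no newline.
text1 = "1 \n2\n3"
text2 = "1\n3\n"

# Raw comparison: "1 \n" != "1\n" and "3" != "3\n", so nothing is removed.
raw = set(io.StringIO(text1)).difference(io.StringIO(text2))
print(len(raw))  # 3

# Stripped comparison: only the value that is really missing remains.
stripped = set(map(str.strip, io.StringIO(text1))).difference(
    map(str.strip, io.StringIO(text2))
)
print(stripped)  # {'2'}
```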

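The above assumes Python 3. If you're using Python 2, you can optionally use itertools.imap (from itertools import imap) in place of map() for better efficiency, since it is lazy rather than building a list.

Or you might like to treat the data as numeric (I'll assume float here, but you can use int instead). Both float() and int() ignore leading and trailing whitespace, but they raise ValueError on blank lines: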

from itertools import chain

with open('file1') as f1, open('file2') as f2, open('file3') as f3:
    unique_f1 = set(map(float, f1)).difference(map(float, chain(f2, f3)))
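If the order of lines from the first file does matter, a set alone will not do, because sets are unordered. One possible sketch keeps a set only for the lookups and walks the first file in order. The function name `unique_in_order`, the sample data, and the choice to skip blank lines and repeated values are all assumptions made for illustration.

```python
import io
from itertools import chain


def unique_in_order(first, *others):
    """Yield stripped lines of `first` found in none of `others`, in order, once each."""
    excluded = set(map(str.strip, chain(*others)))
    seen = set()
    for line in first:
        value = line.strip()
        # Skip blank lines, excluded values, and values already yielded.
        if value and value not in excluded and value not in seen:
            seen.add(value)
            yield value


# In-memory stand-ins for open('file1'), open('file2'), open('file3').
f1 = io.StringIO("5\n3\n5\n1\n\n4\n")
f2 = io.StringIO("1\n")
f3 = io.StringIO("3\n")

result = list(unique_in_order(f1, f2, f3))
print(result)  # ['5', '4']
```

With real files, the same call would sit inside the `with open(...)` block used above.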