zerg . zerg . - 1 year ago 94
Perl Question

Efficient way of comparing two files and removing partial match

I have two files, example:




this is line3
this is partial
typo artial


  1. Delete every line from file2 that contains any line from file1.

  2. The match is partial: a line from file1 only has to appear somewhere inside a line of file2 (it does not have to match the full line).

  3. I am looking for the most efficient way, since I am comparing files with millions of lines.

  4. It can be achieved with any tool/language on Linux.
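
To pin down what rules 1 and 2 mean, here is a minimal Perl sketch of the substring test. It is a naive scan meant only to illustrate the semantics, not a fast solution. The sub name `remove_partial_matches` and the two needle strings are made up for illustration, since the listing of file1 is not shown above.

```perl
use strict;
use warnings;

# Return the lines that contain none of the needles as a substring.
# Naive scan over every (line, needle) pair: it shows the semantics, it is not fast.
sub remove_partial_matches {
    my ($needles, $lines) = @_;
    my @kept;
    LINE: for my $line (@$lines) {
        for my $needle (@$needles) {
            next LINE if index($line, $needle) >= 0;   # substring hit: drop the line
        }
        push @kept, $line;
    }
    return @kept;
}

# Hypothetical needles standing in for the contents of file1
my @needles = ('line3', 'partial');
my @lines   = ('this is line3', 'this is partial', 'typo artial');
print "$_\n" for remove_partial_matches(\@needles, \@lines);
```

Note that an empty needle (for example a blank line in file1) is a substring of every line, so it would remove everything; such lines probably need to be filtered out first.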

Expected result:

typo artial

I tested with Python, but it is extremely slow.
I also tested with grep, and it is nearly as slow as Python.

The files I am comparing can be up to 10GB in size. Memory on the server is not an issue, but I would rather not waste resources.

Testing results based on answers:

Files used for testing:

  • file1 with 7051 lines

  • file2 with 2182387 lines

Using grep:

# time grep -v -f file1 file2 > file3
real 28m50.078s
user 27m13.984s
sys 1m36.068s
# wc -l file3
1947790 file3

Grep with -F:

# time grep -v -F -f file1 file2 > file3
real 0m1.441s
user 0m1.400s
sys 0m0.040s
# wc -l file3
1950655 file3

Using perl posted by Borodin:

# time ./clean.pl > file3
real 0m2.281s
user 0m2.176s
sys 0m0.104s
# wc -l file3
1950655 file3

To be honest, I did not expect fixed strings to make such a big difference for grep. So far grep wins this; I still have to test with 10GB files and time it, after making sure the results are correct. Will be back with an update.
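
The two grep runs also leave different line counts (1947790 without -F, 1950655 with it). One plausible explanation, which is an assumption because the test files are not shown, is that without -F each line of file1 is read as a regular expression, so a character such as `.` matches any character and more lines get removed. A small Perl sketch of that difference, using hypothetical sub names and sample values:

```perl
use strict;
use warnings;

# Treat the pattern as a regex, as grep does without -F
sub matches_as_regex {
    my ($pat, $line) = @_;
    return $line =~ /$pat/ ? 1 : 0;
}

# Treat the pattern as a literal string, as grep does with -F
sub matches_as_literal {
    my ($pat, $line) = @_;
    return $line =~ /\Q$pat\E/ ? 1 : 0;
}

# Hypothetical values: the dot is a wildcard in regex mode only
my ($pattern, $line) = ('a.b', 'axb');
printf "regex: %d, literal: %d\n",
    matches_as_regex($pattern, $line), matches_as_literal($pattern, $line);
```

If that is what happened, the -F count is the one that corresponds to a plain substring match.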


Perl wins this one, since I had to introduce some regex for special cases. For instance, I have a big file of domains that I want to exclude from another file. That means I need domain$ as the regex; otherwise google.co would also match google.com, which is not OK.
If you do not have a special case like this (I had it only for some files), grep is the obvious performance winner.
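
A minimal sketch of the domain$ case in Perl, assuming each line should be dropped only when it ends with one of the listed domains. The sub name `build_suffix_regex` and the sample domains are hypothetical and not taken from the actual files.

```perl
use strict;
use warnings;

# Build one regex that matches only when a listed domain ends the line.
# quotemeta keeps the dots literal; the trailing $ anchors the match.
sub build_suffix_regex {
    my @domains = @_;
    my $alternation = join '|', map { quotemeta($_) } @domains;
    return qr/(?:$alternation)$/;
}

# Hypothetical domain list and input lines
my $re = build_suffix_regex('google.co', 'example.org');
for my $line ('www.google.com', 'www.google.co', 'mail.example.org') {
    print "$line\n" unless $line =~ $re;
}
```

With these sample values only www.google.com is printed, because google.co is no longer allowed to match in the middle of google.com.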

Answer Source

The simplest way is to build a regex pattern from all of the strings in file1.txt, and print only those lines in file2.txt that don't match the pattern:

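use strict;
use warnings FATAL => 'all';

# Build a single alternation of all the (escaped) strings in file1.txt
my $re = do {
    open my $fh, '<', 'file1.txt' or die $!;
    my @data = <$fh>;
    chomp @data;
    my $re = join '|', map quotemeta($_), @data;
    qr/$re/;
};

# Print only the lines of file2.txt that match none of those strings
open my $fh, '<', 'file2.txt' or die $!;
/$re/ or print while <$fh>;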


Output:

typo artial