AnkP - 29 days ago
Perl Question

Find similarities and differences between two large files based on the first column



I have two tab-delimited files with over a million lines, and I need to find, based on the first column, how many values are common to both and how many are specific to just one of the files.

I am trying to do it in Perl with the following code, but it isn't working right.

I need to consider the computational time given the size of the files.

Can someone please help me to correct this, or suggest a more efficient method?

left.txt



K00134:78_1 272 1 3057610
K00134:78_0 272 1 3057610
K00134:78_2 272 1 3057610
K00134:78_3 272 1 3057610


right.txt



K00134:78_1 272 1 3057610
K00134:78_5 272 1 3057610
K00134:78_6 272 1 3057610
K00134:78_3 272 1 3057610


Perl code



use strict;
use warnings;

my %Set;

open (SET1, "<", "left.txt") or die "cannot open file";

while (<SET1>) {
my @line = split (/\t/, $_);
$Set{$line[0]} = $line[1];
}

my @k = keys %Set;
foreach my $key (@k) {
print "$key, $Set{$key}\n";
}
close SET1;

open (SET2, "<", "right.txt") or die "cannot open file";
print "common:\n";

while (<SET2>) {
chomp;

if ( exists $Set{"$_"} ) {
print "$Set{$_}\n";
}
}

close SET2;


The output should look like this, listing the common fields based on first column -

common lines -
K00134:78_1 272 1 3057610
K00134:78_3 272 1 3057610


uncommon lines - left.txt

K00134:78_0 272 1 3057610
K00134:78_2 272 1 3057610


uncommon lines - right.txt

K00134:78_5 272 1 3057610
K00134:78_6 272 1 3057610


Also, I am trying to output the mismatches from each file too, but I am not sure if it's possible given the size of the files. Thanks!

Answer

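Your second read loop is wrong. It uses the whole chomped line as the hash key, but %Set is keyed on the first column only, so the exists test never matches. Split each line on tabs and check the first field instead. Change it to: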

while (<SET2>) {
    my @line = split (/\t/, $_);
    print $_ if exists $Set{$line[0]};
}

With that change it will work. Your approach is otherwise reasonable. Since you only compare the first column, you don't need to store the second column ($line[1]) as the value in %Set; you can store '' instead in an attempt to save memory. For the same reason, make sure left.txt is the smaller of the two files, because only that file's keys are held in memory. Here is a working example:

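use strict;
use warnings;

# Keys are the first-column values of left.txt; the values are unused.
my %Set;

open (SET1, "<", "left.txt") or die "cannot open left.txt: $!";

# Remember the first column of every line in left.txt.
while (<SET1>) {
    my @line = split (/\t/, $_);
    $Set{$line[0]} = '';
}

close SET1;

open (SET2, "<", "right.txt") or die "cannot open right.txt: $!";
print "common:\n";

# Print each line of right.txt whose first column also occurs in left.txt.
while (<SET2>) {
    my @line = split (/\t/, $_);
    print $_ if exists $Set{$line[0]};
}

close SET2;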
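
The example above prints only the common lines. The question also asks for the lines unique to each file and for counts, so here is a minimal sketch of one way to get those with the same hash-based approach. The sub name compare_by_first_column and the output file names built from $prefix are made up for illustration, not part of the original answer. It assumes first-column values are unique within each file (counts are per line, not per distinct key), and it reads left.txt a second time so that only the keys, not whole lines, stay in memory.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): classifies lines of two
# tab-delimited files by their first column and returns a hashref of counts.
# Output goes to "$prefix.common.txt", "$prefix.left_only.txt" and
# "$prefix.right_only.txt" (illustrative names).
sub compare_by_first_column {
    my ($left, $right, $prefix) = @_;

    # Pass 1: store every first-column key of the left file.
    # The value is a flag: 0 = not seen in the right file yet, 1 = seen.
    my %seen;
    open(my $lfh, '<', $left) or die "cannot open $left: $!";
    while (my $line = <$lfh>) {
        chomp $line;
        next unless length $line;          # skip empty lines
        my ($key) = split /\t/, $line, 2;  # only the first field is needed
        $seen{$key} = 0;
    }
    close $lfh;

    open(my $common_fh, '>', "$prefix.common.txt")
        or die "cannot write $prefix.common.txt: $!";
    open(my $ronly_fh, '>', "$prefix.right_only.txt")
        or die "cannot write $prefix.right_only.txt: $!";
    open(my $lonly_fh, '>', "$prefix.left_only.txt")
        or die "cannot write $prefix.left_only.txt: $!";

    my %count = (common => 0, left_only => 0, right_only => 0);

    # Pass 2: each right-file line is either common or unique to the right.
    open(my $rfh, '<', $right) or die "cannot open $right: $!";
    while (my $line = <$rfh>) {
        chomp $line;
        next unless length $line;
        my ($key) = split /\t/, $line, 2;
        if (exists $seen{$key}) {
            $seen{$key} = 1;
            print {$common_fh} "$line\n";
            $count{common}++;
        }
        else {
            print {$ronly_fh} "$line\n";
            $count{right_only}++;
        }
    }
    close $rfh;

    # Pass 3: re-read the left file; keys never flagged are unique to it.
    open($lfh, '<', $left) or die "cannot open $left: $!";
    while (my $line = <$lfh>) {
        chomp $line;
        next unless length $line;
        my ($key) = split /\t/, $line, 2;
        next if $seen{$key};
        print {$lonly_fh} "$line\n";
        $count{left_only}++;
    }
    close $lfh;

    close $_ for $common_fh, $ronly_fh, $lonly_fh;
    return \%count;
}

# Runs only when two file names are given on the command line.
if (@ARGV == 2) {
    my $c = compare_by_first_column(@ARGV, 'out');
    printf "common: %d, left only: %d, right only: %d\n",
        @{$c}{qw(common left_only right_only)};
}
```

Saved under a name of your choice, say compare.pl, it could be run as: perl compare.pl left.txt right.txt. The extra pass over left.txt costs one more sequential read, which is usually cheaper than holding a million full lines in the hash.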