user3781528 user3781528 - 22 days ago 8
Linux Question

Perl/Linux filtering large file with content of another file

I'm filtering a 580 MB file using the content of another smaller file.
File1 (smaller file)

chr start End
1 123 150
2 245 320
2 450 600


File2 (large file)

chr pos RS ID A B C D E F
1 124 r2 3 s 4 s 2 s 2
1 165 r6 4 t 2 k 1 r 2
2 455 t2 4 2 4 t 3 w 3
3 234 r4 2 5 w 4 t 2 4


I would like to capture lines from the File2 if the following criteria is met.
File2.Chr == File1.Chr && File2.Pos > File1.Start && File2.Pos < File1.End

I’ve tried using awk but it runs very slow, also I was wondering if there is better way to accomplish the same?

Thank you.

Here is the code that I’m using:

#!/usr/bin/perl -w
use strict;
use warnings;

my $bed_file = "/data/1000G/Hotspots.bed";#File1 smaller file
my $SNP_file = "/data/1000G/SNP_file.txt";#File2 larger file
my $final_file = "/data/1000G/final_file.txt"; #final output file

open my $in_fh, '<', $bed_file
or die qq{Unable to open "$bed_file" for input: $!};

while ( <$in_fh> ) {

my $line_str = $_;

my @data = split(/\t/, $line_str);

next if /\b(?:track)\b/;# skip header line
my $chr = $data[0]; $chr =~ s/chr//g; print "chr is $chr\n";
my $start = $data[1]-1; print "start is $start\n";
my $end = $data[2]+1; print "end is $end\n";

my $cmd1 = "awk '{if(\$1==chr && \$2>$start && \$2</$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file"; print "cmd1\n";
my $cmd2 = `awk '{if(\$1==chr && \$2>$start && \$2</$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file`; print "cmd2\n";

}

Answer

Read the small file into a data structure and check every line of the other file against it.

Here I read it into an array, each element being an arrayref with fields from a line. Then each line of the data file is checked against refarrays in this array, comparing fields per requirements.

use warnings 'all';
use strict;

my $ref_file = 'reference.txt';
open my $fh, '<', $ref_file or die "Can't open $ref_file: $!";
my @ref = map { chomp; [ split ] } grep { /\S/ } <$fh>;

my $data_file = 'data.txt';
open $fh, '<', $data_file or die "Can't open $data_file: $!";

# Drop header lines
my $ref_header  = shift @ref;    
my $data_header = <$fh>;

while (<$fh>) 
{
    next if not /\S/;  # skip empty lines
    my @line = split;

    foreach my $refline (@ref) 
    {
        next if $line[0] != $refline->[0];
        if ($line[1] > $refline->[1] and $line[1] < $refline->[2]) {
            print "@line\n";
        }
    }   
}
close $fh;

This prints out correct lines from the provided samples. It allows for multiple lines to match. If this somehow can't be, add last in the if block to exit the foreach once a match is found.

A few comments on the code. Let me know if more can be useful.

When reading the reference file, <$fh> is used in list context so it returns all lines, and grep filters out the empty ones. The map first chomps the newline and then makes an arrayref by [ ], with elements being fields on the line obtained by split. The output list is assigned to @ref.

When we reuse $fh it is first closed (if it was open) so there is no need for a close.

I store the header lines just so, perhaps to print or check. We really only need to exclude them.