Rob Rob - 3 months ago 11
Perl Question

perl regex: multiple matches as variables

I am not interested in how to use a variable in a regex search. Instead, I am curious how I can turn multiple regex matches into variables.

I have a file that looks like this:

>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
Length=10658

Score = 33.7 bits (18), Expect = 0.19
Identities = 18/18 (100%), Gaps = 0/18 (0%)
Strand=Plus/Minus

Query 3 CTATTTAAACCTAATCGG 20
||||||||||||||||||
Sbjct 10604 CTATTTAAACCTAATCGG 10587


>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727
Length=4184

Score = 33.7 bits (18), Expect = 0.19
Identities = 18/18 (100%), Gaps = 0/18 (0%)
Strand=Plus/Plus

Query 3 CTATTTAAACCTAATCGG 20
||||||||||||||||||
Sbjct 85 CTATTTAAACCTAATCGG 102


My ultimate goal is to search this (very large) file and print only the lines that look like ">m160505_...", depending on the end position of the subject match (10587 and 102 in the example above). If the end position is within 500 of the value on the Length= line, or if it is itself less than 500, the >m... line gets printed. I realize this seems complicated, so looking at my code might help clarify things.
This is what my code looks like so far:

use strict;
use warnings;

my $file = '/path/to/file.txt';
my $data;
{
open my $fh, '<', $file or die;
local $/ = undef;
$data = <$fh>;
close $fh;
}
my @matches = $data =~ />(m.+)\nLength=([0-9]+)\n\n Score.+\n Iden.+\n Str.+\n\nQuery.+\n.+\nSbjct [0-9]+ [TAGC]+ ([0-9]+)/g;
foreach (@matches) {
print "$_\n";
}


This prints out something like the following:

>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
10658
10587
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727
4184
102


From here I need to turn the regex matches into variables (flexible variables). I would like to be able to use them in something like the following:

my $mVariable = "m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727";
my $firstnumber = 10685;
my $secondnumber = 10587;
if ($firstnumber - $secondnumber < 500 || $secondnumber < 500) {
print $mVariable, "\n";
}


Thanks for your help!
If I can clarify something please let me know.

Answer

It's wasteful and unnecessary to read an entire file into memory, all the more so when the file is very large.

My solution below sets the input record separator $/ to > so that the file is read one chunk at a time, each read ending at the next >. The variables that you describe are extracted from each chunk, and the rest of the loop body is skipped if any of them isn't found.
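To make the chunking concrete, here is a minimal sketch of what each read returns when $/ is ">". It reads from an in-memory string rather than a file, and split_records() is a hypothetical helper introduced only for this illustration. Unlike the solution's loop, it chomps each chunk so that the record boundaries are easy to see; the solution does not need to, because its patterns anchor at line starts.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into records the way <$fh> does
# when the input record separator is ">".
sub split_records {
    my ($text) = @_;

    # Open a read-only filehandle on the in-memory string.
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

    local $/ = '>';    # every read now ends at the next ">"
    my @records;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;                 # chomp strips the trailing ">" (the current $/)
        next unless length $chunk;    # the very first read is just ">", so it is empty here
        push @records, $chunk;
    }
    close $fh;
    return @records;
}

# Made-up two-record input in the same overall shape as the question's file.
my $sample  = ">id_one\nLength=10\n\n>id_two\nLength=20\n";
my @records = split_records($sample);

print scalar(@records), " records\n";
print "first record starts with: ", ( split /\n/, $records[0] )[0], "\n";
```

Each record begins with the ID line (without its leading >) and carries everything up to the next >, which is why one chunk holds both the Length= line and the Sbjct line for the same hit. The complete solution applies the same kind of loop to real input.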

This program expects the path to the input file as a parameter on the command line.

use strict;
use warnings 'all';
use feature 'say';

local $/ = ">";

while ( <> ) {

    next unless my ($m_variable) = / ^ ( m \d+ .+ ) /x;
    next unless my ($length)     = / ^ Length=(\d+) /xm;
    next unless my ($end_pos)    = / ^ Sbjct \b .*  \b (\d+) /xm;

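    # Print the ID when the end position is within 500 of either end
    if ( abs($length - $end_pos) < 500 or $end_pos < 500 ) {
        say $m_variable;
    }
}

output

m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727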
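If the data is already in one string, as in the question's code, the captures can still be bound to named variables by running the /g match in a while loop and copying $1, $2 and $3 on each pass. The following is only a sketch under some assumptions: select_ids() and $window are hypothetical names, the sample records are abbreviated, the third record is made up to show a non-match, and the pattern assumes no > appears between an ID line and its Sbjct line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the IDs of records whose subject end position
# is within $window of the sequence length, or is itself below $window.
sub select_ids {
    my ( $data, $window ) = @_;
    my @ids;

    # In scalar context each /g match resumes where the previous one ended,
    # so the loop body runs once per record.
    while (
        $data =~ m{
            ^ > (m\S+) .* \n     # capture 1: record ID, without the leading ">"
            Length = (\d+) \n    # capture 2: sequence length
            [^>]*?               # skip ahead, but never into the next record
            ^ Sbjct \s+ \d+ \s+ [ACGT]+ \s+ (\d+)    # capture 3: end position
        }gmx
      )
    {
        my ( $id, $length, $end_pos ) = ( $1, $2, $3 );
        push @ids, $id
          if abs( $length - $end_pos ) < $window || $end_pos < $window;
    }
    return @ids;
}

# Abbreviated sample: the first two records come from the question,
# the third is invented and should be filtered out.
my $sample = <<'END';
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|153096|3235_13893
Length=10658

 Score = 33.7 bits (18), Expect = 0.19

Sbjct 10604 CTATTTAAACCTAATCGG 10587

>m160505_031746_42156_c100980652550000001823221307061611_s1_p0|114630|20543_24727
Length=4184

 Score = 33.7 bits (18), Expect = 0.19

Sbjct 85 CTATTTAAACCTAATCGG 102

>m_made_up_record_for_illustration
Length=4184

Sbjct 2000 CTATTTAAACCTAATCGG 2017
END

print "$_\n" for select_ids( $sample, 500 );
```

This prints the first two IDs and skips the invented one. It keeps the whole input in memory, so for a very large file the chunk-at-a-time loop above remains the better fit.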