EA00 EA00 - 5 months ago 11
Perl Question

Perl - print lines in between two patterns

Need some help with my print command - I would like to print everything between lines @cluster t.# has ### elements (including this line) and @cluster t.#+1 has ### elements (preferably omitting this line) from my input file into corresponding numbered output files (clust(#).txt). The script thus far creates the appropriate numbered files, without any content.

Also, any commentary on how to simplify my loops would be greatly appreciated since I'm a beginner.

#!/usr/bin/perl

use strict;
use warnings;

open(IN,$ARGV[0]);

our $num = 0;

while(my $line = <IN>) {
if ($line =~ /^\@cluster t has (\d+) elements/) {
my $clust = "full";
open (OUT, ">clust$clust.txt");

} elsif ($line =~ m/^\@cluster t.(\d+.*) has (\d+) elements/) {
my $clust = $1;
$num++;
open (OUT, ">clust$clust.txt");
print OUT, $_ if (/$line/ ... /$line/);
}
}

Answer

The range operator is nearly tailor-made for this. It keeps track of its true/false state across repeated evaluations. It turns true once its left operand evaluates true and stays true until its right operand is true; it is still true for that evaluation and turns false on the next one. There is more to it, please see the docs (Range Operators in perlop).

Made-up input file data_range.txt

@cluster t.1 has 100 elements
@cluster t.2 has 200 elements
@cluster t.3 has 300 elements
@cluster t.4 has 400 elements
@cluster t.5 has 500 elements

Print everything between marker-lines 2 and 4, including the starting line but not the ending one.

use warnings;
use strict;

my $file = 'data_range.txt';
open my $fh, '<', $file  or die "Can't open $file: $!";

# Build the start and end patterns
my $beg = qr/^\@cluster t\.2 has 200 elements$/;
my $end = qr/^\@cluster t\.4 has 400 elements$/;

while (<$fh>) 
{
    if (/$beg/ .. /$end/) {
        print if not /$end/;
    }   
}

This prints lines 2 and 3 of the file. The .. operator turns true once the line ($_) matches $beg and stays true through the line that matches $end; it is false again from the next line on. So on its own it would include both the start and the end lines, which is why we also test for the end marker and skip printing that line.
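As a variation on the above, the second /$end/ test can be avoided by looking at the value the range operator returns in scalar context: an empty string outside the range, a sequence number inside it, with "E0" appended on the line that ends the range. The following is only a sketch under that documented behaviour; the in-memory data and the @kept array are made up for illustration.

```perl
use strict;
use warnings;

# Same made-up data as above, held in a string so the sketch is self-contained.
my $data = join '', map { "\@cluster t.$_ has ${_}00 elements\n" } 1 .. 5;
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my $beg = qr/^\@cluster t\.2 has 200 elements$/;
my $end = qr/^\@cluster t\.4 has 400 elements$/;

my @kept;    # lines inside the range, end marker excluded
while (<$fh>) {
    # In scalar context the operator returns '' outside the range and a
    # sequence number inside it; the last one carries an "E0" suffix.
    my $seq = (/$beg/ .. /$end/);
    push @kept, $_ if $seq and $seq !~ /E0$/;
}
print @kept;
```

Whether this reads better than testing /$end/ a second time is a matter of taste; the explicit test is arguably easier to follow for a beginner.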


If you are processing the line in the loop body anyway, and/or would rather not mess with escaping things in the regex but want to be able to use the marker lines literally (perhaps programmatically), you can use strings and test for equality.

my $beg = q(@cluster t.2 has 200 elements);
my $end = q(@cluster t.4 has 400 elements);

while (my $line = <$fh>) 
{
    chomp($line);
    if ($line eq $beg .. $line eq $end) {
        print "$line\n" if $line ne $end;
    }   
}

This works the same way as the example above. Note that now we have to chomp, since the trailing newline would foil the eq test (and then we add \n back for printing).
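If you would still rather match with a regex but build the markers from literal strings, one option is to let \Q...\E (quotemeta) do the escaping. This is a rough sketch, not the only way to do it; the variable names $beg_str and $end_str and the in-memory @lines are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical marker strings, e.g. read from a config or built in a loop.
my $beg_str = '@cluster t.2 has 200 elements';
my $end_str = '@cluster t.4 has 400 elements';

# \Q...\E quotes regex metacharacters, so '.' matches only a literal dot.
my $beg = qr/^\Q$beg_str\E$/;
my $end = qr/^\Q$end_str\E$/;

# Made-up input lines standing in for the file.
my @lines = map { "\@cluster t.$_ has ${_}00 elements\n" } 1 .. 5;

my @kept;
for (@lines) {
    if (/$beg/ .. /$end/) {
        push @kept, $_ unless /$end/;    # drop the end marker itself
    }
}
print @kept;
```

Since the patterns end with $ there is no need to chomp here: $ also matches just before a trailing newline.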


Now the specific problem, as I currently understand the input. File data_range.txt

@cluster t.1 has 100 elements
data 1
data 1 1
@cluster t.2 has 200 elements
data 2
@cluster t.3 has 300 elements

Print t.# and the lines following up to the next t.#, to a file clust(#).txt.

use warnings;
use strict;

my $file = 'data_range.txt';
open my $fh, '<', $file  or die "Can't open $file: $!";
my $fh_out;

my $clustline = qr/^\@cluster t\.(\d+) has \d+ elements/;
while (<$fh>) 
{
    if (/$clustline/) {
        my $fout = "clust($1).txt";
        open $fh_out, '>', $fout or die "Can't open $fout for writing: $!";
        print $fh_out $_;
    }
    else { print $fh_out $_ }
}

For each line with @cluster a new file with the corresponding number is opened, closing the previous one since we use the same filehandle, and that line is printed to it. All following lines belong to that file and they are printed there. This assumes that the first line in the file is a @cluster... line, otherwise we'd need a flag for when to start. If there are some other kinds of lines, which don't belong to any of these files, introduce an elsif and match them there. Once that's understood we can really just write

my $clustline = qr/^\@cluster t\.(\d+) has \d+ elements/;
while (<$fh>) 
{
    if (/$clustline/) {
        my $fout = "clust($1).txt";
        open $fh_out, '>', $fout or die "Can't open $fout for writing: $!";
    }
    print $fh_out $_;
}

With either version we get the following files. The file clust(1).txt is

@cluster t.1 has 100 elements
data 1
data 1 1

while clust(2).txt has the t.2 line and data 2 line, and clust(3).txt has the t.3 line.
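If the input may have lines before the first @cluster line, one way to handle it without a separate flag is to use the output filehandle itself as the flag: skip lines until it has been opened. A minimal sketch, assuming made-up input with one stray leading line, and writing into a temporary directory (File::Temp) only so that the example cleans up after itself:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Made-up input with a stray line before the first header (an assumption).
my $data = <<'END';
stray line before any cluster
@cluster t.1 has 100 elements
data 1
data 1 1
@cluster t.2 has 200 elements
data 2
END
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

# Write into a scratch directory so the sketch leaves nothing behind.
my $dir = tempdir(CLEANUP => 1);

my $clustline = qr/^\@cluster t\.(\d+) has \d+ elements/;
my $fh_out;
while (<$fh>) {
    if (/$clustline/) {
        my $fout = "$dir/clust($1).txt";
        # Re-opening the same handle closes the previous file first.
        open $fh_out, '>', $fout or die "Can't open $fout for writing: $!";
    }
    next unless defined $fh_out;    # nothing opened yet: skip the line
    print $fh_out $_;
}
close $fh_out if defined $fh_out;
```

In a real script you would drop the temporary directory and write clust($1).txt wherever the output files should go.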