perlbeginner - 4 months ago
Perl Question

How to split a large file into individual record files on a repeating pattern in Perl?



I have a multi-GB file that is a concatenation of thousands of individual records, each identified by an ID.

Each record consists of four comment lines followed by its contents. The second comment line of each record holds a unique ID. I would like to split the file into individual files named by their ID.

There is a second file, the size list, which holds each ID and its size. I want the matching line from it written as the very first line of each output file.

Examples

size list



A_1 100
Bxx_xx 25
P_b 342
1A_Z0 343
Z867 200
BWS 111


input file



# ver XX
# Query: A_1
# Database: XX
# Usage: XX
A_1 .*
A_1 .*
A_1 .*
A_1 .*
A_1 .*
# ver
# Query: Bxx_xx
# Database: XXXXXX
# Usage: XXXXX
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
# ver
# Query: P_b
# Database: XXXXXX
# Usage: XXXXX
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
# ver
# Query: 1A_Z0
# Database: XXXXXX
# Usage: XXXXX
1A_Z0.*
1A_Z0.*
1A_Z0.*
1A_Z0.*
# ver
# Query: Z867
# Database: XXXXXX
# Usage: XXXXX
# ver
# Query: BWS
# Database: XXXXXX
# Usage: XXXXX
BWS.*
BWS.*
BWS.*


Output should be like this, (ID.txt)

A_1.txt



A_1 100
# ver XX
# Query: A_1
# Database: XX
# Usage: XX
A_1 .*
A_1 .*
A_1 .*
A_1 .*
A_1 .*


Bxx_xx.txt



Bxx_xx 25
# ver
# Query: Bxx_xx
# Database: XXXXXX
# Usage: XXXXX
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*


P_b.txt



P_b 342
# ver
# Query: P_b
# Database: XXXXXX
# Usage: XXXXX
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*


1A_Z0.txt



1A_Z0 343
# ver
# Query: 1A_Z0
# Database: XXXXXX
# Usage: XXXXX
1A_Z0.*
1A_Z0.*
1A_Z0.*
1A_Z0.*


Z867.txt



Z867 200
# ver
# Query: Z867
# Database: XXXXXX
# Usage: XXXXX


BWS.txt



BWS 111
# ver
# Query: BWS
# Database: XXXXXX
# Usage: XXXXX
BWS.*
BWS.*
BWS.*


In some cases there may be no contents after the four comment lines. For example:

# ver
# Query: Z867
# Database: XXXXXX
# Usage: XXXXX


I still want a new file created for such records, in this case Z867.txt.


My code is as follows

while ( $line = <BOF> ) {

    chomp $line;
    $cpline = $line;

    next if ( $cpline =~ /^Query/ );

    if ( $cpline =~ /^#\sQuery\:\s(\w.*)/ ) {

        $query = $1;

        foreach $sizeLine (@sizeList) {

            $sizeLine =~ /^(\w.*)\t(\d+)$/;
            $seqId = $1;
            $seqLen = $2;

            if ( $seqId eq $query ) {
                print "Query\t$seqLen\n";
            }
        }
    }

    $cpline = "";

    if ( $line =~ /^#/ ) {
        print "$line\n";
    }

    if ( $line !~ /^#/ ) {

        if ( $line =~ /^((.+)\_.+)\t((.+)\_.+)\t(.+)\t(.+)\t.+\t.+\t.+\t.+\t.+\t.+\t.+\t\s?.+$/ ) {

            $queryId = $1;

            if ( $seqId eq $queryId ) {
                print "$line\n";
            }
        }
    }
}

Answer

I am confused about what you are asking, as your Perl code seems to do something very different from what your question describes. However, here's a simple solution that opens a new file for every # Query: comment line and generates the output that you say you want.

This program expects the path to the input file as a parameter on the command line.

use strict;
use warnings 'all';
use autodie;

my $out_fh;
my @header;

while ( <> ) {

    if ( /^#/ ) {

        push @header, $_;

        if ( /Query:\s*(\S+)/ ) {
            my $file = "$1.txt";
            print qq{Creating "$file"\n};
            open $out_fh, '>', $file;
        }

        if ( @header == 4 ) {
            print $out_fh @header;
            @header = ();
        }
    }
    else {
        print $out_fh $_;
    }
}

# Nothing to close if the input contained no "# Query:" line
close $out_fh if $out_fh;

output

Creating "A_1.txt"
Creating "Bxx_xx.txt"
Creating "P_b.txt"
Creating "1A_Z0.txt"
Creating "Z867.txt"
Creating "BWS.txt"
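
The header-buffering logic in the program above can be tried out without writing any files. The following is only a minimal sketch under that assumption: `split_records` is a hypothetical helper, not part of the original answer, that collects each record's lines in memory instead of printing them to `$out_fh`. It shows that a record with no data lines, such as Z867, still ends up with its four header lines.

```perl
use strict;
use warnings;

# Hypothetical helper: group lines into records keyed by the ID on the
# "# Query:" line. Returns a list of [ $id, \@lines ] pairs in input order.
sub split_records {
    my @lines = @_;
    my ( @records, @header );

    for my $line (@lines) {
        if ( $line =~ /^#/ ) {
            push @header, $line;

            # A "# Query:" line starts a new record
            if ( $line =~ /Query:\s*(\S+)/ ) {
                push @records, [ $1, [] ];
            }

            # Flush the buffered header once all four lines have been seen
            if ( @header == 4 ) {
                push @{ $records[-1][1] }, @header if @records;
                @header = ();
            }
        }
        elsif (@records) {
            # Data lines belong to the most recent record
            push @{ $records[-1][1] }, $line;
        }
    }

    return @records;
}

my @records = split_records(
    "# ver\n", "# Query: Z867\n", "# Database: XXXXXX\n", "# Usage: XXXXX\n",
    "# ver\n", "# Query: BWS\n",  "# Database: XXXXXX\n", "# Usage: XXXXX\n",
    "BWS.*\n",
);

printf "%s: %d lines\n", $_->[0], scalar @{ $_->[1] } for @records;
```

Run on the two sample records, this prints "Z867: 4 lines" and "BWS: 5 lines". Unlike the program above, the sketch silently drops data lines that appear before the first "# Query:" line, which is a design choice of the sketch rather than something the original specifies.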



Update

Here's a new version of my code that complies with your revised specification. (Please don't do that.)

use strict;
use warnings 'all';
use autodie;

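# Fall back to the sample file names when no arguments are given
@ARGV = qw/ 4l.txt size_list.txt / unless @ARGV;

my ( $input, $size_list ) = @ARGV;

# Leave only the input file in @ARGV so that <> does not also read the size list
@ARGV = ($input);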

my %sizes;
{
    open my $fh, '<', $size_list;
    while ( <$fh> ) {
        my ($file, $size) = split;
        $sizes{$file} = $size if defined $size;
    }
}


my $out_fh;
my @header;

while ( <> ) {

    if ( /^#/ ) {

        push @header, $_;

        if ( /Query:\s*(\S+)/ ) {

            my $id = $1;
            my $size = $sizes{$id};
            die qq{No size found for ID "$id"} unless defined $size;
            my $file = "$id.txt";

            print qq{Creating "$file"\n};

            open $out_fh, '>', $file;
            print $out_fh "$id\t$size\n";
        }

        if ( @header == 4 ) {
            print $out_fh @header;
            @header = ();
        }
    }
    else {
        print $out_fh $_;
    }
}

close $out_fh if $out_fh;
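
The size lookup can also be checked on its own. This is a sketch only: `load_sizes` and `size_line` are hypothetical helper names that do not appear in the code above, and an in-memory filehandle stands in for the size list file. It builds the same kind of hash as `%sizes` and returns undef for an unknown ID, leaving it to the caller to decide whether to die as the program above does.

```perl
use strict;
use warnings;

# Hypothetical helper: read "ID size" pairs from an open filehandle into a hash.
sub load_sizes {
    my ($fh) = @_;
    my %sizes;
    while ( my $line = <$fh> ) {
        my ( $id, $size ) = split ' ', $line;
        $sizes{$id} = $size if defined $size;    # skips blank or one-field lines
    }
    return %sizes;
}

# Hypothetical helper: build the first line of an output file,
# or return undef if the ID is not in the size list.
sub size_line {
    my ( $sizes, $id ) = @_;
    return defined $sizes->{$id} ? "$id\t$sizes->{$id}\n" : undef;
}

# An in-memory filehandle stands in for the size list file
my $list = "A_1 100\nBxx_xx\t25\n\nZ867 200\n";
open my $fh, '<', \$list or die "Cannot open in-memory size list: $!";
my %sizes = load_sizes($fh);
close $fh;

print size_line( \%sizes, 'Z867' );
print "no size for P_b\n" unless defined size_line( \%sizes, 'P_b' );
```

Because `split ' '` splits on any run of whitespace, the sketch accepts both space-separated and tab-separated size lists.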