Rob Rob - 8 months ago 27
Perl Question

Parsing file based on column ID: perl

I have a tab-delimited file with repeated values in the first column. Each repeated value in the first column corresponds to multiple values in the second column. It looks something like this:

AAAAAAAAAA1 m081216|101|123
AAAAAAAAAA1 m081216|100|1987
AAAAAAAAAA1 m081216|927|463729
BBBBBBBBBB2 m081216|254|260489
BBBBBBBBBB2 m081216|475|1234
BBBBBBBBBB2 m081216|987|240
CCCCCCCCCC3 m081216|433|1000
CCCCCCCCCC3 m081216|902|366
CCCCCCCCCC3 m081216|724|193

For every distinct sequence in the first column, I am trying to write a file containing just the second-column sequences that correspond to it. The file name should include the repeated sequence from the first column and the number of second-column sequences that correspond to it. In the above example I would therefore have 3 files of 3 sequences each. The first file would be named something like "AAAAAAAAAA1.3.txt" and look like the following when opened:

m081216|101|123
m081216|100|1987
m081216|927|463729

I have seen other similar questions, but they were answered using a hash. I don't think I can use a hash because I need to keep the number of relationships between the columns. Maybe there is a way to use a hash of hashes? I am not sure.
Here is my code so far.

use warnings;
use strict;
use List::MoreUtils 'true';

open(IN, "<", "/path/to/in_file") or die $!;

my @array;
my $queryID;

my $OutputLine = $_;

sub processOutputLine {
my ($OutputLine) = @_;
my @Columns = split("\t", $OutputLine);
my ($queryID, $target) = @Columns;
push(@array, $target, "\n") unless grep{$queryID eq $_} @array;
my $delineator = "\n";
my $count = true { /$delineator/g } @array;
open(OUT, ">", "/path/to/out_$..$queryID.$count.txt") or die $!;
print OUT @array;
}


I would still recommend a hash. The difference is that you store all sequences related to the same ID in an anonymous array, which is the value for that ID key. It's really two lines of code.

use warnings;
use strict;
use feature qw(say);

my $filename = 'rep_seqs.txt';   # input file name
open my $in_fh, '<', $filename or die "Can't open $filename: $!";

my %seqs;
foreach my $line (<$in_fh>) {
    chomp $line;
    my ($id, $seq) = split /\t/, $line;
    push @{$seqs{$id}}, $seq;
}
close $in_fh;

my $out_fh;
for (sort keys %seqs) {
    my $outfile = $_ . '_' . scalar @{$seqs{$_}} . '.txt';
    open $out_fh, '>', $outfile  or do {
        warn "Can't open $outfile: $!";
        next;
    };
    say $out_fh $_ for @{$seqs{$_}};
}
close $out_fh;

With your input I get the desired files, named AA..._count.txt, each with its corresponding three lines. If the items separated by | should be split further, you can do that while writing them out.
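
Splitting the |-separated items at write time could look roughly like the sketch below. It assumes you want the fields re-joined with tabs; the helper name format_seq and that output layout are illustrative choices, not something the question specifies.

```perl
use strict;
use warnings;

# Hypothetical helper: split one "a|b|c" entry into its fields and
# re-join them with tabs. The pipe is escaped in the pattern because a
# bare | means alternation in a regex.
sub format_seq {
    my ($seq) = @_;
    my @fields = split /\|/, $seq;
    return join "\t", @fields;
}

# Same shape as %seqs above (ID => array reference), with sample data.
my %seqs = ( AAAAAAAAAA1 => [ 'm081216|101|123', 'm081216|100|1987' ] );

for my $id (sort keys %seqs) {
    print "$id: ", format_seq($_), "\n" for @{ $seqs{$id} };
}
```

In the loop that writes the files, the same call would replace the bare $_ in the say statement.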


  • The anonymous array for a key, $seqs{$id}, is created automatically (autovivified) by the first push if it is not there already

  • If the tabs cause problems (for example, if they were converted to spaces), split on /\s+/ instead of /\t/

  • Calling open on a filehandle that is already open closes it first and then re-opens it, so there is no need to close it explicitly every time
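
The first two notes can be tried out in isolation with a small sketch like the one below. The helper name group_by_id and the sample lines are made up for illustration; it splits on /\s+/, so it assumes neither column contains whitespace of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: group second-column values by first-column ID.
sub group_by_id {
    my @lines = @_;
    my %seqs;
    for my $line (@lines) {
        # /\s+/ tolerates tabs that were converted to one or more spaces.
        my ($id, $seq) = split /\s+/, $line;
        next unless defined $seq;    # skip blank or malformed lines
        # $seqs{$id} is never initialised explicitly: the array
        # reference is autovivified by the first push for that key.
        push @{ $seqs{$id} }, $seq;
    }
    return \%seqs;
}

my $seqs = group_by_id(
    "AAAAAAAAAA1\tm081216|101|123",
    "AAAAAAAAAA1   m081216|100|1987",    # spaces instead of a tab
    "BBBBBBBBBB2\tm081216|254|260489",
);

printf "%s has %d sequence(s)\n", $_, scalar @{ $seqs->{$_} }
    for sort keys %$seqs;
```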