user3781528 - 1 month ago
Perl Question

How to combine data in an array with duplicate identifier in the first column without removing duplicates in other columns

I populated an array with tab-delimited data. I would like to merge rows that share the same ID in the first column without removing duplicate values in the other columns.

Here is what the lines of @fusions array look like before I run the code:

[image: lines of @fusions before merging]

Desired output:

[image: desired merged output]

I’ve tried using a hash, but it removes duplicates in all the columns, and I need to remove duplicates in the first column only.
Here is the code I’ve adapted, which uses a hash:

foreach (@fusions) {
    chomp($_);
    my ($key, @items) = split /\t/;
    $fusion_hash{$key}{$_}++ for @items;
}

#print join("\t", $_, sort keys %{$fusion_hash{$_}}), "\n" for sort keys %fusion_hash;


Please suggest how to change the code so it merges the data and doesn't remove duplicates in other columns. Thanks

Answer

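You have to store the items in an array per key rather than in a nested hash: hash keys are unique, so repeated values collapse, whereas an array keeps every item in input order.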

use strict;
use warnings;

my %fusion_hash;
my @fusions = split("\n", <<EOT);
SLC34A2-ROS1.S4R32.COSF1197 chr4 25665952 PASS 56812 SLC34A2 4 COSF1197
SLC34A2-ROS1.S4R32.COSF1197 chr6 117650609 PASS 56812 ROS1 32 COSF1197
SLC34A2-ROS1.S4R34.COSF1198 chr4 25665952 PASS 3367 SLC34A2 4 COSF1198
SLC34A2-ROS1.S4R34.COSF1198 chr6 117645578 PASS 3367 ROS1 34 COSF1198
EOT

foreach (@fusions) {
    chomp;
    # The first field is the id; the remaining fields are the data columns.
    # Use split /\t/ instead if empty tab-separated fields must be kept.
    my ($key, @items) = split ' ';
    $fusion_hash{$key} = [] unless defined $fusion_hash{$key} ;
    # Append to the array so repeated values are kept, in input order.
    push @{$fusion_hash{$key}}, @items;
}

print join("\t", $_, @{$fusion_hash{$_}}), "\n" for sort keys %fusion_hash;

should do the job.
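
To see why the nested hash loses values while the array keeps them, the two approaches can be compared side by side. This is only a minimal sketch: the subroutine names merge_with_hash and merge_with_array are made up for illustration, and the two rows are taken from the sample data above and split on whitespace.

```perl
use strict;
use warnings;

# Two rows sharing the same id; "PASS", "56812" and "COSF1197" occur in both.
my @rows = (
    "SLC34A2-ROS1.S4R32.COSF1197 chr4 25665952 PASS 56812 SLC34A2 4 COSF1197",
    "SLC34A2-ROS1.S4R32.COSF1197 chr6 117650609 PASS 56812 ROS1 32 COSF1197",
);

# Hash of hashes: items become hash keys, so repeated values collapse
# and the original column order is lost.
sub merge_with_hash {
    my %merged;
    for my $row (@_) {
        my ($key, @items) = split ' ', $row;
        $merged{$key}{$_}++ for @items;
    }
    return \%merged;
}

# Hash of arrays: items are appended in order, duplicates included.
sub merge_with_array {
    my %merged;
    for my $row (@_) {
        my ($key, @items) = split ' ', $row;
        push @{ $merged{$key} }, @items;
    }
    return \%merged;
}

my $id       = "SLC34A2-ROS1.S4R32.COSF1197";
my $by_hash  = merge_with_hash(@rows);
my $by_array = merge_with_array(@rows);

print "hash:  ", scalar(keys %{ $by_hash->{$id} }), " items\n";   # 11
print "array: ", scalar(@{ $by_array->{$id} }), " items\n";       # 14
print join("\t", $id, @{ $by_array->{$id} }), "\n";
```

The hash version ends up with 11 distinct items for this id, while the array version keeps all 14 in the order they were read.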

HTH Georg

BTW: you can omit the line

     $fusion_hash{$key} = [] unless defined $fusion_hash{$key} ;

because Perl creates the array reference automatically (autovivification) the first time you push onto it.
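
The automatic behaviour mentioned above is usually called autovivification: dereferencing an undefined hash element as an array makes Perl create the anonymous array on the spot. A minimal sketch, where the key some_id is a made-up placeholder:

```perl
use strict;
use warnings;

my %fusion_hash;

# No entry exists yet for this key.
print exists($fusion_hash{some_id}) ? "exists\n" : "missing\n";   # missing

# Dereferencing the undefined element as an array creates [] automatically,
# so no explicit initialisation is needed before the push.
push @{ $fusion_hash{some_id} }, 'chr4', '25665952';

print ref($fusion_hash{some_id}), "\n";            # ARRAY
print scalar(@{ $fusion_hash{some_id} }), "\n";    # 2
```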