perlbeginner perlbeginner - 4 months ago 12
Perl Question

perl remove identical id and similar id, maintain the array order and print them like matrix

my first file have some keys. Example,

Eur1
Eur2
Eur3
Eur4
Eur5


the other file is the actual input. separated by ":" (headers are on the left side. right side is "|" delimited matching ids) it will look like,

Eur2_1116: Eur1_9816|Eur1_916|Eur3_16|Eur4_17|Eur5_716
Eur3_2980: Eur1_8568|Eur2_98|Eur2_92|Eur4_12.3|Eur4_12.1|Eur4_12.9
Eur4_786: Eur2_0909|Eur5_1090|Eur5_2535|Eur1_233
Eur1_10616:
Eur5_5634: Eur3_1868|Eur4_8634|Eur1_35923
Eur1_34312:
Eur3_0000: Eur5_8235
Eur3_3433:
Eur5_3354: Eur2_245|Eur2_245|Eur1_34
Eur3_1122: Eur4_4431


in my definition,

identical ids are identical. example (Eur2_245, Eur2_245 at line 9)

similar ids are derived from the same key with different value. example (Eur1_9816 and Eur1_916 at line1) or (Eur4_12.3,Eur4_12.1 and Eur4_12.9 at line 2).

I want to remove the identical ids. Also, the similar ids after first match. Example, Eur1_9816 should go the output line 2. or Eur4_12.3 for line 2. Because, they are the first similar id among other similar ids.

my desired output should be like matrix, having the headers in the first column exactly like the input headers. Followed by the ordered keys in the consecutive columns. Of note, headers are also belongs to one of the ordered keys. So header itself become a matching id at a respective column.

Example, line 1, Eur2_1116 is the header.
It will be in the line 1 as header and ALSO it will be part of "Eur2" ordered key column.

And similarly all other respective matches (after excluding identical, similar matches) should go to the respective columns.

Desired output #1

Column1 Eur1 Eur2 Eur3 Eur4 Eur5

Eur2_1116 Eur1_9816 Eur2_1116 Eur3_16 Eur4_17 Eur5_716
Eur3_2980 Eur1_8568 Eur2_98 Eur3_2980 Eur4_12.3
Eur4_786 Eur1_233 Eur2_0909 Eur4_786 Eur5_1090
Eur1_10616 Eur1_10616
Eur5_5634 Eur1_35923 Eur3_1868 Eur4_8634 Eur5_5634
Eur1_34312 Eur1_34312
Eur3_0000 Eur3_0000 Eur5_8235
Eur3_3433 Eur3_3433
Eur5_3354 Eur1_34 Eur2_245 Eur5_3354
Eur3_1122 Eur3_1122 Eur4_4431


my actual code is, (after =for commented lines, these code block is wrong). could you please correct it. If possible, to track back the identical and similar matches, can i write it in a separate file?

use strict;
use warnings 'all';

my $ipArr = $ARGV[0]; chomp $ipArr; #setArr.txt
open (IN,"$ipArr") or die "Unalbe to open the file:$!\n";

my $orderArr = $ARGV[1]; chomp $orderArr; #orderArr.txt
open(OA,"$orderArr") or die "Unable to open the order array:$!\n";
my @orderKeys = <OA>; chomp @orderKeys; close OA;

while (my $line = <IN>)
{
chomp ($line);
my ($header,$matchSet) = "";
my (@tmpSet, @setArr1, @setArr2) = ();
($header,$matchSet) = split("\:",$line);
$matchSet =~ s/\s+//g;
@tmpSet = split (/\|/, $matchSet);
push(@setArr1,$header,@tmpSet);
sub uniq {
my % seen;
grep !$seen{$_}++, @_;
}
@setArr2 = uniq(@setArr1);
print join "\t", @setArr2,"\n";
}

=for
foreach my $g (@setArr2)
{
my ($k,$val) = "";
$g =~ /^(\w.*)\_(\w.*)$/;
$k = $1; $val = $2;
$hash{$k} = {$val};
}
}
foreach $S (@orderKeys)
{
if($setArr[$i] =~ /^$S\_.*$/)
{
print "$S\t";
}
}
}

print "\n";
}
close IN;

Answer

Code:

use strict;
use warnings;
open FH,"tmp1" or die "Error\n";#orderArr.txt
my %myhash;
my $count++;
my @firstarray = ("Eur1","Eur2", "Eur3", "Eur4", "Eur5");
while (<FH>){
        chomp($_);
        my @val = split(/:/,$_);
        $val[1] =~ s/\s//;
        if($val[1] ne "")
        {
                my @arr = split(/\|/,$val[1]);
                foreach my $id (@arr)
                {
                        my @val1 = split(/_/,$id);
                        push(@{$myhash{$count}{$val[0]}{$val1[0]}},$id);
                }
        }
        my @val1 = split(/_/,$val[0]);
        push(@{$myhash{$count}{$val[0]}{$val1[0]}},$val[0]);

        $count++;
        #print "$val[0]\n";

}
foreach my $count (sort { $a <=>  $b} keys %myhash)
{
        foreach my $id (keys %{$myhash{$count}})
        {
                print "$id";
                foreach my $idlist (@firstarray)
                #foreach my $idlist(keys %{$myhash{$count}{$id}})
                {
                        #print " t $idlist t ";
                        if(exists $myhash{$count}{$id}{$idlist})
                        {
                                my $value =$myhash{$count}{$id}{$idlist}[0];
                                print "\t$value";
                        }
                        else
                        {
                                print "\t";
                        }
                }
                print "\n";

        }
}

Output :

Eur2_1116       Eur1_9816       Eur2_1116       Eur3_16 Eur4_17 Eur5_716
Eur3_2980       Eur1_8568       Eur2_98 Eur3_2980       Eur4_12.3
Eur4_786        Eur1_233        Eur2_0909               Eur4_786        Eur5_1090
Eur1_10616      Eur1_10616
Eur5_5634       Eur1_35923              Eur3_1868       Eur4_8634       Eur5_5634
Eur1_34312      Eur1_34312
Eur3_0000                       Eur3_0000               Eur5_8235
Eur3_3433                       Eur3_3433
Eur5_3354       Eur1_34 Eur2_245                        Eur5_3354
Eur3_1122                       Eur3_1122       Eur4_4431