user3781528 - 9 months ago
Perl Question

Sort array and remove duplicates in specific columns in Perl



I would like to remove duplicate rows based on col 0 of an array, so that only the row with the max value in col 1 remains for each key. The data is tab-delimited and has 16 columns.

sample1_EGFR_19 53 exon19 ...
sample1_EGFR_19 12 exon20 ...
sample2_EGFR_19 20 exon19 ...
sample3_EGFR_20 65 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample1_EGFR_20 12 exon20 ...
sample3_EGFR_20 125 exon20 ...


Desired output:

sample1_EGFR_19 53 exon19 ...
sample1_EGFR_20 12 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample3_EGFR_20 125 exon20 ...


I started with tab-delimited text files that I split and used to populate an array. Then I use a hash and sort by its keys. In the final output the data is sorted correctly, but the duplicates are not removed. How do I remove the lines that now have a blank first column? Thanks

sample1_EGFR_19 53 exon19 ...
12 exon20 ...
sample2_EGFR_19 25 exon12 ...
20 exon19 ...
sample3_EGFR_20 125 exon20 ...
65 exon20 ...
sample1_EGFR_20 12 exon20 ...


Please suggest a straightforward method to accomplish this. Thanks

Here is the code:

#!/usr/bin/perl

use strict;
use warnings;

my $filename        = "/data/Test/output.txt";
my $output_filename = "/data/Test/output_changed.txt";

# read all lines of the input file into an array
open( my $txt_in, '<', $filename ) or die "Cannot open $filename: $!";
my @resultarray = <$txt_in>;
close($txt_in);

# group the rows in a hash keyed by the first column; every further
# occurrence of a key pushes its remaining columns onto the same entry
my %result_hash;
foreach (@resultarray) {
    chomp;
    my ( $key, $val ) = split /\t/, $_, 2;
    push @{ $result_hash{$key} }, $val;
}

# sort by key and rebuild the lines
my @final_array;
foreach ( sort keys %result_hash ) {
    push @final_array, join "\t", $_, @{ $result_hash{$_} };
}

# append the sorted lines to the output file
open( my $myfile, '>>', $output_filename ) or die "Cannot open $output_filename: $!";
print $myfile "$_\n" for @final_array;
close($myfile);

Answer by mwp

This is a fairly straightforward UNIX one-liner. Why the requirement to write it in Perl?

$ sort -k1,1 -k2,2rn /data/Test/output.txt | awk '!seen[$1]++' | tee /data/Test/output_changed.txt
sample1_EGFR_19 53  exon19  ...
sample1_EGFR_20 12  exon20  ...
sample2_EGFR_19 25  exon12  ...
sample3_EGFR_20 125 exon20  ...

This sorts by the first column ascending, then by the second column descending and numeric, and then uses awk to select the first line of each group, i.e. the one with the largest second-column value. If that awk statement is too confusing, it has the same function as awk 'x != $1 { print; x = $1 }'. (tee writes the lines to the file and also displays them on the terminal.)
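For instance, on the sample data the sort stage alone should produce something like the following before awk filters it; the first line of each group is the one that survives:

sample1_EGFR_19 53  exon19  ...
sample1_EGFR_19 12  exon20  ...
sample1_EGFR_20 12  exon20  ...
sample2_EGFR_19 25  exon12  ...
sample2_EGFR_19 20  exon19  ...
sample3_EGFR_20 125 exon20  ...
sample3_EGFR_20 65  exon20  ...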

If you really must use Perl, here's a simple solution to the described problem:

#!/usr/bin/perl

use strict;
use warnings;

sub sort_func {
  # sort by the first col asc and then by the second col desc and numeric
  $a->[0] cmp $b->[0] || $b->[1] <=> $a->[1]
}

my %seen;
print
  map join("\t", @$_),     # re-join the fields with tabs into the original line
  grep !$seen{$_->[0]}++,  # select the first line of each sorted group
  sort sort_func           # sort lines using the above sort function
  map [split /\t/, $_, 3], # split by tabs so we can sort by the first two fields
  <>;                      # read lines from stdin or the filename given by ARGV[0]

Mark the file executable and use it like so:

./sortlines.pl /data/Test/output.txt >/data/Test/output_changed.txt
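If you would rather avoid sorting the whole file just to drop duplicates, a single pass with a hash that remembers, per first-column key, the row with the largest second-column value would also meet the stated requirement. This is an alternative sketch, not part of the original answer; the script name keep_max.pl is made up, and it assumes the same tab-delimited layout:

#!/usr/bin/perl

use strict;
use warnings;

# alternative sketch (assumed layout: tab-delimited, key in col 0,
# numeric value in col 1): remember the best row per key, no sort pass
my %best;
while ( my $line = <> ) {
    chomp $line;
    my @fields = split /\t/, $line, 3;    # col 0, col 1, rest of the row
    if ( !exists $best{ $fields[0] } || $fields[1] > $best{ $fields[0] }[1] ) {
        $best{ $fields[0] } = \@fields;   # keep the row with the max col 1
    }
}

# print the surviving rows, ordered by the first column
print join( "\t", @{ $best{$_} } ), "\n" for sort keys %best;

Usage is the same as above: ./keep_max.pl /data/Test/output.txt >/data/Test/output_changed.txt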