I would like to remove duplicate rows based on column 0 of an array, so that for each key only the row with the maximum value in column 1 remains. The data is tab-delimited and has 16 columns. Sample input:
sample1_EGFR_19 53 exon19 ...
sample1_EGFR_19 12 exon20 ...
sample2_EGFR_19 20 exon19 ...
sample3_EGFR_20 65 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample1_EGFR_20 12 exon20 ...
sample3_EGFR_20 125 exon20 ...
The desired output is:
sample1_EGFR_19 53 exon19 ...
sample1_EGFR_20 12 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample3_EGFR_20 125 exon20 ...
But my script below currently merges the duplicate rows into one line per key instead of keeping only the maximum:
sample1_EGFR_19 53 exon19 ... 12 exon20 ...
sample2_EGFR_19 25 exon12 ... 20 exon19 ...
sample3_EGFR_20 125 exon20 ... 65 exon20 ...
sample1_EGFR_20 12 exon20 ...
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use List::Util 'first';
use Data::Dumper;

my $filename        = "/data/Test/output.txt";
my $output_filename = "/data/Test/output_changed.txt";
my @resultarray;
my %result_hash;
my @final_array;

open( TXT2, $filename ) or die "Cannot open $filename: $!";
while ( <TXT2> ) {
    push( @resultarray, $_ );
}
close( TXT2 );

foreach ( @resultarray ) {
    chomp( $_ );
    my ( $key, $val ) = split /\t/, $_, 2;
    push @{ $result_hash{$key} }, $val;
}

foreach ( sort keys %result_hash ) {
    push( @final_array, $_ . "\t" . join "\t", @{ $result_hash{$_} } );
}
undef %result_hash;

foreach ( @final_array ) {
    print $_, "\n";
}

open( MYFILE, ">>$output_filename" ) or die "Cannot open $output_filename: $!"; ## opens the output file and appends the lines.
foreach my $line ( @final_array ) {
    print MYFILE $line . "\n";
}
close( MYFILE );
This is a fairly straightforward UNIX one-liner. Why the requirement to write it in Perl?
$ sort -k1,1 -k2,2rn /data/Test/output.txt | awk '!seen[$1]++' | tee /data/Test/output_changed.txt
sample1_EGFR_19 53 exon19 ...
sample1_EGFR_20 12 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample3_EGFR_20 125 exon20 ...
This sorts by the first column ascending and by the second column descending and numeric, then uses awk '!seen[$1]++' to select the first line of each group. If that awk statement is too confusing, it has the same function as awk 'x != $1 { print; x = $1 }'. (tee writes the lines to the file and also displays them on the terminal.)
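To see the pipeline in action without touching real files, here is a minimal sketch that feeds made-up inline rows (the printf data is invented for illustration) through the same sort | awk chain:

```shell
# Build a small tab-delimited sample on the fly and run the pipeline:
# sort groups rows by key (col 1 asc) with the largest col-2 value first,
# then awk keeps only the first row it sees for each key.
printf 'sample1_EGFR_19\t53\texon19\nsample1_EGFR_19\t12\texon20\nsample2_EGFR_19\t20\texon19\nsample2_EGFR_19\t25\texon12\n' \
    | sort -k1,1 -k2,2rn \
    | awk '!seen[$1]++'
```

This prints only the two maximum rows, sample1_EGFR_19 53 exon19 and sample2_EGFR_19 25 exon12.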
If you really must use Perl, here's a simple solution to the described problem:
#!/usr/bin/perl
use strict;
use warnings;
sub sort_func {
# sort by the first col asc and then by the second col desc and numeric
$a->[0] cmp $b->[0] || $b->[1] <=> $a->[1]
}
my %seen;
print
map join("\t", @$_), # re-join the fields with tabs into the original line
grep !$seen{$_->[0]}++, # select the first line of each sorted group
sort sort_func # sort lines using the above sort function
map [split /\t/, $_, 3], # split by tabs so we can sort by the first two fields
<>; # read lines from stdin or the filename given by ARGV[0]
Mark the file executable and use it like so:
./sortlines.pl /data/Test/output.txt >/data/Test/output_changed.txt