user2697374 user2697374 - 7 months ago 11
Perl Question

Calculate mean value of column if multiple occurrences in another column

This is part of a tab separated file that I generated:

Sample Gene RPKM
MK_001_27 HPSE2 17.3767266340978
MK_003_11 HPSE2 51.1152373602497
MK_002_5 HPSE2 16.5372913024845
MK_001_23 HPSE2 25.8985857481585
MK_001_23 HPSE2 21.6045197173757
MK_001_27 HPSE2 139.450963428357
MK_001_23 HPSE2 36.7603351866179
MK_003_9 HPSE2 25.2860867858956
MK_001_22 HPSE2 100.915250867745
MK_003_9 HPSE2 35.078964327254
MK_003_12 HPSE2 34.078813048573
MK_003_9 HPSE2 13.5865540939141


Any gene is present in at least one sample. It may be present multiple times in the same sample.

I want to generate the mean of the Reads Per Kilobase per Million (RPKM) value and use it to replace multiple value for any given gene from the same sample

For example

MK_003_9 HPSE2 35.078964327254
MK_003_9 HPSE2 13.5865540939141
MK_003_9 HPSE2 25.2860867858956


becomes

MK_003_9 HPSE2 24.650535069


s there a way that you know of that could resolve this in Unix or in Perl?

Pseudocode

-For each line, combine column 0 (sample) and column 1 (gene) to create a "key"
though not always unique
-Run through the file to see if this "key" is present somewhere else
-Count the number of times this "key" is present
-If present > 1 time, calculate the mean of the RPKM values by sum()/count
-Create occurrence with this "key" and the new RPKM value
-Delete(?) the other corresponding "keys"

Answer

This is quite straightforward if you accumulate the contents of your data file into a Perl hash. You don't make it clear whether there may be multiple genes in a single file, so I've coded this for multiple samples and multiple genes

The program expects the input file as a parameter on the command line, and prints its output to STDOUT which may be redirected as normal

use strict;
use warnings qw/ all FATAL /;

use List::Util 'sum';

print scalar <>;  # Copy the header line

my %mean_rpkm;

while ( <> ) {
    my ($sample, $gene, $rpkm) = split;
    push @{ $mean_rpkm{$gene}{$sample} }, $rpkm;
}

for my $gene ( sort keys %mean_rpkm ) {

    for my $sample ( sort keys %{ $mean_rpkm{$gene} } ) {
        my $rpkm = $mean_rpkm{$gene}{$sample};
        my $mean = sum(@$rpkm) / @$rpkm;
        printf "%s\t%s\t%.3f\n", $sample, $gene, $mean;
    }
}

output

Sample      Gene    RPKM
MK_001_22   HPSE2   100.915
MK_001_23   HPSE2   28.088
MK_001_27   HPSE2   78.414
MK_002_5    HPSE2   16.537
MK_003_11   HPSE2   51.115
MK_003_12   HPSE2   34.079
MK_003_9    HPSE2   24.651



Update

The output from my solution is essentially unordered. I have sorted the genes and samples lexically, but you may want them in the same order as they appear in the input file. If so then you should say so

A useful intermediate solution is to install Sort::Naturally (it's not a core module) and add

use Sort::Naturally 'nsort';

to the top of the above program. If you then replace both occurrences of sort with nsort then you will get this output. It may not be idea, but it's an improvement because it sorts MK_003_9 before MK_003_11 which a simple lexical sort doesn't do

Sample      Gene    RPKM
MK_001_22   HPSE2   100.915
MK_001_23   HPSE2   28.088
MK_001_27   HPSE2   78.414
MK_002_5    HPSE2   16.537
MK_003_9    HPSE2   24.651
MK_003_11   HPSE2   51.115
MK_003_12   HPSE2   34.079