Sarah Sarah - 4 months ago 8
Perl Question

Merging two file based on columns and sorting

I have two files,

FILE1
and
FILE2
, that have a different number of
columns and some columns in common. In both files the first column is
a row identifier. I want to merge the two files (
FILE1
and
FILE2
)
without changing the order of the columns, and where there is a missing
value input the value '5'.

For example
FILE1
(first column is the row ID,
A1
is the first row,
A2

the second, ...):

A1 1 2 5 1
A2 0 2 1 1
A3 1 0 2 2


The column names for
FILE1
is (these are specified in another file),

Affy1
Affy3
Affy4
Affy5


which is to say that the value in row
A1
, column
Affy1
is
1

and the value in row
A3
, column
Affy5
is
2


v~~~~~ Affy3
A1 1 2 5 1
A2 0 2 1 1
A3 1 0 2 2
^~~~ Affy1


Similarly for
FILE2


B1 1 2 0
B2 0 1 1
B3 5 1 1


And its column names,

Affy1
Affy2
Affy3


Meaning that

v~~~~~ Affy2
B1 1 2 0
B2 0 1 1
B3 5 1 1
^~~~ Affy1


I want to merge and sort columns based on the column names and put a
'5' for missing values. so the merged result would be as follows:

A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5
B2 0 1 1 5 5
B3 5 1 1 5 5


And the columns:

Affy1
Affy2
Affy3
Affy4
Affy5


Which is to say,

v~~~~~~~ Affy2
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5
B2 0 1 1 5 5
B3 5 1 1 5 5
^~~~ Affy1


In reality I have over 700K columns and over 2K rows in each file. Thanks in advance!

Answer

The difficult part is ordering the headers when some of them appear only in one file. The best way I know is to build a directed graph using the Graph module and sort the elements topologically

Once that's done it's simply a matter of assigning the values from each file to the correct columns and filling the blanks with 5s

I've incorporated the headers as the first line of each data file, so this program works with this data

file1.txt

ID Affy1 Affy3 Affy4 Affy5
A1 1 2 5 1 
A2 0 2 1 1 
A3 1 0 2 2

file2.txt

ID Affy1 Affy2 Affy3
B1 1 2 0
B2 0 1 1
B3 5 1 1

And here's the code

consolidate_columns.pl

use strict;
use warnings 'all';
use autodie;

use Graph::Directed;

open my $f1_fh, '<', 'file1.txt';
open my $f2_fh, '<', 'file2.txt';

my @h1 = split ' ', <$f1_fh>;
my @h2 = split ' ', <$f2_fh>;

my @headers = do {

    my $g = Graph::Directed->new;

    $g->add_edge($h1[$_], $h1[$_+1]) for 0 .. $#h1-1;
    $g->add_edge($h2[$_], $h2[$_+1]) for 0 .. $#h2-1;

    $g->topological_sort;
};

my %headers = map { $headers[$_] => $_ } 0 .. $#headers;

my @f1_indexes = map { $headers{$_} } @h1;
my @f2_indexes = map { $headers{$_} } @h2;

print "@headers\n";

for my $fh ( $f1_fh, $f2_fh ) {

    while ( <$fh>  ) {
        next unless /\S/;

        my @columns;
        @columns[@f1_indexes] = split;
        $_ //= 5 for @columns[0 .. $#headers];

        print "@columns\n";
    }
}

output

ID Affy1 Affy2 Affy3 Affy4 Affy5
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 5 2 0 5
B2 0 5 1 1 5
B3 5 5 1 1 5