user1987607 user1987607 - 3 months ago 10
Perl Question

perl convert lines to columns when values match

I have a tab-delimited text file with the following structure

col1 col2 col3 col4 col5 col6
T1 a b c d x
T5 a b c d x
T3 a b c e y
T50 e f g h y
T2 e f g h y
T60 a b c d y


*If the values in col2, col3, col4 & col5 are the same for two or more lines, I want the value of col6 for all the matching lines to be put in an extra column that has the value of col1 in the header.
So in this case lines 1 and 2 should be considered as the same, because there values for col2,3,4,5 are the same (abcd). The same holds for lines 4 and 5 (efgh)

*If the values in col2, col3, col4 & col5 are unique, then it's just the value of the col6 for that one line that should be put in a new column.
So in this case line 3 is unique, because none of the other lines has abce as values for col2,3,4,5.

So the output should look like this.

col1 col2 col3 col4 col5 col6 T1 T2 T3 T5 T50 T60
T1 a b c d x x - - x - y
T5 a b c d x x - - x - y
T3 a b c e y - - y - - -
T50 e f g h y - y - - y -
T2 e f g h y - y - - y -
T60 a b c d y x - - x - y


I want to do this in perl. But I have no idea how to do this. Should I work with a hash to store the values of col2-col3-col4-col5? Below is just the start of the script to open input and output files

#! /usr/bin/perl
use strict;
use warnings;

open(my $table1,'<', "input.txt") or die "$! - [$input]"; #input file
open(my $table2, '+>', "output.txt") || die ("Can't write new file: $!"); #output file

Answer

This solution does two passes of the file. It has some special treatment for __DATA__ that you can remove if you work with files.

use strict;
use warnings 'all';
use feature 'say';
use List::MoreUtils 'uniq';
use Fcntl 'SEEK_SET';

# grab the header and split up the headings
chomp( my $header = <DATA> );
my @fields = split /\s+/, $header;

# we need this because __DATA__ is not a regular filehandle
my $data_start = tell DATA;

# this name is bad, because we don't know what this the column really contains
my %col6;

# first pass
while ( my $row = <DATA> ) {
    chomp $row;
    my @cols = split /\s+/, $row;

    # in a hash with col2 to col5 as key, use col1 as key and col6 as value
    $col6{ join q{::}, @cols[ 1 .. 4 ] }->{ $cols[0] } = $cols[5];
}

# reset DATA to beginning, this is not needed if you work with a file
seek DATA, $data_start, 0;

# get the additional headings from the col1 mapping
my @new_fields = sort +uniq map { keys %{ $col6{$_} } } keys %col6;

# output them
say join "\t", @fields, @new_fields;

# second pass
while ( my $row = <DATA> ) {
    chomp $row;
    my @cols = split /\s+/, $row;

    # go through all the new headings and either print the value of the col6, or a dash
    say join "\t", @cols, map { $col6{ join q{::}, @cols[ 1 .. 4 ] }->{$_} || q{-} } @new_fields;
}

__DATA__
col1    col2    col3    col4    col5    col6
T1      a       b       c       d       x
T5      a       b       c       d       x
T3      a       b       c       e       y
T50     e       f       g       h       y
T2      e       f       g       h       y

Output:

col1    col2    col3    col4    col5    col6    T1  T2  T3  T5  T50
T1  a   b   c   d   x   x   -   -   x   -
T5  a   b   c   d   x   x   -   -   x   -
T3  a   b   c   e   y   -   -   y   -   -
T50 e   f   g   h   y   -   y   -   -   y
T2  e   f   g   h   y   -   y   -   -   y
Comments