zara zara - 6 months ago 11
Bash Question

How to delete the first subset of each set of columns in a data file?

I have a data file with more than 40,000 columns. In the header, each column name begins with c1, c2, ..., cn, and each set c has one or more subsets; for example, c1 has 2 subsets. I need to delete the first column (subset) of each set. For example, if the input looks like:

input:

c1.20022 c1.31012 c2.44444 c2.87634 c2.22233 c3.00444 c3.44444
1 1 0 1 0 0 0 1
2 0 1 0 0 1 0 1
3 0 1 0 0 1 1 0
4 1 0 1 0 0 1 0
5 1 0 1 0 0 1 0
6 1 0 1 0 0 1 0


I need the output to look like:

c1.31012 c2.87634 c2.22233 c3.44444
1 0 0 0 1
2 1 0 1 1
3 1 0 1 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0


Any suggestions, please?

Answer

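Perl solution: it reads the header line, uses a regex to extract each column's set name (the part before the dot), and records the index of every column whose set name matches that of the previous column, i.e. every column that is not the first of its set. It then uses those indices to print only the wanted columns from the header and from the remaining lines.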

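#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

# The header has no name for the leading row-number column,
# so header index $i corresponds to data column $i + 1.
my @header = split ' ', <>;
my $last = q();
my @keep;
for my $i (0 .. $#header) {
    # Set name is everything before the last dot, e.g. "c1" in "c1.20022".
    my ($prefix) = $header[$i] =~ /(.*)\./;
    # Same set as the previous column: not the first of its set, so keep it.
    if ($prefix eq $last) {
        push @keep, $i + 1;
    }
    $last = $prefix;
}
# Shift the header right by one so its indices line up with the data rows.
unshift @header, q();
say join "\t", @header[@keep];

while (<>) {
    my @columns = split;
    # Column 0 is the row number, which the expected output retains.
    say join "\t", @columns[0, @keep];
}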
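
For comparison, here is a minimal awk sketch of the same idea, wrapped in a shell function. It is not part of the original answer. The function name drop_first_of_group is hypothetical. The sketch assumes, as in the sample input, that fields are whitespace-separated and that the header has no name for the leading row-number column. It keeps the row number in each data line, as in the expected output shown in the question.

```shell
#!/bin/sh
# drop_first_of_group: hypothetical helper name, used only for illustration.
# Reads the table on stdin and writes the filtered, tab-separated table to stdout.
drop_first_of_group() {
    awk '
    NR == 1 {
        # Header line: one name per data column, no name for the row number.
        n = 0; last = ""; line = ""
        for (i = 1; i <= NF; i++) {
            prefix = $i
            sub(/\.[^.]*$/, "", prefix)   # set name: text before the last dot
            if (prefix == last) {         # not the first column of its set
                keep[++n] = i + 1         # +1 skips the row-number field in data lines
                line = line (n > 1 ? "\t" : "") $i
            }
            last = prefix
        }
        print line
        next
    }
    {
        # Data line: row number first, then the kept columns.
        line = $1
        for (j = 1; j <= n; j++) line = line "\t" $(keep[j])
        print line
    }'
}
```

With the function defined, a call such as drop_first_of_group < input.txt > output.txt would filter a file, where input.txt and output.txt are placeholder names. Like the Perl version, this only compares each column with the one immediately before it, so it relies on the columns of a set being adjacent in the header.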