Marta Wisniewska - 5 months ago
Perl Question

Combining a column from different files

I'd like to write my own script, but I got stuck. It is probably trivial, but this is only my second script.

I have a number of files. Each has three columns: the first holds the same values in every file, and the third differs from file to file (I don't need the second column). Each file is named occ_CXX.dat, where XX is a number. My idea:


  1. Combine all the files so that the first column, which is common to all files, appears once, and the second, third, etc. columns come from the files in order. (For this I use $my_files=$ARGV[0], from which only the first column is taken, and $my_pdb_files=$ARGV[1], a file in which all the files are listed.) I still have to construct a proper loop to read the files and put the numbers in the right order.

  2. Moreover, I'd like to give each column a name that corresponds to the number in the file name (for occ_CXX, I'm interested in the CXX part only).



The occ_C02.dat file looks like the following:

50000.000 1 291618
50100.000 1 291618
50200.000 0
50300.000 1 115401
50400.000 1 115401
50500.000 1 115401
50600.000 1 115401
50700.000 1 115401
50800.000 1 291618
50900.000 1 291618
51000.000 1 291618
51100.000 1 291618
51200.000 1 291618
51300.000 1 291618
51400.000 1 291618
51500.000 1 291618
51600.000 1 291618
51700.000 1 291618
51800.000 1 291618
51900.000 1 291618
52000.000 1 291618
52100.000 1 291618
52200.000 1 291618
52300.000 1 291618
52400.000 1 291618
52500.000 1 291618
52600.000 1 291618
52700.000 1 291618
52800.000 1 291618
52900.000 1 291618
53000.000 0
53100.000 1 291618
53200.000 1 291618


Another file, occ_C03.dat:

50000.000 1 58902
50100.000 1 58902
50200.000 1 58902
50300.000 1 58902
50400.000 1 58902
50500.000 1 58902
50600.000 1 58902
50700.000 1 58902
50800.000 0
50900.000 1 58902
51000.000 1 58902
51100.000 1 58902
51200.000 1 58902
51300.000 1 58902
51400.000 1 58902
51500.000 1 58902
51600.000 1 58902
51700.000 1 58902
51800.000 0
51900.000 1 58902
52000.000 1 58902
52100.000 1 58902
52200.000 1 58902
52300.000 1 58902
52400.000 1 58902
52500.000 1 58902
52600.000 1 58902
52700.000 0
52800.000 1 58902
52900.000 1 58902
53000.000 1 58902
53100.000 1 58902
53200.000 1 58902


My script:

#!/usr/bin/perl -w
use strict;

my $new_file="occ_sub.dat";
my $first=$ARGV[0];
my $data=$ARGV[1];

open(FILE,$first) or die;
open(LST,$data) or die;
open(DAT,">>$new_file") or die;

while(<FILE>) {
    my $line=$_;
    my $col1=substr $line,1,13;
    print DAT " \n"; #empty line
    printf(DAT " %9.3f",$col1);
    while(<LST>) { #or I should use foreach?
        my $line1=$_;
        my $col3=substr $line,17,6; #third column only
        #I stopped here, maybe I should create a table of my columns?
    }
    close LST;
}
close FILE;
close DAT;


The file in ARGV[0] is occ_C02.dat (pasted above),
and ARGV[1] is list.dat.

list.dat:

occ_C02.dat
occ_C03.dat
.
.
.
occ_C10.dat


expected output file:

            C02     C03
50000.000   52779   58902
50100.000           58902
50200.000   52779   58902
50300.000           58902
50400.000           58902
50500.000   52779   58902
50600.000           58902
50700.000           58902


The output doesn't correspond to the numbers above. It's only an example.

Answer

By the description of your problem and some of the comments you gave, I think the code below may produce what you want.

I still don't understand how you access the files. I don't think the list.dat will make it easier. Just my opinion. I used glob instead to get the occ_CXX.dat files.
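
If you would rather keep list.dat, the file names could be read from it instead of globbed. This is only a minimal sketch, assuming list.dat holds one file name per line. The helper name read_file_list is made up for this illustration, and lines that do not look like occ_CXX.dat (such as the "." placeholders) are skipped:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script below): return the
# occ_CXX.dat names listed in a file, one name per line.
sub read_file_list {
    my ($list_file) = @_;
    open my $fh, '<', $list_file or die "Can't open $list_file $!";
    my @files;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        # keep only names shaped like occ_CXX.dat; skip blanks and "." lines
        push @files, $line if $line =~ /^occ_C\d+\.dat$/;
    }
    close $fh or die "Can't close $list_file $!";
    return @files;
}

# Usage: perl read_list.pl list.dat
if (@ARGV) {
    print "$_\n" for read_file_list($ARGV[0]);
}
```

The returned list could then take the place of the glob result, as long as the names are combined with the right directory.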

And I entered the directory on the command line. In the program, the directory name is shifted to the $dir variable.

My command line was

perl test.pl .

where the '.' is the name of the directory. I used the dot because the files are located in the same directory as my program. You would have to specify the directory or path to your files here instead of dot if your program is running from a different directory from where the files are located.
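
As a small variation, the directory argument could default to the current directory when none is given. A sketch, assuming Perl 5.10 or later for the // (defined-or) operator; the sub name dir_from_args is hypothetical and not part of my program:

```perl
use strict;
use warnings;

# Hypothetical helper: take the directory from the argument list,
# falling back to '.' (the current directory) if none was given.
sub dir_from_args {
    my @args = @_;
    my $dir = shift(@args) // '.';
    die "Not a directory: $dir\n" unless -d $dir;
    return $dir;
}

my $dir = dir_from_args(@ARGV);
my @occ_files = glob "$dir/occ_C*.dat";   # same pattern the program uses
print scalar(@occ_files), " occ_C*.dat file(s) found in $dir\n";
```

With this, both "perl test.pl" and "perl test.pl ." would look in the current directory.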

Update: I overlooked the question of which directory the output file should be written to. Here, I just write the output file to the same directory as the input files.

Update2: To sort the first column numerically rather than as strings, change

for my $col1 (sort keys %data)

to

for my $col1 (sort {$a <=> $b} keys %data)
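
The difference only shows up when the keys have different numbers of digits, because a plain sort compares them as strings. A small sketch with made-up keys (these values are not from your files):

```perl
use strict;
use warnings;

# Made-up keys for illustration; they differ in their number of digits.
my @keys = ('50000.000', '9900.000', '100000.000');

my @as_strings = sort @keys;                 # default: string comparison
my @as_numbers = sort { $a <=> $b } @keys;   # numeric comparison

print "string sort:  @as_strings\n";   # 100000.000 50000.000 9900.000
print "numeric sort: @as_numbers\n";   # 9900.000 50000.000 100000.000
```

In your sample data every key has the same width (50000.000 to 53200.000), so both versions give the same order there. The full program follows.
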

#!/usr/bin/perl
use strict;
use warnings;

my $dir = shift; # get directory from @ARGV (on the command line)

my @occ_files = sort by_number glob "$dir/occ_C*.dat";

my @headers = map /_(C\d+)/, @occ_files;

my %data;

for my $file (@occ_files) {
    open my $fh, '<', $file or die "Can't open $file $!";

    $file =~ /_(C\d+)/;
    my $col_head = $1;

    while (<$fh>) {
        my ($col1, undef, $col3) = split;
        $data{$col1}{$col_head} = $col3 || '';
    }
    close $fh or die "Can't close $file $!";
}

my $format = "%-15s" . "%-10s" x (@headers-1) . "%s\n";

my $new_file = "$dir/occ_sub.dat";
open my $out, '>', $new_file or die "Can't open $new_file $!";

printf $out $format, '', @headers;

for my $col1 (sort keys %data) {
    printf $out $format, $col1, @{ $data{$col1} }{@headers};
}

close $out or die "Can't close $new_file $!";

sub by_number {
    my ($a_num) = $a =~ /_C(\d+)\.dat/;
    my ($b_num) = $b =~ /_C(\d+)\.dat/;
    $a_num <=> $b_num;
}

This created a file with this output (using your 2 sample input files, which were named occ_C01.dat and occ_C02.dat in my test, hence the headers):

               C01       C02
50000.000      291618    58902
50100.000      291618    58902
50200.000                58902
50300.000      115401    58902
50400.000      115401    58902
50500.000      115401    58902
50600.000      115401    58902
50700.000      115401    58902
50800.000      291618
50900.000      291618    58902
51000.000      291618    58902
51100.000      291618    58902
51200.000      291618    58902
51300.000      291618    58902
51400.000      291618    58902
51500.000      291618    58902
51600.000      291618    58902
51700.000      291618    58902
51800.000      291618
51900.000      291618    58902
52000.000      291618    58902
52100.000      291618    58902
52200.000      291618    58902
52300.000      291618    58902
52400.000      291618    58902
52500.000      291618    58902
52600.000      291618    58902
52700.000      291618
52800.000      291618    58902
52900.000      291618    58902
53000.000                58902
53100.000      291618    58902
53200.000      291618    58902
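
The blank cells come from rows whose third column is missing: the program stores '' for them, and the hash slice @{ $data{$col1} }{@headers} then pulls one value per header, in header order. Here is a standalone sketch of that slice, with a few values taken from the output above. The sub name row_values is hypothetical, and unlike the program it also turns headers that were never stored for a row into '':

```perl
use strict;
use warnings;

my @headers = ('C01', 'C02');

# Same shape as %data in the program: first column => { header => third column }.
my %data = (
    '50000.000' => { C01 => '291618', C02 => '58902' },
    '50200.000' => { C01 => '',       C02 => '58902' },
    '50800.000' => { C01 => '291618' },   # C02 deliberately not stored here
);

# Hypothetical helper: one value per header, '' where nothing was stored.
sub row_values {
    my ($row, @cols) = @_;
    my @vals = @{$row}{@cols};            # hash slice on a hash reference
    return map { defined $_ ? $_ : '' } @vals;
}

for my $col1 (sort { $a <=> $b } keys %data) {
    printf "%-15s%-10s%s\n", $col1, row_values($data{$col1}, @headers);
}
```

Filling in '' for absent keys would avoid "uninitialized value" warnings if one of the files lacked a row that the others have.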