dr.nixon dr.nixon - 1 month ago 5
Perl Question

Building sorted data table using hash

I have built several scripts to organize data output from test equipment, but am hitting a mental roadblock with this one.

Test equipment monitors four types of input (Data1, Data2, Data3, Data4) from multiple subjects (with identifiers ID1, ID2, etc.), and records each in intervals with a date and time stamp. CSV file dumped by the equipment is organized like this:

Start,Date,Time0
Subject,ID1,ID2,[...],ID#

Date,Time1
Data1,aa1,aa2,[...],aa#
Data2,ba1,ba2,[...],ba#
Data3,ca1,ca2,[...],ca#
Data4,da1,da2,[...],da#

Date,Time2
Data1,ab1,ab2,[...],ab#
Data2,bb1,bb2,[...],bb#
Data3,cb1,cb2,[...],cb#
Data4,db1,db2,[...],db#


...and so on.

"Start" identifies this line as the beginning of the data; "Subject" identifies the line as the line containing subject IDs; "Data1"-"Data4" identify the line as those containing the data for that datatype in the specific time interval indicated by the preceding date and time.

Output data is thus split into multiple blocks, which is really an unfortunate choice on the part of the equipment manufacturer, especially as data is collected every few minutes over several days or weeks. To analyze the data without having to manually select every 6th line, we need to group all data types into blocks, like this:

Data1,Subject,ID1,ID2,[...],ID#
Date,Time1,aa1,aa2,[...],aa#
Date,Time2,ab1,ab2,[...],ab#
...

Data2,Subject,ID1,ID2,[...],ID#
Date,Time1,ba1,ba2,[...],ba#
Date,Time2,bb1,bb2,[...],bb#
...


The goal is to have each of the four data types in separate blocks, so that timecourse data for any given subject (ID1 through ID#) will be in a single column, with date and time as the initial columns. ("DataX" and "Subject" in the above are simply used as column headers.)

Currently I am doing this by putting each line into a separate array. This was a quick-and-dirty way of getting things done; the script grabs the time and date, and pushes the ID line into each of four arrays (one for each data type), then proceeds to add each data line sequentially based on the data type. Output just prints each array line by line, adds a blank line, then prints the next array. This works, but ideally I would like to sort the data columns across by subject ID, and then print out the data, without losing the vertical sort by date and timestamp. (Because the data is already vertically sorted I do not currently have a sort function on the arrays before printing.)

What's the simplest way to do this? Mentally I am having trouble trying to parse how to associate data in Row Y, Column X with the subject ID in Column X in the CSV file. Every other data output file I have used either keeps subject ID as the first item in each line or has one file per subject, which makes it easier.

Note: Because time/date are on their own line, I am using a variable for each; if the script detects a line containing a new time and/or date, it updates the variable value.

Edit --
I incorporated some of Borodin's suggestions (leaving FH handling by line rather than by paragraph). I have data from subject line pulled into an array (@ids), and am pushing data rows into a hash using date/time and ID as keys:

my ($datatype, @fields) = @line;
push @keys, $datatype unless exists $data{$datatype};
my $datetime = "$date\,$time";
push @timestamps, $datetime unless exists $data{$datetime};
for my $i ( 0 .. $#fields) {
push @{$data{$datetime}{$ids[$i]}}=>$fields[$i]
};


I am also dropping the date-time pairs into a second array to maintain order (@timestamps).
Problem at this point is that I am having issues printing the values back out. Currently trying:

foreach my $date (keys %data) {
print OUT $date;
foreach my $id (@ids) {
foreach my $s (keys %{$data{$date}}) {
if ( exists($data{$date}{$id}) ) {
print OUT ",", $data{$date}{$id}
}
else {
print OUT ",";
}
}
}
print OUT "\n"; # close printing on a given date
}


Keep getting garbage output (printing the hash reference, not the actual value!). Dumper output looks like this:

$VAR1 = {
'date,time' => [
'ID1' => [
'0.00'
]
'ID2' => [
'0.12',
]
'ID3' => [
'0.17',
]
'ID4' => [
'0.22',
]
]
}
};


and the printed output is like this:

date,time,ARRAY(0x7f91c1030f60),ARRAY(0x7f91c1030f60),ARRAY(0x7f91c1030f60),ARRAY(0x7f91c1030f60)


Sorry the examples so far have been causing issues in interpretation. There is a lot of excess data and text in the input files, I only included a highly simplified version of the portions I am trying to extract and sort.

Answer

This program reads all of the data into a hash and reproduces it in the transformed state you require. This is fine unless the data is enormous and will not easily fit into the memory you have available, in which case you will need a different solution.

The program expects the name of the input file as a command-line parameter, and defaults to data.csv if none was supplied. It sets $/ to the null string to enable Perl's "paragraph mode" of input, where data is read up to the next blank line or end of file. That means all data for a given time interval are read at once, and must be split further into individual rows before processing.

use strict;
use warnings;

my ($subject, @ids);
my @sort_order;
my (%data, @keys);

my ($file) = @ARGV;
$file //= 'data.csv';

open my $fh, '<', $file or die qq{Unable to open file "$file" for reading: $!};
local $/ = '';
while (<$fh>) {

  my @rows = split /\n/;

  unless ($subject) {
    ($subject, @ids) = split /,/, $rows[1];
    @sort_order = sort { $ids[$a] cmp $ids[$b] } 0 .. $#ids;
    next;
  }

  my ($date, $time) = split /,/, shift @rows;
  for (@rows) {
    my ($id, @fields) = split /,/;
    push @keys, $id unless exists $data{$id};
    push @{ $data{$id} }, [$date, $time, @fields[@sort_order]];
  }
}

for my $key (@keys) {
  print join(',', $key, $subject, @ids[@sort_order]), "\n";
  print join(',', @$_), "\n" for @{ $data{$key} };
  print "\n";
}

output

Data1,Subject,ID#,ID1,ID2,[...]
Date,Time1,aa#,aa1,aa2,[...]
Date,Time2,ab#,ab1,ab2,[...]

Data2,Subject,ID#,ID1,ID2,[...]
Date,Time1,ba#,ba1,ba2,[...]
Date,Time2,bb#,bb1,bb2,[...]

Data3,Subject,ID#,ID1,ID2,[...]
Date,Time1,ca#,ca1,ca2,[...]
Date,Time2,cb#,cb1,cb2,[...]

Data4,Subject,ID#,ID1,ID2,[...]
Date,Time1,da#,da1,da2,[...]
Date,Time2,db#,db1,db2,[...]