Kimbluey Kimbluey - 6 months ago 14
Perl Question

Perl: Populate 2D Array of Unknown Length with Multiline Data

Background



I have a Perl program that is going through directories and parsing text files for certain information. One such piece of information is the Analysis Block, which looks like this:

*ANALYSIS_START* [analysis ID]
Line(s) = [multi- or single-line Line(s) data]
Reason Code = [single-line Reason Code data]
CR = [single-line CR data]
Note = [multi-line Note data]
[multi-line Note data]
*ANALYSIS_END*


A text file can have zero Analysis Blocks, or it may have any number of these Analysis Blocks - the number of blocks and the size of each are unknown. What I'm hoping to do is to gather the information within these blocks in a 2D array. For example, if a text file had exactly 2 Analysis Blocks, the 2D array would look something like this:

$VAR1 = [
[
Lines = [multi- or single-line data]
Reason Code = [single-line data]
CR = [single-line data]
Note = [multi-line data]
[multi-line data]
]
[
Lines = [multi- or single-line data]
Reason Code = [single-line data]
CR = [single-line data]
Note = [multi-line data]
[multi-line data]
]
];


If someone has a better suggestion for gathering data while keeping each Analysis Block together as shown above, let me know. There may be a better solution than a 2D array that I'm unaware of.

Attempt



I'm fairly new to Perl, but I understand how to create a 2D array by looking at this SO question. The problem is that I'm not sure how to populate a 2D array with my specific case. So far, I have the following code:

while (my $current_line = <$textfile>) {

# Code that gets other, single-line information from file

$pattern = '\*ANALYSIS_START\*';
if ($current_line =~ $pattern) { # Find Analysis Block
push @analysis_IDs, $1; # Get the analysis ID
while(<$textfile>) {
last if /\*ANALYSIS_END\*/; # Stop at block's end
push @analysis_info, $_; # Append each line of data
}
}
}


Of course, this causes my array to look something like this, where each line of the file is separate but the Analysis Blocks are not:

$VAR1 = ''
$VAR2 = 'Lines = [lines data]'
$VAR3 = 'Reason Code = [reason code data]'
$VAR4 = 'CR = [cr data]'
$VAR5 = 'Note = [note data]'
$VAR6 = ' [note data...]'
$VAR7 = ''
$VAR8 = 'Lines = [lines data]'
$VAR9 = 'Reason Code = [reason code data]'
$VAR10= 'CR = [cr data]'
$VAR11= 'Note = [note data]'
$VAR12= ' [note data...]'


Question



I'm having trouble wrapping my head around how to iterate through each section of the file in order to create the desired 2D array. I've probably just been staring at it too long.

How can I create the array I need? All explanations, word-only or those with code examples, are very much appreciated.




Can my question be improved? Please let me know in the comments!

Answer

Here is a way to get what the question asks for, in particular by using an array of arrays.

use warnings;
use strict;

my $file = 'data_analysis.txt';
open my $fh, '<', $file or die "Can't open $file -- $!";

# Prepare (and compile) START/END patterns, capturing ID in START
my $start_pattern = qr|\*ANALYSIS_START\*\s*\[([^\]]+)\]|;
my $end_pattern   = qr(\*ANALYSIS_END\*);

my @analysis_IDs;
my @analysis_info;

while (my $line = <$fh>) 
{
    chomp($line);

    # Code that gets other, single-line information from file

    if ($line =~ $start_pattern .. $line =~ $end_pattern) 
    {   
        if ($line =~ $start_pattern) {
            push @analysis_IDs, $1;    # Get the analysis ID
            push @analysis_info, [];   # Add arrayref for this block's lines
        }   
        elsif (not $line =~ $end_pattern) {
            push @{$analysis_info[-1]}, $line;  # add to last []
        }
    }   
}

print "$_\n" for @analysis_IDs;

use Data::Dumper;
print Dumper(\@analysis_info);

The code uses the range operator .. to determine when it is inside a block. In scalar context this operator keeps its state across iterations: it becomes true when the first condition is true and stays true until the second condition is true (that line included), after which it is false again. This saves us from maintaining a separate variable for that. See Range Operators in perlop. Since the start and end lines need different treatment, they are distinguished (again) inside. This is not the most efficient way, but I am hoping that it is clear.
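The re-matching inside the range can be avoided, since in scalar context .. returns more than true/false: perlop documents that it returns a sequence number starting at 1, and that the last number of a range has the string "E0" appended. A minimal sketch of using that, with made-up sample lines (the ID and field values are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical sample input; the ID and field values are made up.
my @lines = (
    'header line',
    '*ANALYSIS_START* [A1]',
    'Reason Code = 7',
    'CR = 123',
    '*ANALYSIS_END*',
    'trailer line',
);

my @seen;   # [sequence value, line] pairs for lines inside the range
for my $line (@lines) {
    # Scalar context: '' outside the range; 1, 2, 3, ... inside it,
    # with "E0" appended to the value for the last line of the range.
    my $seq = ($line =~ /\*ANALYSIS_START\*/ .. $line =~ /\*ANALYSIS_END\*/);
    next unless $seq;
    push @seen, [$seq, $line];
}

# Keep only the body: drop the first line (1) and the last one (ends in E0)
my @body = map  { $_->[1] }
           grep { $_->[0] ne '1' && $_->[0] !~ /E0$/ } @seen;

print "$_->[0]\t$_->[1]\n" for @seen;
```

Here @body ends up holding only the lines between the START and END markers, without testing either pattern a second time.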

Matching uses $line =~ $pattern instead of the common $line =~ /$pattern/ since the patterns were prepared with qr. An explicit $line is used for clarity, but one can just use (implicitly) $_, which makes for more compact code. In particular, the range condition simplifies to (/$start_pattern/ .. /$end_pattern/) (now we do need delimiters).
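For comparison, the same loop written with the implicit $_ might look like the sketch below. It is self-contained only for illustration: the sample data is made up and is read through an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical sample data, read through an in-memory filehandle
my $data = <<'END_DATA';
Some other line
*ANALYSIS_START* [A1]
Line(s) = 10
Note = first
  second
*ANALYSIS_END*
*ANALYSIS_START* [B2]
CR = 42
*ANALYSIS_END*
END_DATA

open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my $start_pattern = qr/\*ANALYSIS_START\*\s*\[([^\]]+)\]/;
my $end_pattern   = qr/\*ANALYSIS_END\*/;

my (@analysis_IDs, @analysis_info);

while (<$fh>) {
    chomp;
    # Skip everything outside a START .. END range
    next unless /$start_pattern/ .. /$end_pattern/;

    if (/$start_pattern/) {
        push @analysis_IDs, $1;      # ID captured by the START pattern
        push @analysis_info, [];     # new arrayref for this block
    }
    elsif (not /$end_pattern/) {
        push @{ $analysis_info[-1] }, $_;   # add to the last block
    }
}
close $fh;

print "$analysis_IDs[$_]: ", scalar @{ $analysis_info[$_] }, " line(s)\n"
    for 0 .. $#analysis_IDs;
```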

With this approach you keep one array with the analysis IDs and another with the blocks, as asked. They correspond by index, but that may not be the most reliable system.

Instead, a hash of arrays can be used, for example. Then the anonymous array with a block's content would be the value for the key, which is the ID. In this case the order of the blocks would not be maintained, though. If needed, that can be solved with another auxiliary structure, for example an array of IDs in file order.
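A minimal sketch of the hash-of-arrays idea follows. The names %analysis, @order and $current_id are made up for illustration, the sample data is hypothetical, and the sketch assumes that IDs are unique within a file (a repeated ID would overwrite the earlier block). Note that it tracks state with a plain variable rather than the range operator.

```perl
use strict;
use warnings;

# Hypothetical sample data, read through an in-memory filehandle
my $data = <<'END_DATA';
*ANALYSIS_START* [B2]
CR = 42
*ANALYSIS_END*
*ANALYSIS_START* [A1]
Reason Code = 7
Note = first
  second
*ANALYSIS_END*
END_DATA

open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my %analysis;     # ID => arrayref with that block's lines
my @order;        # IDs in file order (the auxiliary structure)
my $current_id;   # ID of the block being read; undef outside a block

while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /\*ANALYSIS_START\*\s*\[([^\]]+)\]/) {
        $current_id = $1;
        push @order, $current_id;
        $analysis{$current_id} = [];     # assumes IDs are unique
    }
    elsif ($line =~ /\*ANALYSIS_END\*/) {
        undef $current_id;               # block is finished
    }
    elsif (defined $current_id) {
        push @{ $analysis{$current_id} }, $line;
    }
}
close $fh;

# Print the blocks in the order they appeared in the file
for my $id (@order) {
    print "$id\n";
    print "    $_\n" for @{ $analysis{$id} };
}
```

Looking a block up by ID is then a direct hash access, e.g. @{ $analysis{A1} }, while @order is only needed when file order matters.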

See the tutorial on arrays of arrays (perllol) and the data structures cookbook (perldsc).
