B. Monster B. Monster - 6 months ago 22
Perl Question

creating an array of arrays in perl and deleting from the array

I'm writing this to avoid a O(n!) time complexity but I only have pseudocode right now because there are some things I'm unsure about implementing.

This is the format of the file that I want to pass into this script. The data is sorted by the third column -- the start position.

93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
...
...
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530


Explanation of the code:

I want to create an array of arrays to find when two pieces of information have overlapping lengths.

Columns 3 and 4 of the input file are start and stop positions on a single track line. If any row(x) has a position in column 3 that is shorter than the position in column 4 in any row(y) then this means that x starts before y ends and there is some overlap.

I want to find every row that overlaps with asnyrow without having to compare every row to every row. Because they are sorted I simply add a string to an inner array of the array which represents one row.
If the new row being looked at does not overlap with one of the rows already in the array then (because the array is sorted by the third column) no further row will be able to overlap with the row in the array and it can be removed.

This is what I have an idea of

#!/usr/bin/perl -w

use strict;

my @array

while (<>) {

my thisLoop = ($id, $name, $begin, $end) = split;
my @innerArray = split; # make an inner array with the current line, to
# have strings that will be printed after it

push @array(@innerArray)

for ( @array ) { # loop through the outer array being made to see if there
# are overlaps with the current item

if ( $begin > $innerArray[3]) # if there are no overlaps then print
# this inner array and remove it
# (because it is sorted and everything
# else cannot overlap because it is
# larger)
# print @array[4-]
# remove this item from the array
else
# add to array this string
"$id overlap with innerArray[0] \t innerArray[0]: $innerArray[2], $innerArray[3] "\t" $id : $begin, $end
# otherwise because there is overlap add a statement to the inner
# array explaining the overlap


The code should produce something like

87 overlap with 93 93: 1 82 87: 1 7982
76 overlap with 93 93: 1 82 76: 1 20690
65 overlap with 93 93: 1 82 65: 2 170
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
228 overlap with 166 166: 72503 123150 228: 72510 114530


This was tricky to explain so ask me if you have any questions

Answer

UPDATE Upon further review: adjusted comments to the fact that data is sorted usefully


I am using the posted input and output files as a guide on what is required.

A note on complexity. In principle, each line has to be compared to all following lines. The number of operations actually carried out depends on the data. Since it is stated that the data is sorted on the field to be compared the inner loop iterations can be cut as soon as overlapping stops.

This compares each line to the ones following it. For that all lines are first read into an array. If the data set is very large this should be changed to read line by line and then the procedure turned around, to compare the currently read line to all previous. This is a very basic approach. It may well be better to build auxiliary data structures first, possibly making use of suitable modules.

use warnings;
use strict;

my $file = 'data_overlap.txt';
my @lines = do { 
    open my $fh, '<', $file or die "Can't open $file -- $!";
    <$fh>;
};

# For each element compare all following ones, but cut out 
# as soon as there's no overlap since data is sorted
for my $i (0..$#lines) 
{  
    my @ref_fields = split '\s+', $lines[$i];
    for my $j ($i+1..$#lines) 
    {   
        my @curr_fields = split '\s+', $lines[$j]; 
        if ( $ref_fields[-1] > $curr_fields[-2] ) { 
            print "$curr_fields[0] overlap with $ref_fields[0]\t" .
                "$ref_fields[0]: $ref_fields[-2] $ref_fields[-1]\t" .
                "$curr_fields[0]: $curr_fields[-2] $curr_fields[-1]\n";
        }   
        else { print "\tNo overlap, move on.\n"; last }
    }   
}

With the input in file 'data_overlap.txt' this prints

87 overlap with 93      93: 1 82        87: 1 7912
76 overlap with 93      93: 1 82        76: 2 20690
65 overlap with 93      93: 1 82        65: 2 170
        No overlap, move on.
76 overlap with 87      87: 1 7912      76: 2 20690
65 overlap with 87      87: 1 7912      65: 2 170
        No overlap, move on.
65 overlap with 76      76: 2 20690     65: 2 170
256 overlap with 76     76: 2 20690     256: 17515 66740
        No overlap, move on.
        No overlap, move on.
        No overlap, move on.
228 overlap with 166    166: 72503 123150       228: 72510 114530