zara zara - 4 months ago 9
Bash Question

How to group values based on a “connectedness” metric when some numbers are missing from the first column?

There is a file with three columns. The numbers in columns 1 and 2 do not come from the same set: some numbers that appear in column 2 may not appear in column 1.

Column 3 shows the amount of connectedness between numbers in columns 1 and 2.

I want to partition my numbers in column 1 into groups of consecutive values (i.e., ranges) for which connectedness is greater than or equal to 0.2. For example, in this small data set:

input:

1 2 0.000
1 3 0.213
1 4 0.014
1 5 0.001
1 6 0.555
1 7 0.509
1 8 0.509
3 4 0.995
3 5 0.323
3 6 0.555
3 7 0.225
3 8 0.000
4 5 0.095
4 6 0.058
4 7 0.335
4 8 0.000
5 6 0.995
5 7 0.658
5 8 0.000
6 7 0.431
6 8 0.000
7 8 0.000


the output should be like:

output:

G1: 1 3 G2: 4 G3: 5 6 7


As you can see, number 2 is missing from column 1, so 2 is not considered for any group. The connectedness between 1 and 3 is greater than 0.2, so 1 and 3 are placed in the first group. In fact, any pair of numbers within a group must be sufficiently connected. Despite the high connectedness between 1/3 and 6 (0.555 > 0.2), 6 should not be placed in the first group, because the numbers between 1 and 6 in column 1 (4 and 5) have low connectedness with 1. So we must not jump over 4 and 5 to connect the first group with 6.

Number 4 does not have high connectedness with 5, so 4 forms the second group on its own. It does not matter that 4 has high connectedness with 7 (0.335): the numbers in between (5 and 6) have low connectedness with 4, and we must not jump over them to connect 4 with 7.

5 has high connectedness with 6 and 7 (0.995 and 0.658), and the pair 6/7 is also highly connected (0.431). That is why 5, 6 and 7 can be placed together in the third group. Number 8 appears only in column 2, so it is not placed in any group.
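The grouping rule described above can be sketched in a few lines of Python. This is only one reading of the rule, not a definitive implementation: it assumes that a number joins the current group only if its connectedness with every member of that group is at least 0.2, and that a pair absent from the file counts as 0.0. The names `group_consecutive`, `SAMPLE` and `THRESHOLD` are made up for the illustration.

```python
from typing import Dict, List, Tuple

THRESHOLD = 0.2  # from the question: connectedness must be >= 0.2


def group_consecutive(rows: List[Tuple[int, int, float]],
                      threshold: float = THRESHOLD) -> List[List[int]]:
    """Split the column-1 numbers into runs whose members are pairwise connected."""
    # Only numbers that appear in column 1 are considered.
    considered = sorted({a for a, _, _ in rows})
    score: Dict[Tuple[int, int], float] = {(a, b): c for a, b, c in rows}

    groups: List[List[int]] = []
    for n in considered:
        # n joins the current group only if it is connected to every member;
        # a pair that is absent from the data is treated as 0.0 (assumption).
        if groups and all(score.get((m, n), 0.0) >= threshold for m in groups[-1]):
            groups[-1].append(n)
        else:
            groups.append([n])
    return groups


SAMPLE = """\
1 2 0.000
1 3 0.213
1 4 0.014
1 5 0.001
1 6 0.555
1 7 0.509
1 8 0.509
3 4 0.995
3 5 0.323
3 6 0.555
3 7 0.225
3 8 0.000
4 5 0.095
4 6 0.058
4 7 0.335
4 8 0.000
5 6 0.995
5 7 0.658
5 8 0.000
6 7 0.431
6 8 0.000
7 8 0.000
"""

rows = [(int(a), int(b), float(c))
        for a, b, c in (line.split() for line in SAMPLE.splitlines())]
groups = group_consecutive(rows)
print(" ".join(f"G{i}: {' '.join(map(str, g))}" for i, g in enumerate(groups, 1)))
# G1: 1 3 G2: 4 G3: 5 6 7
```

Because a group is closed as soon as one number fails the test, the sketch never jumps over 4 and 5 to attach 6 to the first group.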

Note that the real data does not begin at number 1 (column 1 may start with a number bigger than 1) and there are over 100,000 lines. Also, a few numbers that exist in column 2 may be missing from column 1, but the connectedness of such missing numbers with all other numbers is always zero.
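Since the file is large and column 1 can skip numbers, one pass over the lines is enough to find out which numbers actually take part in the grouping. The following is a minimal sketch under the assumption that all rows for a given column-1 number are contiguous, as in the samples; `column1_values` is a hypothetical helper name.

```python
import io
from typing import Iterable, List


def column1_values(lines: Iterable[str]) -> List[int]:
    """Return the distinct column-1 numbers in order of first appearance."""
    seen: List[int] = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        first = int(fields[0])
        # Rows for one number are assumed to be contiguous, so comparing
        # with the last stored value is enough to drop repeats.
        if not seen or seen[-1] != first:
            seen.append(first)
    return seen


sample = io.StringIO("1 2 0.000\n1 3 0.213\n3 4 0.995\n4 5 0.095\n")
considered = column1_values(sample)
print(considered)  # [1, 3, 4]
```

Number 2 never shows up in the result because it occurs only in column 2. The function reads line by line, so an open file object can be passed in place of the `io.StringIO` sample.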

Here is part of my real data:
input:

49996 49997 0.000
49996 49998 0.082
49996 49999 0.953
49996 50000 0.060
49996 50001 0.000
49998 49999 0.288
49998 50000 0.288
49998 50001 0.000
49999 50000 0.265
49999 50001 0.000
50000 50001 0.000


output should be:

G1: 49996 G2: 49998 49999 50000

Answer

I had to start from scratch instead of editing my reply to your previous question, as I needed to preprocess the input file first to get the list of numbers to consider (i.e. the numbers from the first column).

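#!/usr/bin/perl
use warnings;
use strict;

# Minimum connectedness for two numbers to share a group (inclusive).
my $THRESHOLD = 0.2;

# First pass: collect the distinct numbers of column 1, in file order.
my @considered;
open my $IN, '<', shift or die $!;
while (<$IN>) {
    my ($first) = split ' ', $_, 2;
    next unless defined $first;    # skip blank lines
    push @considered, $first unless @considered && $first == $considered[-1];
}
die "No data in input\n" unless @considered;

# Second pass: rewind the file and build the groups.
seek $IN, 0, 0;
my $considered_idx = 0;    # index of the number that opened the current group
my @groups = ([ $considered[$considered_idx] ]);
while (<$IN>) {
    my ($n1, $n2, $connectedness) = split;
    next unless defined $connectedness;    # skip blank or short lines

    # Skip column 2 numbers that are missing from column 1 and lie before
    # the next considered number.
    next if $considered_idx < $#considered
         && $n1 == $considered[$considered_idx]
         && $n2 < $considered[ 1 + $considered_idx ];

    # Skip numbers beyond the last one found in column 1.
    next if $n2 > $considered[-1];

    if ($n1 == $considered[$considered_idx]) {
        if ($connectedness >= $THRESHOLD) {
            push @{ $groups[-1] }, $n2;

        } else {
            # Connectedness too low: start a new group at the first
            # considered number that is not smaller than $n2.
            ++$considered_idx until $considered_idx > $#considered
                                 || $considered[$considered_idx] >= $n2;
            push @groups, [ $considered[$considered_idx] ];
        }
    }
}
close $IN;

# Print one group per line: zero-based group index, a tab, then the members.
for my $i (0 .. $#groups) {
    print "$i\t@{ $groups[$i] }\n";
}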