zara - 1 year ago 46

Bash Question

There is a file with three columns. The numbers in columns 1 and 2 are not from the same set: some numbers that exist in column 2 may not exist in column 1.

Column 3 shows the amount of connectedness between numbers in columns 1 and 2.

I want to partition my numbers in column 1 into groups of consecutive values (i.e., ranges) for which connectedness is greater than or equal to 0.2. For example, in this small data set:

input:

1 2 0.000

1 3 0.213

1 4 0.014

1 5 0.001

1 6 0.555

1 7 0.509

1 8 0.509

3 4 0.995

3 5 0.323

3 6 0.555

3 7 0.225

3 8 0.000

4 5 0.095

4 6 0.058

4 7 0.335

4 8 0.000

5 6 0.995

5 7 0.658

5 8 0.000

6 7 0.431

6 8 0.000

7 8 0.000

the output should be like:

output:

`G1: 1 3 G2: 4 G3: 5 6 7`

As you can see, number 2 is missing from column 1, so we do not consider 2 for any group. The connectedness between 1 and 3 is greater than 0.2, so 1 and 3 should be placed in the first group. In fact, any pair of numbers within a group must be sufficiently connected. Despite the high connectedness between 1/3 and 6 (0.555 > 0.2), 6 should not be placed in the first group, since the numbers between 1 and 6 in column 1 (4 and 5) have low connectedness with 1. So we must not jump over 4 and 5 to connect the numbers in the first group with 6.

Number 4 does not have high connectedness with 5, so 4 should form the second group on its own. It does not matter that 4 has high connectedness with 7, since the numbers in between (5 and 6) have low connectedness with 4, and we must not jump over them to connect 4 with 7.

5 has high connectedness with 6 and 7, and the remaining pair, 6/7, is also highly connected. Therefore 5, 6 and 7 should be placed together in the third group.

Note that the real data does not begin at number 1 (column 1 may begin with a larger number) and contains over 100,000 lines. Also, a few numbers that exist in column 2 might be missing from column 1, but the connectedness of those missing numbers with all other numbers is always zero.

here is a part of my real data:

input:

49996 49997 0.000

49996 49998 0.082

49996 49999 0.953

49996 50000 0.060

49996 50001 0.000

49998 49999 0.288

49998 50000 0.288

49998 50001 0.000

49999 50000 0.265

49999 50001 0.000

50000 50001 0.000

output should be:

`G1: 49996 G2: 49998 49999 50000`

Answer Source

I had to start from scratch instead of editing my reply to your previous question, as I needed to preprocess the input file first to get the list of numbers to consider (i.e. the numbers from the first column).

```
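#!/usr/bin/perl
use warnings;
use strict;

my $THRESHOLD = 0.2;

# First pass: collect the distinct values of column 1, in file order.
my @considered;
open my $IN, '<', shift or die $!;
while (<$IN>) {
    my ($first) = split ' ', $_, 2;
    push @considered, $first unless @considered && $first == $considered[-1];
}

# Second pass: rewind the file and build the groups.
seek $IN, 0, 0;
my $considered_idx = 0;    # index of the number that opened the current group
my @groups = ([ $considered[$considered_idx] ]);
while (<$IN>) {
    my ($n1, $n2, $connectedness) = split;

    # Ignore numbers beyond the last one that appears in column 1.
    next if $n2 > $considered[-1];
    # Ignore numbers missing from column 1 that directly follow the group opener.
    next if $n1 == $considered[$considered_idx]
         && $n2 < $considered[ 1 + $considered_idx ];

    if ($n1 == $considered[$considered_idx]) {
        if ($connectedness >= $THRESHOLD) {    # "greater than or equal to 0.2"
            push @{ $groups[-1] }, $n2;
        } else {
            # Too weak: the next considered number >= $n2 opens a new group.
            ++$considered_idx until $considered_idx > $#considered
                                 || $considered[$considered_idx] >= $n2;
            push @groups, [ $considered[$considered_idx] ];
        }
    }
}

# Print one group per line: zero-based index, a tab, then the members.
for my $i (0 .. $#groups) {
    print "$i\t@{ $groups[$i] }\n";
}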
```
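Note that the script above only compares each candidate with the number that opened the current group. The question also says that any pair of numbers within a group must be sufficiently connected. A minimal sketch of that stricter reading, written as a shell function around `awk`, might look like the following. The function name `group_ranges`, the `G1:` output layout (one group per line) and the choice to hold all pairs in memory are my assumptions, not something taken from the question.

```shell
#!/bin/sh
# Hypothetical helper: group_ranges THRESHOLD [FILE...]
# Reads "n1 n2 connectedness" lines from FILE(s) or from stdin.
group_ranges() {
    t=$1
    shift
    awk -v t="$t" '
        # Skip blank or incomplete lines.
        NF < 3 { next }
        # Remember each distinct column-1 value in order of first appearance.
        !($1 in seen) { seen[$1] = 1; order[++n] = $1 }
        # Store the connectedness of every pair.
        { conn[$1, $2] = $3 }
        END {
            if (n == 0) exit
            g = 1          # current group number
            start = 1      # index in order[] of the first member of the group
            line = "G1: " order[1]
            for (i = 2; i <= n; i++) {
                # The candidate must reach the threshold with EVERY member
                # of the current group; an unseen pair counts as 0.
                ok = 1
                for (j = start; j < i; j++) {
                    if (conn[order[j], order[i]] + 0 < t + 0) { ok = 0; break }
                }
                if (ok) {
                    line = line " " order[i]
                } else {
                    print line
                    g++
                    start = i
                    line = "G" g ": " order[i]
                }
            }
            print line
        }
    ' "$@"
}

# Demo on the excerpt of real data from the question.
group_ranges 0.2 <<'EOF'
49996 49997 0.000
49996 49998 0.082
49996 49999 0.953
49996 50000 0.060
49996 50001 0.000
49998 49999 0.288
49998 50000 0.288
49998 50001 0.000
49999 50000 0.265
49999 50001 0.000
50000 50001 0.000
EOF
```

On that excerpt the demo prints `G1: 49996` followed by `G2: 49998 49999 50000`. Storing every pair costs memory proportional to the number of lines, which should be acceptable for a file of around 100,000 lines, but it is a trade-off against the two-pass streaming approach of the Perl script.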