user3781528 - 3 months ago
Perl Question

identify and insert the missing rows

An array is populated from a tab-delimited, five-column text file that is sometimes missing rows. I need to identify the missing rows and insert them. Inserting the string "blank row found" in their place is sufficient.

Here is an example of data from file:

chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3


I’ve created an array that lists the second-column value of every row that should be present in the file, in the order the rows should appear. However, I'm not sure how to continue from here, since I'm unable to install any Perl modules on the server (e.g. Arrays::Utils).

Is comparing arrays the correct way of approaching this problem? Perhaps there is a straightforward solution that doesn’t require installing any CPAN modules? Thanks for your help.

#!perl
use strict;
use warnings;
use File::Basename;
#use Arrays::Utils;

opendir my $dir, "/data/test_all_runs" or die "Cannot open directory: $!";
my @run_folder = readdir $dir;
closedir $dir;

my $run_folder = pop @run_folder; print "The folder is".$run_folder."\n";

my $home="/data/";

my $CNV_file = $home."test_all_runs/".$run_folder."/CNV.txt";

my @CNVarray;
open(TXT2, "$CNV_file");
while (<TXT2>){
push (@CNVarray, $_);
}
close(TXT2);

foreach (@CNVarray){
chop($_);
}

my @array1 = map { $_->[1] } @CNVarray;

my @array2 = qw(MTOR JAK1 NRAS DDR2 MYCN ALK IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1 FGFR3 PDGFRA KIT APC FGFR4 ROS1 ESR1 EGFR CDK6 MET SMO BRAF FGFR1 MYC JAK2 GNAQ RET FGFR2 HRAS CCND1 BIRC2 KRAS ERBB3 CDK4 AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1 GNA11 MAP2K2 JAK3 AR MED12);

my %array1_hash;
my %array2_hash;

# Create a hash entry for each element in @array1
for my $element ( @array1 ) {
$array1_hash{$element} = @array1;
}

# Same for @array2: This time, use map instead of a loop
map { $array_2{$_} = 1 } @array2;

for my $entry ( @array2 ) {

if ( not $array1_hash{$entry} ) {
return 1; #Entry in @array2 but not @array1: Differ

}else {
return 0; #Arrays contain the same elements
}
#if ( keys %array_hash1 != keys %array_hash2 ) {
#return 1; #Arrays differ
}

Answer

If I understand correctly, you have a separate reference list of keywords that must appear in the second field of each row, in that order. One way to find skipped rows is to iterate through both lists.

Iterating over two lists at once is usually fiddly and error-prone, but in this case it can be made easier by removing an element from the front of the reference list each time one is handled. Then you only ever need to compare the current line against the first element of the reference list. Here is the basic logic, with a better version further below.

use warnings;
use strict;

open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(@CNVarray);

my @ref_list = qw(MTOR JAK1 ...);

foreach my $line (@CNVarray)
{
    if ( (split "\t", $line)[1] eq $ref_list[0] ) {  # good row
        shift @ref_list;
        print $line, "\n";
    }
    else {
        shift @ref_list;
        print "blank row found\n";
        while ( (split "\t", $line)[1] ne $ref_list[0] ) {
            # no match, keep going through the reference list
            shift @ref_list;
            print "blank row found\n";
        }
        # the reference list has caught up with this row
        shift @ref_list;
        print $line, "\n";
    }
}

The while loop is needed since several consecutive rows can be missing, so we need to advance to the place in the reference list that matches the current row. A few notes on the code:

  • The filehandle read <...> in list context returns all lines at once.
  • The chop in the original code removes the last character, whatever it is, which is probably not what you want. It is chomp that removes the trailing newline (or really, the value of $/).
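
To make the two notes above concrete, here is a small sketch. It reads from an in-memory string rather than CNV.txt, so it runs without any files or modules; the sample text and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the real file; the last line has no newline.
my $text = "AA\tfirst\nBB\tmore\nEE\tlast";

# A filehandle opened on a reference to a string reads from that string.
open my $fh, '<', \$text or die "Can't open in-memory file: $!";
my @lines = <$fh>;    # list context: every line is returned at once
close $fh;

my @chopped = @lines;
chop(@chopped);       # removes the last character of each element, whatever it is

my @chomped = @lines;
chomp(@chomped);      # removes a trailing newline ($/) only if one is present

print "read ", scalar(@lines), " lines\n";
print "chop  gives '$chopped[2]'\n";    # the final 't' of "last" is lost
print "chomp gives '$chomped[2]'\n";    # the line is left intact
```

The difference only shows on a line that does not end in a newline, which is typically the last line of a file.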

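Tested against the reference list qw(AA BB CC DD EE) with the input file below (note that its fields are separated by spaces, so for this test the split has to be on whitespace, ' ', rather than "\t")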

1 AA first
2 BB more
5 EE last

it prints

1 AA first
2 BB more
blank row found
blank row found
5 EE last

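The code above can be simplified, since the while loop alone covers both cases: it does nothing for a good row and skips ahead for missing ones. (Lines are also collected in an array, then printed to a new file.)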

use warnings;
use strict;

open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(@CNVarray);

my @ref_list = qw(MTOR JAK1 ...);
my @new_lines;

foreach my $line (@CNVarray)
{
    while ( (split "\t", $line)[1] ne $ref_list[0] ) {
        shift @ref_list;
        print "blank row found\n";
        push @new_lines, 'blank row found';
    }
    shift @ref_list;
    print $line, "\n";
    push @new_lines, $line;
}

my $filled_file = 'skipped_rows_added.txt';
open my $out_fh, '>', $filled_file  or die "Can't open $filled_file: $!";
print $out_fh "$_\n" for @new_lines;
close $out_fh;

This behaves the same with the test input above.
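
One caveat, as far as I can tell: if a line's key never shows up in the remaining reference list, the while loop keeps shifting an already empty list and never ends, and rows missing at the very end of the file are not reported. Below is a minimal sketch of a guarded variant, not a definitive implementation. The sub name fill_missing_rows and its arguments are hypothetical, and it splits on whitespace (an assumption) so that it can be run against the space-separated test input.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the input lines with 'blank row found'
# inserted wherever a key from the reference list was skipped.
sub fill_missing_rows {
    my ($lines, $ref_keys) = @_;
    my @ref_list = @$ref_keys;    # work on a copy, the caller's list is untouched
    my @new_lines;

    foreach my $line (@$lines) {
        my $key = (split ' ', $line)[1];
        $key = '' unless defined $key;    # line with fewer than two fields

        # Skip reference entries until one matches, but stop when the list runs out
        while ( @ref_list and $key ne $ref_list[0] ) {
            shift @ref_list;
            push @new_lines, 'blank row found';
        }
        die "Unexpected or out-of-order row: $line\n" unless @ref_list;

        shift @ref_list;
        push @new_lines, $line;
    }

    # Whatever is left in the reference list is missing at the end of the file
    push @new_lines, ('blank row found') x @ref_list;
    return @new_lines;
}

my @filled = fill_missing_rows(
    [ '1 AA first', '2 BB more', '5 EE last' ],
    [ qw(AA BB CC DD EE FF) ],
);
print "$_\n" for @filled;
```

With the extra FF key in the reference list, this prints the two blank rows between BB and EE as before, plus one more after EE. Dying on an unknown key is one choice; skipping or logging such a line would work as well, depending on how strict the check needs to be.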