Bob Bob - 27 days ago 7
Perl Question

Using pointers to lines in a file in perl

I'm trying to use some sort of pointers in perl so that I can look at two at two files that are sorted in alphabetical order and match things in both the files if they have the same name in the first column. The way i'm searching through each file though is I'm looking at which lines first column is lower in alphabetical order and then moving the pointer on that file to the next line. Somewhat similar to the pointers in merge sort. The code below is an example of what I want.

Using these two files.

set1

apple 17 20
boombox 23 29
carl 25 29
cat 22 33
dog 27 44


set2

ants yes
boombox no
carl yes
dentist yes
dice no
dog no


I can make a script that does something like this

($name, $affirmation) = first line in set2; #part I'm confused about I just kind of need some sort of command of something that will do this
while (<>){
@set1 = split;
while ($name < set1[0]){
($name, $affirmation) = next line in set2; # part i'm confused about I just kind of need some sort of command of something that will do this
}
if ($name = $set[0]{
print @set1, $affirmation;
}


This is how I would run it

./script.txt set1


I would end up with

boombox 23 29 no
carl 25 29 yes
dog 27 44 no


.

.

Edit:

I tried some code in some of the answers to see if I could make some functional code out of it but I seem to be running into problems, and some of the syntax in the answers I could not understand so I'm having a lot of trouble figuring out how to debug or solve this.

This is my specific example using the folllowing two text files

text.txt

Apples 0 -1 -1 0 0 0 0 -1
Apricots 0 1 1 0 0 0 0 1
Fruit 0 -1 -1 0 0 0 0 -1
Grapes 0 -2 -1 0 0 0 0 -2
Oranges 0 1 1 0 0 0 0 -1
Peaches 0 -2 -1 0 0 0 0 -2


text2.txt

Apples CHR1 + 1167628 1170420 1 1 N
Apricots CHR1 - 2115898 2144159 1 1 N
Oranges CHR1 - 19665266 19812066 1 1 N
Noidberry CHR1 - 1337728 1329993 1 1 N
Peaches CHR1 - 1337275 1342693 1 1 N


And this script
script.pl

#!/usr/bin/perl
use warnings;
my $file_1 = $ARGV[0];
my $file_2 = $ARGV[1];

open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";

open(my $single, '>', 'text.txt');
open(my $deep, '>', 'text2.txt');
OUTER: while (my $outer = <$fh1>){
chomp $outer;
@CopyNumber = split(' ', $outer);
($title, $title2) = split('\|', $CopyNumber[0]);
#print 'title: ',$title,' title2: ',$title2,"\n";
my $numLoss = 0;
my $deepLoss = 0;
for ($i = 1; $i <= $#CopyNumber; $i++){
#print "$CopyNumber[$i], $#CopyNumber, $i, \n";
if ($CopyNumber[$i] < 0){
$numLoss = $numLoss + 1;
if ($CopyNumber[$i] <-1){
$deepLoss = $deepLoss + 1;
}
}
}
if ($GeneSym and (($GeneSym cmp $title)==0)){ #or (($GeneSym cmp $title2)==0))){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}

INNER: while (my $inner = <$fh2>){
($GeneSym, $Chrom, $Strand, $Start, $Stop, $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split(' ', $inner);
$Chrom =~ s/CHR/hs/ee;
my $cmp = ($GeneSym cmp $title);
next OUTER if $cmp < 0;
if ($cmp==0){ #or (($GeneSym cmp $title2)==0)){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}
}
}


If I run ./script.pl text.txt text2.txt I should get this printed into Number.txt

//corresponding to columns 2,4,5 of text2.txt and the last column being the percentage of columns which have a number lower than 0

hs1 1167628 1170420 0.375 //For Apples
hs1 2115898 2144159 0 //For Apricots
hs1 19665266 19812066 0.125 //For Oranges
hs1 1337275 1342693 0.375 //For Peaches


Instead I get this

hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 1337275 1342693 0.375


So I'm just getting an error where

hs1 19665266 19812066 0.125 //For Oranges


isn't printing

Answer

Quite like you state, with: use cmp for comparison, split line into two terms.

For each line of FILE1 file go through lines of FILE2 file, exiting when a match is found. Once the FILE2 overshoots alphabetically move to the next line of FILE1.

use warnings 'all';
use strict;

sub process {
   my ($name, $affirm_1, $affirm_2) = @_;
   print "$name $affirm_1 $affirm_2\n";
}

my $file_1 = 'set1.txt';
my $file_2 = 'set2.txt';

open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";

my ($name_2, $affirm_2);
FILE1: while (my $line1 = <$fh1>) {
    chomp $line1;
    my ($name_1, $affirm_1) = split ' ', $line1, 2;

    if ($name_2) {
        my $cmp = $name_1 cmp $name_2;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process($name_1, $affirm_1, $affirm_2);
            next FILE1;
        }
    }

    FILE2: while (my $line2 = <$fh2>) {
        chomp $line2;
        ($name_2, $affirm_2) = split ' ', $line2, 2;
        my $cmp = $name_1 cmp $name_2;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process($name_1, $affirm_1, $affirm_2);
            next FILE1;
        }
    }
}

Comments on a few remaining details.

Once a FILE2 line "overshoots," in the next iteration of FILE1 we need to first check that line, before entering the FILE2 loop to iterate over its remaining lines. For the first FILE1 line the $name_2 is still undef thus if ($name_2).


Updated for edited post. Please clarify whether the first field may have word|word.

use warnings 'all';
use strict;

sub process_line {
    my ($single, $deep, $rline, $GeneSym, $Chrom, $Start, $Stop) = @_;
    my ($numLoss, $deepLoss) = calc_loss($rline);
    $Chrom =~ s/CHR/hs/;
    print $single (join "\t", $Chrom, $Start, $Stop, $numLoss/$#$rline), "\n";
    print $deep   (join "\t", $Chrom, $Start, $Stop, $deepLoss/$#$rline), "\n";
}

sub calc_loss {
    my ($rline) = @_; 
    my ($numLoss, $deepLoss) = (0, 0); 
    for my $i (1.. $#$rline) {
        $numLoss  += 1  if $rline->[$i] < 0;
        $deepLoss += 1  if $rline->[$i] < -1; 
    }   
    return $numLoss, $deepLoss;
}

my ($Number,  $NumberDeep) = ('Number.txt', 'NumberDeep.txt');
open my $single, '>', $Number      or die "Can't open $Number: $!";
open my $deep,   '>', $NumberDeep  or die "Can't open $NumberDeep: $!";

my ($file_1, $file_2) = ('set1_new.txt', 'set2_new.txt');    
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";

my ($GeneSym, $Chrom, $Strand, $Start, $Stop, 
    $MapId, $TotalMap, $AbnormalMerge, $Overlap);

FILE1: while (my $line1 = <$fh1>) {
    next if $line1 =~ /^\s*$/;
    chomp $line1;

    my @line = split ' ', $line1;

    if ($GeneSym) {
        my $cmp = $line[0] cmp $GeneSym;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process_line($single, $deep, \@line,
                         $GeneSym, $Chrom, $Start, $Stop);
            next FILE1;
        }   
    }   

    FILE2: while (<$fh2>) {
        next if /^\s*$/;
        chomp;
        ($GeneSym, $Chrom, $Strand, $Start, $Stop, 
             $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split;
        my $cmp = $line[0] cmp $GeneSym;
        next FILE1 if $cmp < 0;
        if ($cmp == 0) {
            process_line($single, $deep, \@line,
                         $GeneSym, $Chrom, $Start, $Stop);
            next FILE1;
        }
    }
}

This produces the desired output with given sample files. Some shortcuts are taken, please let me know if comments would be helpful. Here are a few more comments

  • Much error checking should be added around.

  • I assume that the first field of FILE1 has one word. Otherwise a few changes are needed.

  • Processing is split into two functions, calculations being separate. This is not necessary.

  • $#$rline is the index of the last element for an arrayref $rline. If this is too much syntax to stomach use @$rline - 1.

Some comments on the code posted in the question:

  • Always, always, please use warnings; (and use strict;)

  • loop over indices is best written foreach my $i (0..$#array)

  • The regex modifier /ee is very involved. There is absolutely no need for it here.

Comments