Rob - 3 months ago
Perl Question

Perl: removing unique lines between two files

Instead of removing duplicate lines, I am interested in removing unique lines found between two files. The files have different formats.

File 1:

m160505_031746_42156_s1_p0|105337|10450_16161
m160505_031746_42156_s1_p0|104750|20537_27903
m160505_031746_42156_s1_p0|103809|17563_25308
m160505_031746_42156_s1_p0|103217|8075_11486


File 2 (tab separated):

acCAATCCCATCACCATCtt m160505_031746_42156_s1_p0|105337|10450_16161
atTAAAATACCATTATATgg m160505_031746_42156_s1_p0|104750|20537_27903
caAACTCCAACTACGAACtg m160505_031746_42156_s1_p0|103809|17563_25308
atCTATTTAAACCTAATCgg m160505_031746_42156_s1_p0|103217|8075_11486
acCAATCCCATCACCATCtt m160505_031746_42156_s1_p0|152092|36592_40830
atTAAAATACCATTATATgg m160505_031746_42156_s1_p0|143825|13009_23809
caAACTCCAACTACGAACtg m160505_031746_42156_s1_p0|143710|0_20191
atCTATTTAAACCTAATCgg m160505_031746_42156_s1_p0|140833|25358_34709


Column 2 of File 2 contains the same lines as File 1, preceded by a 20-letter sequence in column 1. Each 20-letter sequence is repeated in File 2 (several times, not just twice), with a different column 2 entry at each occurrence.

I would like to match the sequences in File 1 with the second column in File 2. If there is a match, I would like to then generate a new file with both columns for each match, maintaining the relationship File 2 has between the two columns. In effect, I am looking to simply remove the rows in File 2 that do not have column 2 matches in File 1.

I realize my code needs help, but here is what I have so far to give you more of an idea of how I am thinking. I will probably end up needing to use a hash, although I am worried about doing so because of the repeats in column 1. I don't want to lose those and their relationships to column 2.

use strict;
use warnings;

open(OUT, '>', '/path/to/out.txt') or die $!;
open(FMT0, '<', '/path/to/fmt0.txt') or die $!;

my $regex = qr/m160505_.*/;
while(my $line = <FMT0>){
$line =~ $regex;
open(FMT6, '<', '/path/to/fmt6.txt') or die $!;
while(my $zero_fmt = <FMT6>){
if ($zero_fmt =~ /([A-Z]{20})\t($line)/i){
print OUT $zero_fmt;
}
}
}


Thanks for the help!

mwp
Answer

Something like this might get the job done. :-)

grep -f <(grep ^m160505_ file1) file2
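
One caveat, offered as a sketch and not a replacement: plain grep -f treats every line of file1 as a regular expression. Adding -F makes grep compare them as fixed strings instead. The sample rows below are abbreviated from the question, the file names are placeholders, and the keys go through a temporary file so the example does not rely on process substitution.

```shell
# Sample data abbreviated from the question; paths are placeholders.
tmp=$(mktemp -d)
printf '%s\n' \
  'm160505_031746_42156_s1_p0|105337|10450_16161' \
  'm160505_031746_42156_s1_p0|103217|8075_11486' > "$tmp/file1"
printf '%s\t%s\n' \
  'acCAATCCCATCACCATCtt' 'm160505_031746_42156_s1_p0|105337|10450_16161' \
  'atTAAAATACCATTATATgg' 'm160505_031746_42156_s1_p0|143825|13009_23809' \
  'atCTATTTAAACCTAATCgg' 'm160505_031746_42156_s1_p0|103217|8075_11486' > "$tmp/file2"

# Collect the keys from file1, then match them in file2 as fixed strings (-F).
grep '^m160505_' "$tmp/file1" > "$tmp/keys"
result=$(grep -F -f "$tmp/keys" "$tmp/file2")
printf '%s\n' "$result"
rm -r "$tmp"
```

This is still a substring match anywhere on the line, so a key that happens to be a prefix of a longer ID would match that longer ID too.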

Here's a Perl solution, since that's what you asked for:

#!/usr/bin/env perl

use strict;
use warnings;

die "usage: $0 <file1> <file2>\n"
  unless @ARGV == 2;

open(my $file1, '<', $ARGV[0])
  or die "Could not open file1: $!\n";

my %keys;
while (<$file1>) {
  chomp;
  $keys{$_} = 1 if /^m160505_/;
}

close($file1);

open(my $file2, '<', $ARGV[1])
  or die "Could not open file2: $!\n";

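while (<$file2>) {
  chomp;
  # Column 2 is everything after the first tab.
  my ($key) = /\t(.+)$/;
  # Skip lines without a tab so $key is never undefined in the lookup.
  print "$_\n" if defined $key && $keys{$key};
}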

close($file2);

In action:

$ grep -f <(grep ^m160505_ file1) file2
acCAATCCCATCACCATCtt    m160505_031746_42156_s1_p0|105337|10450_16161
atTAAAATACCATTATATgg    m160505_031746_42156_s1_p0|104750|20537_27903
caAACTCCAACTACGAACtg    m160505_031746_42156_s1_p0|103809|17563_25308
atCTATTTAAACCTAATCgg    m160505_031746_42156_s1_p0|103217|8075_11486

$ ./atgc.pl file1 file2
acCAATCCCATCACCATCtt    m160505_031746_42156_s1_p0|105337|10450_16161
atTAAAATACCATTATATgg    m160505_031746_42156_s1_p0|104750|20537_27903
caAACTCCAACTACGAACtg    m160505_031746_42156_s1_p0|103809|17563_25308
atCTATTTAAACCTAATCgg    m160505_031746_42156_s1_p0|103217|8075_11486
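
If an exact comparison on column 2 is wanted without Perl, the same hash lookup can be sketched in awk. This is only an illustration under assumptions: the file names are placeholders, the sample rows are abbreviated from the question, and the last row of file2 is a made-up longer ID added to show that a near match is rejected. The lookup is keyed on the File 1 lines, not on column 1, so repeated 20-letter values in column 1 are kept whenever their column 2 matches.

```shell
# Sample data abbreviated from the question; paths are placeholders.
tmp=$(mktemp -d)
printf '%s\n' \
  'm160505_031746_42156_s1_p0|105337|10450_16161' \
  'm160505_031746_42156_s1_p0|103217|8075_11486' > "$tmp/file1"
# The last row is hypothetical: its ID only extends a real key by one digit.
printf '%s\t%s\n' \
  'acCAATCCCATCACCATCtt' 'm160505_031746_42156_s1_p0|105337|10450_16161' \
  'acCAATCCCATCACCATCtt' 'm160505_031746_42156_s1_p0|152092|36592_40830' \
  'atCTATTTAAACCTAATCgg' 'm160505_031746_42156_s1_p0|103217|8075_11486' \
  'atCTATTTAAACCTAATCgg' 'm160505_031746_42156_s1_p0|103217|8075_114860' > "$tmp/file2"

# First file (NR == FNR): remember each key line in an associative array.
# Second file: print the row only if column 2 is exactly one of those keys.
matched=$(awk -F '\t' '
  NR == FNR { if (/^m160505_/) keys[$0]; next }
  $2 in keys
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$matched"
rm -r "$tmp"
```

Because the test is an exact lookup on the whole of column 2, the extended ID in the last row is dropped, whereas a substring match with grep would keep it.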