user3781528 - 3 months ago
Perl Question

Matching string with substrings

I’m working with multiple .vcf files in a directory (on a Linux server) and a tab-delimited key file that contains the sample names and their corresponding barcodes.

Here is how the files are named:

RA_4090_v1_RA_4090_RNA_v1.vcf
RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf
RA_565_v1.vcf
RA_565_dup_v1.vcf
RA_HCC-78-2.vcf


Here are the contents of the key file:

Barcode ID      Sample Name
IonSelect-2     RA_4090
IonSelect-4     RA_565
IonSelect-6     RA_HCC-78-2
IonSelect-10    RA_4090_dup
IonSelect-12    RA_565_dup


I need to correlate the correct sample names with each .vcf file and then rename each .vcf file.

There is always exactly one .vcf file for each sample. However, the sample names are not standardized, and sometimes several of them begin with the same substring (for example, RA_4090 and RA_4090_dup), which makes it impossible to match them up correctly with a plain substring search.

The following code works well when the sample names are different but fails if multiple sample names begin with the same substring. I have no idea how to account for multiple sample names that begin with the same substring.

Please suggest something that will work. Here is the current code:

#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);

my $home="/data/";
my $bam_directory = $home."test_all_runs/".$ARGV[0];

my $matrix_key = $home."test_all_runs/".$ARGV[0]."/key.txt";

my @matrix_key = ();

open(TXT2, "$matrix_key") or die "Can't open '$matrix_key': $!";
while (<TXT2>){
    push (@matrix_key, $_);
}
close(TXT2);

my @ant_vcf = glob "$bam_directory/*.vcf";

for my $tsv_file (@ant_vcf){

    my $matrix_barcode_vcf = "";
    my $matrix_sample_vcf = "";

    foreach (@matrix_key){
        chomp($_);
        my @matrix_key = split ("\t", $_);##
        if (index ($tsv_file,$matrix_key[1]) != -1) {
            $matrix_barcode_vcf = $matrix_key[0]; print $matrix_key[0];
            $matrix_sample_vcf = $matrix_key[1];
            chomp $matrix_barcode_vcf;
            chomp $matrix_sample_vcf;
            #print $bam_directory."/".$matrix_sample_id."_".$matrix_barcode.".bam";
            move $tsv_file, $bam_directory."/".$matrix_sample_vcf."_".$matrix_sample_vcf.".vcf";
        }
    }

}

Answer

"The following code works well when the sample names are different but fails if multiple sample names begin with the same substring. I have no idea how to account for multiple sample names that begin with the same substring."

The key to solving your problem is sorting the 'Sample Name' values by length, longest first.

For example, RA_4090_dup should come before RA_4090 in the @matrix_key array, so that the longer string is tried first. Then, after a match, you stop searching (I used first from the List::Util module, which has been part of core Perl since version 5.8).

#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'first';

my @files = qw(
RA_4090_v1_RA_4090_RNA_v1.vcf
RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf
RA_565_v1.vcf
RA_565_dup_v1.vcf
RA_HCC-78-2.vcf
);

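open my $key, '<', 'junk.txt' or die "Can't open key file: $!"; # key file

<$key>; # throw away header line in key file (first line)

# Split each line on whitespace into [barcode, sample], skip blank lines,
# and sort by sample-name length, longest first.
my @matrix_key = sort { length($b->[1]) <=> length($a->[1]) }
                 grep { @$_ >= 2 }
                 map  { [ split ] } <$key>;
close $key or die $!;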

for my $tsv_file (@files) {
    if ( my $aref = first { index($tsv_file, $_->[1]) != -1 } @matrix_key ) {
        print "$tsv_file \t MATCHES $aref->[1]\n";
        print "\t$aref->[1]_$aref->[0]\n\n";    
    }
}

This produces the following output:

RA_4090_v1_RA_4090_RNA_v1.vcf    MATCHES RA_4090
        RA_4090_IonSelect-2

RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf    MATCHES RA_4090_dup
        RA_4090_dup_IonSelect-10

RA_565_v1.vcf    MATCHES RA_565
        RA_565_IonSelect-4

RA_565_dup_v1.vcf        MATCHES RA_565_dup
        RA_565_dup_IonSelect-12

RA_HCC-78-2.vcf          MATCHES RA_HCC-78-2
        RA_HCC-78-2_IonSelect-6
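
The script above only prints the proposed names. If you also want to do the rename, here is a minimal sketch of that step. It rests on a few assumptions that are not in your original code: the key file is split on tabs (as you describe it), the new name is <sample>_<barcode>.vcf as in the output above, and the match is done against the file name only rather than the full path, so a directory name cannot cause a false match. The helper names read_key and rename_vcfs are made up for illustration, and the demo runs in a temporary directory so it does not touch real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(move);
use File::Temp qw(tempdir);
use List::Util qw(first);

# Read the tab-delimited key file; return [barcode, sample] pairs sorted
# longest sample name first, so RA_4090_dup is tried before RA_4090.
sub read_key {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open '$path': $!";
    my $header = <$fh>;    # discard the header line
    my @rows;
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;
        my ( $barcode, $sample ) = split /\t/, $line;
        push @rows, [ $barcode, $sample ] if defined $sample && length $sample;
    }
    close $fh or die "Can't close '$path': $!";
    my @sorted = sort { length( $b->[1] ) <=> length( $a->[1] ) } @rows;
    return @sorted;
}

# Rename every .vcf file in $dir to <sample>_<barcode>.vcf (assumed naming
# scheme) and return the [old, new] file-name pairs that were renamed.
sub rename_vcfs {
    my ( $dir, @key ) = @_;
    opendir my $dh, $dir or die "Can't read '$dir': $!";
    my @names = sort grep { /\.vcf\z/ } readdir $dh;
    closedir $dh;

    my @done;
    for my $old (@names) {
        # First hit wins; because @key is longest-first, it is the longest match.
        my $hit = first { index( $old, $_->[1] ) != -1 } @key;
        if ( !$hit ) { warn "No sample matches '$old'\n"; next; }
        my $new = "$hit->[1]_$hit->[0].vcf";
        next if $new eq $old;
        if ( -e "$dir/$new" ) { warn "'$new' exists, skipping '$old'\n"; next; }
        move( "$dir/$old", "$dir/$new" ) or die "Can't rename '$old': $!";
        push @done, [ $old, $new ];
    }
    return @done;
}

# Demo in a throwaway directory, using the names and key from the question.
my $dir = tempdir( CLEANUP => 1 );
open my $out, '>', "$dir/key.txt" or die "Can't write key file: $!";
print {$out} "Barcode ID\tSample Name\n";
print {$out} "$_\n" for (
    "IonSelect-2\tRA_4090",
    "IonSelect-4\tRA_565",
    "IonSelect-6\tRA_HCC-78-2",
    "IonSelect-10\tRA_4090_dup",
    "IonSelect-12\tRA_565_dup",
);
close $out or die "Can't close key file: $!";

for my $name (qw(
    RA_4090_v1_RA_4090_RNA_v1.vcf
    RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf
    RA_565_v1.vcf
    RA_565_dup_v1.vcf
    RA_HCC-78-2.vcf
)) {
    open my $fh, '>', "$dir/$name" or die "Can't create '$name': $!";
    close $fh;
}

my @renamed = rename_vcfs( $dir, read_key("$dir/key.txt") );
print "$_->[0] -> $_->[1]\n" for @renamed;
```

For real use you would replace the demo part with your own directory, e.g. rename_vcfs($bam_directory, read_key("$bam_directory/key.txt")). The check for an existing target is there so that one file is never silently overwritten by another; whether to skip or to stop in that case is up to you.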