fugu fugu - 1 year ago 43
Perl Question

How to search through array elements for match in hash keys

I've an array that contains unique IDs (numeric) for DNA sequences. I've put my DNA sequences in a hash so that each key contains a descriptive header, and its value is the DNA sequence. Each header in this list contains gene information and is suffixed with its unique ID number:

Unique ID: 14272

Header(hash key): PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272

Sequence (hash value): ATGGGTC...

I want to cycle through each Unique ID and see if it matches the number at the end of each header(hash key) and, if so, print the hash key + value into a file. So far I've got this:

my %hash;
@hash{@hash_index} = @hash_seq;

foreach $hash_index (sort keys %hash) {
    for ($i=0; $i <= $#scaffoldnames; $i++) {
        if ($hash_index =~ /$scaffoldnames[$i]/) {
            print GENE_ID "$hash_index\n$hash{$hash_index}\n";
        }
    }
}

Whereby the unique IDs are contained within @scaffoldnames.

This doesn't work! I'm unsure as to how best to loop through both the hash and the array to find a match.

I'll expand below:

Upstream code:

foreach (@scaffoldnames) {
    s/\D//g;    # Remove all non-numerics
}

my @genes = read_file('splice.txt'); #Splice.txt is a fasta file

my $hash_index = '';
my $hash_seq = '';
foreach (@genes) {
    if (/^>/) {
        my $head = $_;
        $hash_index .= $head; #Collect all heads for hash
    }
    else {
        my $sequence = $_;
        $hash_seq .= $sequence; #Collect all sequences for hash
    }
}

my @hash_index = split(/\n/,$hash_index); #element[0]=head1, element[1]=head2
my @hash_seq = split(/\n/, $hash_seq); #element[0]=seq1, element[1]=seq2

my %hash; # Make hash from both arrays - heads as keys, seqs as values
@hash{@hash_index} = @hash_seq;

foreach $hash_index (sort keys %hash) {
    for ($i=0; $i <= $#scaffoldnames; $i++) {
        if ($hash_index =~ /$scaffoldnames[$i]$/) {
            print GENE_ID "$hash_index\n$hash{$hash_index}\n";
        }
    }
}

I'm trying to isolate all differentially expressed genes (by unique ID) as output by cuffdiff (RNA-Seq) and relate them to the scaffolds (in this case expressed sequences) from which they came.

I'm hoping therefore that I can isolate each unique ID and search through the original fasta file to pull out the header it matches and the sequence it's associated with.

Answer

You seem to have missed the point of hashes: they are used to index your data by keys so that you can access the relevant information in one step, like you can with arrays. Looping over every hash element kinda spoils the point. For instance, you wouldn't write

my $value;

for my $i (0 .. $#data) {
  $value = $data[$i] if $i == 5;
}

you would simply do this

my $value = $data[5];
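The same reasoning applies to a hash. As a minimal sketch, assuming a small made-up hash called %seq_for that maps IDs to sequences (the name and the data are invented for illustration), the loop and the direct lookup return the same value, but the lookup gets there in one step:

```perl
use strict;
use warnings;

# Hypothetical data, invented for illustration: sequence ID => sequence.
my %seq_for = (
    14272 => 'ATGGGTC',
    99001 => 'ATGCCCA',
);

# The roundabout way: visit every key just to find one of them.
my $found_by_loop;
for my $key (keys %seq_for) {
    $found_by_loop = $seq_for{$key} if $key == 14272;
}

# The direct way: a single lookup by key.
my $found_direct = $seq_for{14272};

print "$found_direct\n";
```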

It is hard to help properly without some more information about where your information has come from and exactly what it is you want, but this code should help.

I have used one-element arrays that I think look like what you are using, and built a hash that indexes both the header and the sequence as a two-element array, using the ID (the trailing digits of the header) as a key. Then you can just look up the information for, say, ID 14272 using $hash{14272}. The header is $hash{14272}[0] and the sequence is $hash{14272}[1].

If you provide more of an indication about your circumstances then we can help you further.

use strict;
use warnings;

my @hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my @hash_seq = ('ATGGGTC...');

my @scaffoldnames = (14272);

my %hash = map {
  my ($key) = $hash_index[$_] =~ /(\d+)\z/;
  $key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;

open my $gene_fh, '>', 'gene_id.txt' or die $!;

for my $name (@scaffoldnames) {
  next unless my $info = $hash{$name};
  printf $gene_fh "%s\n%s\n", @$info;
}

close $gene_fh;
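Reading a record back out of a hash shaped like this might look as follows. This is only a sketch: the second record and the ID 99999 are invented for illustration, and the sequence is shortened.

```perl
use strict;
use warnings;

# Same shape as the hash built above: ID => [ header, sequence ].
# The second record is made up for illustration.
my %hash = (
    14272 => [ 'PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272', 'ATGGGTC' ],
    555   => [ 'HypotheticalHeader555', 'ATGAAA' ],
);

# Unpack both fields of one record in a single step.
my ($header, $sequence) = @{ $hash{14272} };

# exists tests for the key without touching the value, so an ID that
# has no record is reported instead of being dereferenced.
my $missing = exists $hash{99999} ? 'found' : 'not found';

print "$header\n$sequence\n";
print "ID 99999: $missing\n";
```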


From the new code you have posted it looks like you can replace that section with this code.

It works by taking the trailing digits from every sequence header that it finds, and using that as a key to choose a hash element to append the data to. The hash values are the header and the sequence, all in a single string. If you have a reason for keeping them separate then please let me know.

foreach (@scaffoldnames) {
    s/\D//g;    # Remove all non-numerics
}

open my $splice_fh, '<', 'splice.txt' or die $!;    # splice.txt is a FASTA file

my %sequences;

my $id;
while (<$splice_fh>) {
    ($id) = /(\d+)$/ if /^>/;
    $sequences{$id} .= $_ if $id;
}

for my $id (@scaffoldnames) {
    if (my $sequence = $sequences{$id}) {
        print GENE_ID $sequence;
    }
}
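To try the idea without the real files, here is a self-contained sketch of the same approach. It makes a few assumptions: the FASTA text is invented and read from a string through an in-memory filehandle instead of splice.txt, the second header and the ID 99999 are made up, and the result is collected in $output rather than printed to GENE_ID.

```perl
use strict;
use warnings;

# Made-up FASTA text standing in for splice.txt (illustration only).
my $fasta = <<'END_FASTA';
>PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272
ATGGGTC
CCGTA
>AnotherHypotheticalHeader555
ATGAAA
END_FASTA

my @scaffoldnames = (14272, 99999);    # 99999 has no matching record

# Open the string as an in-memory file so the loop reads it line by line.
open my $splice_fh, '<', \$fasta or die $!;

my %sequences;
my $id;
while (<$splice_fh>) {
    ($id) = /(\d+)$/ if /^>/;                 # header line: take its trailing digits as the ID
    $sequences{$id} .= $_ if defined $id;     # append header and sequence lines to that record
}
close $splice_fh;

# Collect the records for the wanted IDs, skipping IDs with no record.
my $output = '';
for my $wanted (@scaffoldnames) {
    next unless exists $sequences{$wanted};
    $output .= $sequences{$wanted};
}

print $output;
```

A sequence that spans several lines stays together here, because every line after a header is appended to the record of the most recent ID.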