DKru - 4 months ago
Perl Question

Perl: duplicate keys not overwriting in hash

I have a problem that I can't seem to find an answer for.

I have a CSV file that contains performance records for different individuals. There is only supposed to be one record per individual; however, some individuals have several records with different information. I would like to compare this first file to a second file that also lists individuals, but only to check whether each individual in file 1 also has a record in file 2 (file 2 does not have duplicates). The individuals' IDs are unique.

Example for file 1:

ID number A B C D
4011NM16001 apple 24 sunday 2016-01-01
4011NM16001 apple 16 wednesday 2016-01-01
4012NM15687 pear 16 sunday 2015-04-19
4012NM15002 banana 8 monday 2015-09-09
4012NM14301 peach 10 wednesday 2014-03-18
4012NM14301 peach 18 wednesday 2014-03-18


I have opened the first file and tried to put the data into a hash (or rather a combination of a hash and an array, if I understand the concepts correctly) to remove the duplicates, using the ID as the unique key. However, instead of entries with the same ID being overwritten, they still seem to be added, so I still end up with the duplicate records.

I want to see this:

ID number
4011NM16001
4012NM15687
4012NM15002
4012NM14301


But instead I still see this:

ID number
4011NM16001
4011NM16001
4012NM15687
4012NM15002
4012NM14301
4012NM14301


Have I typed something wrong in my code, or am I not using the hash correctly? I'm still new to Perl, so I reuse parts of previous programs and try to learn as I go.

#!/usr/bin/env perl
use DBI;

use strict;
use warnings;

my $file1 = 'location1.csv'; #file1 containing records with duplicates
my $exists = 'location3.csv'; #output file with unique IDs that will be compared to file2

open (EXISTS, ">$exists") or die "Cannot open $exists";
print EXISTS "ID number\n";

open (FILE1, "$file1") or die "Cannot open $file1";

while (<FILE1>){

my %file1;

my $line = $_;
$line =~ s/\s*$//g;

my ($ID, $a, $b, $c, $d) = split('\,', $line);
next if !$ID or substr($ID,0,2) eq 'ID';

$file1{$ID}[0]=$ID; #unique ID number
$file1{$ID}[1]=$a; #record a
$file1{$ID}[2]-$b; #record b
$file1{$ID}[3]=$c; #record c
$file1{$ID}[4]=$d; #record d

print EXISTS "$file1{$ID}[0]\n";

}

exit;


Thanks so much!

Answer

In addition to choroba's diagnosis, you need to declare the hash outside the while loop; otherwise each iteration of the loop starts with a new, empty hash, so no ID can ever be recognised as a duplicate.
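
To illustrate the scoping point, here is a minimal sketch that uses the IDs from the question's sample data in an in-memory list instead of reading the file. The array names (@ids, @inside, @outside) and the %seen hash are made up for this illustration and are not part of the original program.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample IDs taken from the question's file 1 (duplicates included)
my @ids = qw(
    4011NM16001 4011NM16001 4012NM15687
    4012NM15002 4012NM14301 4012NM14301
);

# Hash declared inside the loop: every iteration gets a fresh, empty hash,
# so no ID is ever recognised as already seen
my @inside;
for my $id (@ids) {
    my %seen;
    push @inside, $id unless $seen{$id}++;
}

# Hash declared outside the loop: it persists across iterations,
# so repeated IDs are skipped
my %seen;
my @outside;
for my $id (@ids) {
    push @outside, $id unless $seen{$id}++;
}

print scalar(@inside),  " IDs kept with the hash declared inside the loop\n";
print scalar(@outside), " IDs kept with the hash declared outside the loop\n";
```

With this sample data the first loop keeps all 6 IDs and the second keeps only the 4 unique ones.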

Here's a version of your code that uses best-practice Perl and produces the result that you wanted. Note that I've had to alter the format of your input file location1.csv, as the values you show don't contain any commas.

#!/usr/bin/env perl

use strict;
use warnings;

my $file1  = 'location1.csv';    # file1 containing records with duplicates
my $exists = 'location3.csv';    # output file with unique IDs that will be compared to file2

open my $exists_fh, '>', $exists or die qq{Unable to open "$exists" for output: $!};
print $exists_fh "ID number\n";

open my $file1_fh, '<', $file1 or die qq{Unable to open "$file1" for input: $!};
<$file1_fh>; # skip header line

my %file1;

while ( <$file1_fh> ) {

    next unless /\S/; # Skip blank lines

    s/\s+\z//;

    my @fields = split /,/;
    my $id = $fields[0];

    next if $file1{$id}; # Skip this record if the ID is already known

    $file1{$id} = \@fields;

    print $exists_fh "$id\n";
}

Output:

ID number
4011NM16001
4012NM15687
4012NM15002
4012NM14301
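
For the comparison step mentioned in the question (checking whether each individual in file 1 also has a record in file 2), one possible approach is a second lookup hash keyed by the file 2 IDs. The sketch below is only an illustration: the file 2 IDs are hypothetical, since that file is not shown in the question, and the data is held in arrays rather than read from disk.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Unique IDs from file 1, as written to location3.csv above
my @file1_ids = qw(4011NM16001 4012NM15687 4012NM15002 4012NM14301);

# Hypothetical IDs standing in for file 2 (the real file is not shown)
my @file2_ids = qw(4012NM15687 4012NM14301 4013NM00001);

# Build a lookup hash keyed by the file 2 IDs
my %in_file2 = map { $_ => 1 } @file2_ids;

# Split the file 1 IDs into those with and without a record in file 2
my @matched = grep {  exists $in_file2{$_} } @file1_ids;
my @missing = grep { !exists $in_file2{$_} } @file1_ids;

print "In both files:  @matched\n";
print "Only in file 1: @missing\n";
```

In a real script the two arrays would be filled by reading the first field of each line from the respective files, in the same way as the loop above.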