cvirus96 - 4 months ago
Perl Question

Is there a better way to pull duplicate strings out of a file in Perl?



I am currently looping through a log file and pulling certain characteristics out of it. Each log entry contains a string that should be unique; I have to check whether that string is duplicated and, if it is, ignore that log entry. Currently, my code takes an absurd amount of time to run (or I'm in an infinite loop). Is there a better way to pull duplicates out of a file and check them for uniqueness?

close($handle);

$test = "testFile.txt";

open( $handle, '<', $domainAnalysis ) or die "Cannot open file: $!";
open( $hand, '>', $test ) or die "Cannot open file: $!";

my %uniq;

while ( $search = <$handle> ) {

    if ( $search =~ /Mail ID: ([^:]*)\n/g ) {
        $uniq{$search}++;
    }

    my @sortedHash = sort keys %uniq;

    foreach $i (@sortedHash) {

        if ( $i eq $search ) {
            print $hand $search;
            print $hand scalar <$handle> for 1 .. 2;
        }
    }
}


Any help would be greatly appreciated. I am kinda stuck.

Edit:

It currently reads a log file and writes the needed information to a new file. The new file is printed in this format:

Mail ID: b12342534
Domain : someEmail@email.com
Status Message = Sent

Mail ID: a32432234
Domain : someEmail@email.com
Status Message = Deferred


Output: well, the program never actually stops. It takes forever, and my patience won't let it run all the way.

Answer

I'm pretty sure your problem is that inner loop: as you iterate over the log, you'll presumably accumulate a significant number of 'Mail ID' entries.

And on every line you read, you sort them all, then iterate over them all and compare them. That makes the total work grow roughly with the square of the number of IDs.

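And, more importantly, the $search you're inserting into the hash is the WHOLE LINE rather than the ID your regex captured, so the hash keeps growing with every new 'Mail ID' line. Because you compare $i eq $search right after inserting $search, the test succeeds for every 'Mail ID' line, duplicate or not, so nothing is actually filtered out.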

Anyway, given your input data, I'd suggest that first off you set $/ so that each read returns a whole record:

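local $/ = '';    # paragraph mode: each read returns one blank-line-separated record
my %seen;
while ( <$input> ) {
    # \S+ assumes IDs contain no whitespace; [^:]* would run on into the next line.
    my ( $id ) = m/Mail ID: (\S+)/ or next;    # skip records without a Mail ID
    print unless $seen{$id}++;
}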

This prints a record only the first time a particular mail ID is seen.
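
For anyone who wants to try the idea without a log file at hand, here is a minimal self-contained sketch. The sub name first_records and the sample data are made up for illustration; it reads from an in-memory filehandle and captures the ID with \S+, on the assumption that mail IDs contain no whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first record seen for each Mail ID.
sub first_records {
    my ($fh) = @_;
    local $/ = '';    # paragraph mode: one blank-line-separated record per read
    my ( %seen, @kept );
    while ( my $record = <$fh> ) {
        # Assumes IDs contain no whitespace; records without an ID are skipped.
        my ($id) = $record =~ /^Mail ID: (\S+)/m or next;
        push @kept, $record unless $seen{$id}++;
    }
    return @kept;
}

# Made-up sample data in the format shown in the question.
my $log = <<'END';
Mail ID: b12342534
Domain : someEmail@email.com
Status Message = Sent

Mail ID: a32432234
Domain : someEmail@email.com
Status Message = Deferred

Mail ID: b12342534
Domain : someEmail@email.com
Status Message = Sent
END

open( my $in, '<', \$log ) or die "Cannot open in-memory file: $!";
print first_records($in);    # the repeated b12342534 record is not printed again
close($in);
```

Each ID costs a single hash lookup, so the work grows linearly with the size of the file instead of re-sorting every key on every line.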

(Of course, if you only want to print the duplicates, you can use 'if' instead of 'unless'; that prints every occurrence of an ID after the first.)
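
If "ignore that log" in the question means dropping every record whose ID occurs more than once (rather than keeping the first copy), a single pass is not enough, because you cannot know an ID is unique until the whole file has been read. One possible sketch, assuming the file fits in memory and that IDs contain no whitespace, counts first and filters second. The name unique_records and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only records whose Mail ID occurs exactly once.
sub unique_records {
    my ($fh) = @_;
    local $/ = '';    # paragraph mode
    my ( %count, @records );
    while ( my $record = <$fh> ) {
        my ($id) = $record =~ /^Mail ID: (\S+)/m or next;
        $count{$id}++;
        push @records, [ $id, $record ];    # remember input order
    }
    # Second pass over the stored records: drop every ID seen more than once.
    return map { $_->[1] } grep { $count{ $_->[0] } == 1 } @records;
}

# Made-up sample: b12342534 appears twice, a32432234 once.
my $log = join "\n",
    "Mail ID: b12342534\nDomain : someEmail\@email.com\nStatus Message = Sent\n",
    "Mail ID: a32432234\nDomain : someEmail\@email.com\nStatus Message = Deferred\n",
    "Mail ID: b12342534\nDomain : someEmail\@email.com\nStatus Message = Sent\n";

open( my $in, '<', \$log ) or die "Cannot open in-memory file: $!";
print unique_records($in);    # only the a32432234 record survives
close($in);
```

For logs too large to hold in memory, the same idea could be done by reading the file twice: once to fill the count hash, and once more to print.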
