Ram Ram - 7 months ago 12
Perl Question

Finding the highest frequency of unique words in a string using perl

I am writing a script that finds and prints the most frequent unique words in a paragraph.

This is the code I have so far:

my $freq;
my $word;
my $textdata = <<"END_MSG";
mm mm mm mm kk kk kk kkkk kk To pp pp pp pp pp pp.
END_MSG

foreach $word ( split ( ' ', lc $textdata) ) {
    $freq{$word}++;
}
#print $freq{$word};
#print "..";

use sort 'stable';
my @listing = ( sort { $freq{$b} <=> $freq{$a} } keys %freq)[0..5];
foreach my $word ( @listing ) {
    print $freq{$word}." $word\n";
}

Output 1:

5 pp
4 mm
4 kk
1 pp.
1 to
1 kkkk

Output 2:

5 pp
4 kk
4 mm
1 pp.
1 to
1 kkkk

The words kk and mm both have a frequency of 4, but their order varies each time I run the script.

I want the output to remain the same.

How can I sort so that the order stays the same on every run?

with regards,


Right, the problem here is that you are trying to sort two values that have no relative ordering: kk and mm have the same frequency.

Your result is therefore based on the order in which keys returns them, which is not defined and in practice varies from run to run (Perl has randomized hash order per process since 5.18). The two words are equivalent as far as sort is concerned, so it's irrelevant to sort which way round they come out.

If you run the script several times, keys %freq will return the keys in different orders. Sometimes kk comes before mm and sometimes after, and because sort treats them as identical (same frequency), that input order carries straight through to your output.

So what you need is a secondary sort order, so that the result is consistent. For example, sort alphabetically when the frequency is the same.
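The effect can be reproduced without relying on hash randomization. This is only a minimal sketch: @order_a and @order_b are hypothetical, hand-written stand-ins for what keys %freq might return on two different runs. With a frequency-only comparator, the tie between kk and mm comes out in whatever order it went in.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use sort 'stable';    # make the "equal elements keep input order" behaviour explicit

my %freq = ( pp => 5, mm => 4, kk => 4 );

# Hypothetical stand-ins for two orders that keys %freq might return.
my @order_a = qw( mm kk pp );
my @order_b = qw( kk mm pp );

# Frequency-only comparator: kk and mm compare as equal (both 4).
my @sorted_a = sort { $freq{$b} <=> $freq{$a} } @order_a;
my @sorted_b = sort { $freq{$b} <=> $freq{$a} } @order_b;

print "run A: @sorted_a\n";    # run A: pp mm kk
print "run B: @sorted_b\n";    # run B: pp kk mm
```

Both results are correctly sorted by frequency; they differ only in the part the comparator says nothing about.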

Usefully, <=> combined with the || operator lets you chain comparisons. If you write:

 $condition1 || $condition2;

the value of condition 1 is returned if it is true, and the value of condition 2 otherwise. <=> returns 1, 0 or -1, and only 0 is false here, so you can do:

 $a <=> $b || $a cmp $b

Or in your case:

$freq{$b} <=> $freq{$a} || $a cmp $b
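To see the fall-through in isolation, here is a small sketch using plain variables rather than the sort block's $a and $b. The variable names are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my ( $four, $also_four, $five ) = ( 4, 4, 5 );

# <=> and cmp return -1, 0 or 1. Only 0 is false, so || evaluates its
# right-hand side only when the left-hand comparison is a tie.
my $tie_broken   = ( $four <=> $also_four ) || ( 'kk' cmp 'mm' );  # left is 0, cmp decides
my $left_decides = ( $five <=> $four )      || ( 'aa' cmp 'zz' );  # left is 1, cmp is skipped

print "tie broken by cmp: $tie_broken\n";      # -1
print "left side decides: $left_decides\n";    # 1
```

The second line shows that the string comparison has no say unless the numeric comparison returns 0.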

Put together in a complete script:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

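my @words = qw ( mm mm mm mm kk kk kk kkkk kk To pp pp pp pp pp pp nn nn );
my %freq;
$freq{$_}++ for @words;    # count occurrences of each word
print Dumper \%freq;       # show the raw counts (key order not guaranteed)

# Sort by frequency (descending); break ties by string comparison.
foreach my $word ( sort { $freq{$b} <=> $freq{$a}
                       ||        $a cmp $b        } keys %freq ) {
    print "$word => $freq{$word}\n";
}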

This sorts on frequency first and, where two words have the same frequency, falls back to string order (cmp). So kk => 4 will always be ordered before mm => 4.
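Applied to the input from the question, with its top-six slice kept, the same comparator might look like the sketch below. The names @ranked and $last are introduced here for illustration, and the guard on the slice is an addition that avoids undefined entries when there are fewer than six distinct words.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $textdata = "mm mm mm mm kk kk kk kkkk kk To pp pp pp pp pp pp.\n";

# Count each lower-cased, whitespace-separated token, as in the question.
my %freq;
$freq{$_}++ for split ' ', lc $textdata;

# Frequency descending, then string order to break ties.
my @ranked = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;

# Keep at most the first six entries without running past the end of the list.
my $last    = $#ranked < 5 ? $#ranked : 5;
my @listing = @ranked[ 0 .. $last ];

print "$freq{$_} $_\n" for @listing;
# 5 pp
# 4 kk
# 4 mm
# 1 kkkk
# 1 pp.
# 1 to
```

Note that "pp." is still counted separately from "pp", because splitting on whitespace leaves the trailing full stop attached.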

You should also note in your code:

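  • You declare $freq (a scalar) but never %freq (the hash you actually use). You should declare it, and you should switch on use strict; and use warnings;, which would have told you about this.

  • use sort 'stable'; doesn't change anything here. That isn't really a surprise: sort is working perfectly, and a stable sort only preserves the input order of equal elements, which is exactly the order that varies between runs.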
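As a rough illustration of the first point, the sketch below compiles a three-line snippet under use strict at run time and captures the error. Wrapping it in a string eval is just a device so the failure can be printed rather than aborting the script; $snippet, $ok and $error are illustrative names.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The snippet declares the scalar $freq but then uses the hash %freq,
# which is a different, undeclared variable.
my $snippet = <<'END_SNIPPET';
use strict;
my $freq;
$freq{word}++;
1;
END_SNIPPET

# String eval compiles the snippet at run time, so the compile error
# ends up in $@ rather than stopping this script.
my $ok    = eval $snippet;
my $error = $@;

print defined $ok ? "compiled cleanly\n" : "strict rejected it: $error";
```

The captured message names %freq as a global symbol that requires an explicit package name, which is how strict reports an undeclared variable.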