Ahmed Ahmed - 2 months ago 23
Perl Question

Cosine similarity between strings in Perl

I have a file that contains, for example, this text:

perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala


I found a module that calculates cosine similarity: http://search.cpan.org/~wollmers/Bag-Similarity-0.019/lib/Bag/Similarity/Cosine.pm

I did a simple test in the beginning:

use Bag::Similarity::Cosine;
my $cosine = Bag::Similarity::Cosine->new;
my $similarity = $cosine->similarity(['perl','java','python','php','scala'],['java','pascal','perl','ruby','ada']);
print $similarity;


The result was 0.4.

The problem is that when I read from the file and calculate the cosine between each pair of lines, the results are different. This is the code:

open(F, "/home/ahmed/FILE.txt") or die "Cannot open file";
my @data; # holds one line of the file per element

while (<F>) {
    chomp;
    push @data, $_;
}
#print join " ", @data;

my $cosine = Bag::Similarity::Cosine->new;

for my $i ( 0 .. $#data-1 ) {
    for my $j ( $i + 1 .. $#data ) {
        my $similarity = $cosine->similarity($data[$i],$data[$j]);
        print "line $i has a similarity of $similarity with line $j\n";
        $i + 1,
        $j + 1;
    }
}


The results:

line 0 has a similarity of 0.933424735647156 with line 1
line 0 has a similarity of 0.953945734121021 with line 2
line 0 has a similarity of 0.939759036144578 with line 3
line 1 has a similarity of 0.917585834612093 with line 2
line 1 has a similarity of 0.945092544842746 with line 3
line 2 has a similarity of 0.908826679128811 with line 3


The similarity between the first two lines (line 0 and line 1 in the output) should be 0.4.

I changed the file to look like this:

['perl','java','python','php','scala']
['java','pascal','perl','ruby','ada']
['ASP','awk','php','java','perl']
['C#','ada','python','java','scala']


but I get the same result.
Thank you.

Answer

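Your program has a stray expression ($i + 1, $j + 1;) left over after the print statement. Were you trying to use printf and used print by mistake? More importantly, you pass each whole line to similarity() as a single string, so the module never receives a list of words. That also explains why writing ['perl','java',...] into the file changes nothing: the line is still read as plain text. Split each line into words before comparing. The following works fine for me: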

#!/usr/bin/perl
use strict;
use warnings;
use Bag::Similarity::Cosine;

my $cosine = Bag::Similarity::Cosine->new;
my @data;

while ( <DATA> ) {
    push @data, { map { $_ => 1 } split };
}

for my $i ( 0 .. $#data-1 ) {
    for my $j ( $i + 1 .. $#data ) {
        my $similarity = $cosine->similarity($data[$i],$data[$j]);
        print "line $i has a similarity of $similarity with line $j\n";
    }
}

__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala

Output:

line 0 has a similarity of 0.4 with line 1
line 0 has a similarity of 0.6 with line 2
line 0 has a similarity of 0.6 with line 3
line 1 has a similarity of 0.4 with line 2
line 1 has a similarity of 0.4 with line 3
line 2 has a similarity of 0.2 with line 3
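
To see where the 0.4 comes from, here is a minimal sketch that computes the cosine by hand on word counts, without the module. The helper names word_counts and cosine are made up for this illustration and are not part of Bag::Similarity. It assumes words are separated by whitespace. The first two lines share 2 words out of 5 each, so the result is 2 / (sqrt(5) * sqrt(5)) = 0.4.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how often each whitespace-separated
# word occurs in a line.
sub word_counts {
    my ($line) = @_;
    my %count;
    $count{$_}++ for split ' ', $line;
    return \%count;
}

# Hypothetical helper: cosine similarity of two count hashes,
# i.e. the dot product divided by the product of the vector lengths.
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    for my $w (keys %$x) {
        $nx  += $x->{$w} ** 2;
        $dot += $x->{$w} * $y->{$w} if exists $y->{$w};
    }
    $ny += $_ ** 2 for values %$y;
    return 0 unless $nx && $ny;    # an empty line has no direction
    return $dot / sqrt($nx * $ny);
}

my $first  = word_counts('perl java python php scala');
my $second = word_counts('java pascal perl ruby ada');
printf "%.1f\n", cosine($first, $second);    # prints 0.4
```

The same two subs can be applied to lines read from a file after chomp. The key point is that each line is turned into words before it is compared.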