Cambiata Cambiata - 5 months ago 38
PHP Question

Algorithms for string similarities (better than Levenshtein, and similar_text)? Php, Js

Where can I find algorithms that values the spelling of misplaced characters more accurately than levenshtein() and php similar_text() methods?

Example:

similar_text('jonas', 'xxjon', $similar); echo $similar; // returns 60
similar_text('jonas', 'asjon', $similar); echo $similar; // returns 60 <- although more similar!
echo levenshtein('jonas', 'xxjon'); // returns 4
echo levenshtein('jonas', 'asjon'); // returns 4 <- although more similar!


/ Jonas

Answer

Here's a solution that I've come up to. It's based on Tim's suggestion of comparing the order of subsequent charachters. Some results:

  • jonas / jonax : 0.8
  • jonas / sjona : 0.68
  • jonas / sjonas : 0.66
  • jonas / asjon : 0.52
  • jonas / xxjon : 0.36

I'm sure i isn't perfect, and that it could be optimized, but nevertheless it seems to produce the results that I'm after... One weak spot is that when strings have different length, it produces different result when the values are swapped...

static public function string_compare($str_a, $str_b) 
{
    $length = strlen($str_a);
    $length_b = strlen($str_b);

    $i = 0;
    $segmentcount = 0;
    $segmentsinfo = array();
    $segment = '';
    while ($i < $length) 
    {
        $char = substr($str_a, $i, 1);
        if (strpos($str_b, $char) !== FALSE) 
        {               
            $segment = $segment.$char;
            if (strpos($str_b, $segment) !== FALSE) 
            {
                $segmentpos_a = $i - strlen($segment) + 1;
                $segmentpos_b = strpos($str_b, $segment);
                $positiondiff = abs($segmentpos_a - $segmentpos_b);
                $posfactor = ($length - $positiondiff) / $length_b; // <-- ?
                $lengthfactor = strlen($segment)/$length;
                $segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
            } 
            else 
            {
                $segment = '';
                $i--;
                $segmentcount++;
            } 
        } 
        else 
        {
            $segment = '';
            $segmentcount++;
        }
        $i++;
    }   

    // PHP 5.3 lambda in array_map      
    $totalscore = array_sum(array_map(function($v) { return $v['score'];  }, $segmentsinfo));
    return $totalscore;     
}