Francesco Francesco - 4 months ago 10
MySQL Question

MySQL: how to efficiently select strings that have only one character difference?

Is it possible in MySQL to select all strings from a table that have only one character difference from a given string?

eg.

Iphone 5
and
Iphone 5s
or
Model x-1
and
Models x1

Answer

If you can add a user function to MySQL, then you could use Levenshtein's distance. See this other question for the code.

Then you can query WHERE LEVENSHTEIN(description, 'iphone 5') <= 2 for example. You would find "iphone 5S", and also "ipohne 5", which might be a plus.

Otherwise, specific cases are easy (e.g, REGEX 'iphone.*' or similar), but the general case would be a nightmare to implement.

Performances

This version is modified to accept a third parameter, MAXCOST. When the maxcost for a levenshtein subloop is reached, the search function will simply abort. Excluding pathological cases, and keeping in mind that the actual cost might be slightly off (i.e. excluding a cost of 12 could include strings with cost 12, and exclude strings which would have cost 11), this cuts down on those cases where the Levenshtein distance is unreasonable. Those cases are probably of no interest to you.

So if you want strings with distance less than 3, you would use WHERE levenshtein(string1, string2, 4) < 3. All strings with distance 3-4 or above will now return 4 quickly, and be excluded.

Test:

mysql> select BENCHMARK(1000,levenshtein('PIPPORIDICOLO','LUKASPERICOLO', 99));
+------------------------------------------------------------------+
| BENCHMARK(1000,levenshtein('PIPPORIDICOLO','LUKASPERICOLO', 99)) |
+------------------------------------------------------------------+
|                                                                0 |
+------------------------------------------------------------------+
1 row in set (2.50 sec)

mysql> select BENCHMARK(1000,levenshtein('PIPPORIDICOLO','LUKASPERICOLO', 3));
+-----------------------------------------------------------------+
| BENCHMARK(1000,levenshtein('PIPPORIDICOLO','LUKASPERICOLO', 3)) |
+-----------------------------------------------------------------+
|                                                               0 |
+-----------------------------------------------------------------+
1 row in set (0.08 sec)

Here the cost has decreased from 2500 to 80 milliseconds per 1000 evaluations.

This is the modified code. You will have to drop the old function before defining this new one, unless you change the name.

DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255), maxcost INTEGER)
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
    RETURN 0;
ELSEIF s1_len = 0 THEN
    RETURN s2_len;
ELSEIF s2_len = 0 THEN
    RETURN s1_len;
ELSE
    WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
            SET c = c + 1;
            IF c > maxcost THEN
                 RETURN maxcost;
            END IF;
            IF s1_char = SUBSTRING(s2, j, 1) THEN
                SET cost = 0;
            ELSE
                SET cost = 1;
            END IF;
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
            IF c > c_temp THEN SET c = c_temp; END IF;
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
            IF c > c_temp THEN SET c = c_temp; END IF;
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
    END WHILE;
END IF;
RETURN c;
END$$
DELIMITER ;