Francesco - 4 months ago 10

MySQL Question

Is it possible in MySQL to select all strings from a table that have only one character difference from a given string?

eg.

`Iphone 5`

`Iphone 5s`

`Model x-1`

`Models x1`

Answer

**If** you can add a user function to MySQL, then you could use Levenshtein's distance. See this other question for the code.

Then you can query `WHERE LEVENSHTEIN(description, 'iphone 5') <= 2`

for example. You would find "iphone 5S", and also "ipohne 5", which might be a plus.

Otherwise, *specific* cases are easy (e.g, `REGEX 'iphone.*'`

or similar), but the general case would be a nightmare to implement.

This version is modified to accept a third parameter, MAXCOST. When the maxcost for a levenshtein subloop is reached, the search function will simply abort. Excluding pathological cases, and keeping in mind that the actual cost might be slightly off (i.e. excluding a cost of 12 could include strings with cost 12, and exclude strings which would have cost 11), this cuts down on those cases where **the Levenshtein distance is unreasonable**. Those cases are probably of no interest to you.

So if you want strings with distance less than 3, you would use `WHERE levenshtein(string1, string2, 4) < 3`

. All strings with distance 3-4 or above will now return 4 *quickly*, and be excluded.

Test:

```
mysql> select BENCHMARK(1000,levenshtein('PIPPORIDICOLO','LUKASPERICOLO', 99));
+------------------------------------------------------------------+
| BENCHMARK(1000,levenshtein('PIPPORIDICOLO','LUKASPERICOLO', 99)) |
+------------------------------------------------------------------+
| 0 |
+------------------------------------------------------------------+
1 row in set (2.50 sec)
mysql> select BENCHMARK(1000,levenshtein('PIPPORIDICOLO','LUKASPERICOLO', 3));
+-----------------------------------------------------------------+
| BENCHMARK(1000,levenshtein('PIPPORIDICOLO','LUKASPERICOLO', 3)) |
+-----------------------------------------------------------------+
| 0 |
+-----------------------------------------------------------------+
1 row in set (0.08 sec)
```

Here the cost has decreased from 2500 to 80 milliseconds per 1000 evaluations.

This is the modified code. You will have to drop the old function before defining this new one, unless you change the name.

```
DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255), maxcost INTEGER)
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF c > maxcost THEN
RETURN maxcost;
END IF;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0;
ELSE
SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END$$
DELIMITER ;
```