Matt Matt - 1 month ago 10
SQL Question

Best way to index a SQL table to find best matching string

Let's say I have a SQL table with an

int
PK column and an
nvarchar(max)
. In the the
nvarchar(max)
column, I have a bunch of table entries that are all like this:

SOME_PEOPLE_LIKE_APPLES
SOME_PEOPLE_LIKE_APPLES_ON_TUESDAY
SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON
SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_CAFE
SOME_PEOPLE_LIKE_APPLES_ON_THE_RIVER
.
.
.
SOME_ANTS_HATE_SYRUP
SOME_ANTS_HATE_SYRUP_WITH_STRAWBERRIES


There's millions of these rows - Then let's say my goal is to find the row with the most overlap for an input
searchTerm
- So in this case, if I input
SOME PEOPLE_LIKE_APPLES_ON_THE_MOON_MOUNTAIN
, the returned entry would be the third entry from the table above,
SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON


I have a SPROC that does this very naively, it goes through the entire table as follows:

SELECT DISTINCT phrase, len(phrase) l, [id] FROM X WHERE searchTerm LIKE phrase + '%'

-- phrase is the row entry being searched against
-- searchTerm is the phrase we're searching for


I then
ORDER BY
length and pick the
TOP
only

Would there be a way to speed this up, perhaps by doing some indexing?

If this is confusing, think of it as
tableRowEntry + wildcard = searchTerm


I'm on MSSQL 2008 if that makes any difference

Answer

If there is an index on your NVARCHAR-column a LIKE 'Something%' -search will be able to use it and should be pretty fast.

If there is a wildcard in the beginning you are out of luck. But - in your case - this should work.

You might use an indexed persistant computed column storing the length of the string. In this case you might reduce the workload enormously by filtering out all string which are to short or to long.

If there are certain words in your search terms which appear often but not everywhere, you might use side columns again and filter like AND InlcudePEOPLE=1 AND IncludeMOON=1

UPDATE

Here is an example

CREATE TABLE Phrase(ID INT IDENTITY
                   ,Phrase NVARCHAR(100)
                   ,PhraseLength AS LEN(Phrase) PERSISTED);
CREATE INDEX IX_Phrase_Phrase ON Phrase(Phrase);
CREATE INDEX IX_Phrase_PhraseLength ON Phrase(PhraseLength);

INSERT INTO Phrase
VALUES
 ('SOME_PEOPLE_LIKE_APPLES')
,('SOME_PEOPLE_LIKE_APPLES_ON_TUESDAY')
,('SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON')
,('SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_CAFE')
,('SOME_PEOPLE_LIKE_APPLES_ON_THE_RIVER')
,('SOME_ANTS_HATE_SYRUP')
,('SOME_ANTS_HATE_SYRUP_WITH_STRAWBERRIES');

DECLARE @SearchTerm NVARCHAR(100)=N'SOME_PEOPLE_LIKE_APPLES_ON_THE_MOON_MOUNTAIN';

--This uses the index (checked against execution plan)

SELECT TOP 1 * 
FROM Phrase 
WHERE @SearchTerm LIKE Phrase + '%'
ORDER BY PhraseLength DESC;

--This might be even better, check with your high row count.

SELECT TOP 1 *
FROM Phrase
WHERE Phrase=LEFT(@SearchTerm,PhraseLength)
ORDER BY PhraseLength DESC;
GO

--Clean-Up

DROP TABLE Phrase;
Comments