Joel Joel - 13 days ago 9
Java Question

Fast algorithm for searching for substrings in a string

I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.

What I would like to do is:

Given an input string - INSTR:


"BCDEFGH"


And a set of candidate strings - CAND:


"AB", "CDE", "FG", "H", "IJ"


Find any CAND strings that match as substrings within INSTR

In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")

There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.

I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.

Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.




Update I should have mentioned, the set of CAND strings is invariant across all values of INSTR.

Update I only need to know that there was a match - and i don't need to know what the match was.

Final Update
I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation.
Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window.
For the Aho Corsick I used this

In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc...
Normalised times to complete were:

AhoCorsick: 1

RabinKarp: 1.8

Naive Search (check each pattern & use string.contains): 50




*Some resources describing the algos mentioned in the answers below:

http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html

http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/index.html*

Answer

Your best options are the Aho-Corasick algorithm and the Rabin-Karp algorithm. Because I am no Java developer I cannot tell if there are any out-of-the-box framework functions or libraries.

Just to add - if your input is not that large, you don't want to repeat the search many times and you do not have many patterns, it might be even a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times so you can judge the options.

Implementations:

Presentations: