I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.
What I would like to do is:
Given an input string - INSTR:
And a set of candidate strings - CAND
Find any CAND strings that match as substrings within INSTR
"AB", "CDE", "FG", "H", "IJ"
In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")
There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.
I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.
Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.
I should have mentioned, the set of CAND strings is invariant across all values of INSTR.
I only need to know that there was a match - and i don't need to know what the match was.
I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation.
Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window.
For the Aho Corsick I used this
In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc...
Normalised times to complete were:
(check each pattern & use string.contains): 50
*Some resources describing the algos mentioned in the answers below: