Joel - 7 months ago 46

Java Question

I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.

What I would like to do is:

Given an input string - **INSTR**:

"BCDEFGH"

And a set of candidate strings -

"AB", "CDE", "FG", "H", "IJ"

In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")

There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.

I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.

Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.

I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation.

Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window.

For the Aho Corsick I used this

In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc...

Normalised times to complete were:

*Some resources describing the algos mentioned in the answers below:

http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html

http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/index.html*

Answer

Your best options are the Aho-Corasick algorithm and the Rabin-Karp algorithm. Because I am no Java developer I cannot tell if there are any out-of-the-box framework functions or libraries.

Just to add - if your input is not that large, you don't want to repeat the search many times and you do not have many patterns, it might be even a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times so you can judge the options.

Implementations:

- https://hkn.eecs.berkeley.edu/~dyoo/java/index.html
- http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html
- https://github.com/hankcs/AhoCorasickDoubleArrayTrie
- https://github.com/RokLenarcic/AhoCorasick
- https://github.com/robert-bor/aho-corasick

Presentations:

Source (Stackoverflow)