santiagobasulto santiagobasulto - 1 year ago 85
Java Question

Java Regex performance

I'm trying to parse links with Regex with Java.

But, i think its getting to slow. For example. To extract all links from:

is spending 34642 milliseconds (34 seconds!!!)

Here is the regexp:

private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";

The flags for the pattern:

private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL |Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;

And the code may be something like this:

private void processURL(URL url){
URLConnection connection;
Pattern pattern = Pattern.compile(regexp, flags);
try {
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
String html = new String();
String line = bf.readLine();
line = bf.readLine();
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
}catch(Exception e){

Can you give me a Hint?

Extra Data:
Core 2 Duo
Single Threaded

Answer Source

Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!

Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking. In the worst case, it could even approach O(N^6) where N is the number of input characters. You could ease this a bit by replacing eager matching with lazy matching, but it is almost impossible to avoid pathological cases; e.g. when the input data is sufficiently malformed that the regex does not match.

A far, far better solution is to use some existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.

This page that lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.