santiagobasulto - 7 months ago
Java Question

Java Regex performance

I'm trying to parse links with a regex in Java.

But I think it's getting too slow. For example, extracting all the links from:



takes 34642 milliseconds (34 seconds!).

Here is the regexp:

private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";


The flags for the pattern:

private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;


And the code looks something like this:

private void processURL(URL url) {
    URLConnection connection;
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        String html = new String();
        String line = bf.readLine();
        while (line != null) {
            html += line;
            line = bf.readLine();
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
    } catch (Exception e) {
    }
}


Can you give me a hint?

Extra Data:
1 Mbit connection
Core 2 Duo
1 GB RAM
Single-threaded

Answer

Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!

Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking. In the worst case, it could even approach O(N^6) where N is the number of input characters. You could ease this a bit by replacing eager matching with lazy matching, but it is almost impossible to avoid pathological cases; e.g. when the input data is sufficiently malformed that the regex does not match.
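
As an aside (this is an illustration added here, not part of the original answer): one common way to ease the backtracking is to replace the unbounded .*? groups with negated character classes, so the engine never scans past the end of the current tag. The pattern below is only a sketch of that idea, uses a different (assumed) regex than the one in the question, and is still not a substitute for a real HTML parser.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // Illustrative variant: negated character classes ([^>], [^"], [^'])
    // keep each match confined to a single tag, which limits backtracking.
    // This pattern is an example, not the regex from the question.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a\">one</a>"
                    + "<a class='x' href='/b'>two</a>"
                    + "<a href=/c>three</a></p>";
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null per match.
            String href = m.group(1) != null ? m.group(1)
                        : m.group(2) != null ? m.group(2)
                        : m.group(3);
            System.out.println(href);
        }
    }
}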

A far, far better solution is to use some existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.

This page lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.
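
For example, a minimal sketch with HtmlCleaner might look like the following. This is an addition, not code from the original answer; it assumes HtmlCleaner is on the classpath, and the exact method names shown (clean, getElementsByName, getAttributeByName) may vary slightly between library versions.

import java.net.URL;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        // Parse the page into a tree instead of scanning it with a regex.
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(new URL("http://example.com/"));

        // Collect every <a> element anywhere in the document and print its href.
        for (TagNode anchor : root.getElementsByName("a", true)) {
            String href = anchor.getAttributeByName("href");
            if (href != null) {
                System.out.println(href);
            }
        }
    }
}

Unlike the regex approach, the parser makes a single pass over the input to build the tree, so there is no backtracking to blow up on malformed markup.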