santiagobasulto santiagobasulto - 4 months ago 18
Java Question

Java Regex performance

I'm trying to parse links with Regex with Java.

But, i think its getting to slow. For example. To extract all links from:



is spending 34642 milliseconds (34 seconds!!!)

Here is the regexp:

private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";


The flags for the pattern:

private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL |Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;


And the code may be something like this:

private void processURL(URL url){
URLConnection connection;
Pattern pattern = Pattern.compile(regexp, flags);
try {
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
String html = new String();
String line = bf.readLine();
while(line!=null){
html+=line;
line = bf.readLine();
}
bf.close();
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
System.out.println(matcher.group(2));
}
}catch(Exception e){
}
}


Can you give me a Hint?

Extra Data:
1Mbit
Core 2 Duo
1Gb RAM
Single Threaded

Answer

Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!

Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking. In the worst case, it could even approach O(N^6) where N is the number of input characters. You could ease this a bit by replacing eager matching with lazy matching, but it is almost impossible to avoid pathological cases; e.g. when the input data is sufficiently malformed that the regex does not match.

A far, far better solution is to use some existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.

This page that lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.

Comments