Peter Ream - 4 months ago
Java Question

Regular expression in a Java HtmlUnit example

I am trying to advance my knowledge of Java by automating webpage scraping and form input. I have experimented with jsoup and now HtmlUnit. I found an HtmlUnit example that I am trying to run.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class GoogleHtmlUnitTest {
    static final WebClient browser;

    static {
        browser = new WebClient();
        browser.getOptions().setJavaScriptEnabled(false);
        // browser.setJavaScriptEnabled(false);
    }

    public static void main(String[] arguments) {
        boolean result;
        try {
            result = searchTest();
        } catch (Exception e) {
            e.printStackTrace();
            result = false;
        }

        System.out.println("Test " + (result ? "passed." : "failed."));
        if (!result) {
            System.exit(1);
        }
    }

    private static boolean searchTest() {
        HtmlPage currentPage;

        try {
            currentPage = (HtmlPage) browser.getPage("http://www.google.com");
        } catch (Exception e) {
            System.out.println("Could not open browser window");
            e.printStackTrace();
            return false;
        }
        System.out.println("Simulated browser opened.");

        try {
            // Fill in the search box (name "q"), then click the search button (name "btnG")
            ((HtmlTextInput) currentPage.getElementByName("q")).setValueAttribute("qa automation");
            currentPage = currentPage.getElementByName("btnG").click();
            System.out.println("contents: " + currentPage.asText());
            // The test passes if the result page text contains a match for the regex
            return containsPattern(currentPage.asText(), "About .* results");
        } catch (Exception e) {
            System.out.println("Could not search");
            e.printStackTrace();
            return false;
        }
    }

    public static boolean containsPattern(String string, String regex) {
        Pattern pattern = Pattern.compile(regex);

        // Check for the existence of the pattern anywhere in the string
        Matcher matcher = pattern.matcher(string);
        return matcher.find();
    }
}


It runs, but HtmlUnit logs some errors. Advice I found on Stack Overflow says these can be ignored, and since the program produces the correct result, I am taking that advice.

Jul 31, 2016 7:29:03 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'https://www.google.com/search?q=qa+automation&sa=G&gbv=1&sei=_eCdV63VGMjSmwHa85kg' [1:1467] Error in declaration. '*' is not allowed as first char of a property.


My problem at the moment is the regular expression used for the search. If I understand this correctly, “qa automation” is googled and the retrieved page is then searched by:

return containsPattern(currentPage.asText(), "About .* results");

What is throwing me is “About .* results”. This is the regex, but I don't understand how it is interpreted. What is being searched for on the retrieved page?

Answer

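.* means "zero or more of any character"; in other words, a complete wildcard (although by default . does not match line breaks). Because containsPattern calls find(), the page text only has to contain a matching stretch somewhere; the whole page does not have to match. It can match any of these: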

About 28 results
About 2864 results
About 2,864 results
About ERROR results
About  results
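
To see this for yourself, here is a minimal sketch that runs the same containsPattern logic against a few hand-written strings instead of a live Google page. The class name WildcardDemo, the sample strings, and the stricter containsResultCount helper are my own illustrations and are not part of the original example; the stricter pattern is just one possible way to accept only a number between the two words.

```java
import java.util.regex.Pattern;

class WildcardDemo {
    // Same idea as containsPattern in the question:
    // true if the regex matches anywhere inside the text.
    static boolean containsPattern(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Hypothetical stricter check (not in the original code):
    // only digits and commas are allowed between "About " and " results".
    static boolean containsResultCount(String text) {
        return containsPattern(text, "About [\\d,]+ results");
    }

    public static void main(String[] args) {
        String[] samples = {
            "About 2,864 results",
            "About ERROR results",
            "About  results",                                 // empty wildcard, two spaces
            "Aboutresults",                                   // the literal spaces are missing
            "page header\nAbout 28 results (0.45 seconds)",   // match in the middle of the text
            "About 28\n results"                              // line break inside the wildcard part
        };
        for (String s : samples) {
            System.out.println("[" + s.replace("\n", "\\n") + "]"
                + " wildcard=" + containsPattern(s, "About .* results")
                + " strict=" + containsResultCount(s));
        }
    }
}
```

The wildcard pattern accepts the first three samples and the multi-line one, since find() looks for a match anywhere in the text. It rejects "Aboutresults" because the spaces in the pattern are literal, and it rejects the last sample because . does not cross a line break by default. The stricter pattern accepts only the samples where a number sits between the two words.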