masa masa - 1 month ago 15
Javascript Question

Getting access to the original HTML in HtmlUnit HtmlElement?

I am using HtmlUnit to read content from a web site.

Everything works perfectly to the point where I am reading the content with:

HtmlDivision div = page.getHtmlElementById("my-id");


Even div.asText() returns the expected String object, but I want to get the original HTML inside <div>...</div> as a String object. How can I do that?

I am not willing to replace HtmlUnit with something else, as the web site expects the client to run JavaScript, and HtmlUnit seems to be capable of doing what is required.

Answer

If by original HTML you mean the HTML code as HtmlUnit has already parsed and formatted it, then you can use div.asXml(). If you really are looking for the exact HTML the server sent you for that element, you won't find a direct way to get it (at least up to v2.14).

As a workaround, you could get the whole text of the page as the server sent it, using the approach in this answer: How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?
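
Once you have the raw page text, you still have to cut the element out of it yourself. Below is a minimal sketch of one way to do that with plain Java, assuming the raw source was obtained with something like page.getWebResponse().getContentAsString(). The class RawDivExtractor and the method innerHtmlById are hypothetical names made up for this illustration and are not part of HtmlUnit. It is a naive tag scan, not an HTML parser, so it will be fooled by "<div" appearing inside comments, scripts, or attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of HtmlUnit): pulls the untouched markup
// of a <div> out of the raw page source by its id.
public class RawDivExtractor {

    /**
     * Returns the text between the opening <div ... id="id"> tag and its
     * matching </div>, or null if the element or its closing tag is missing.
     */
    public static String innerHtmlById(String rawHtml, String id) {
        // Locate the opening <div> tag that carries the requested id.
        Pattern open = Pattern.compile(
                "<div\\b[^>]*\\sid\\s*=\\s*[\"']" + Pattern.quote(id) + "[\"'][^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = open.matcher(rawHtml);
        if (!m.find()) {
            return null;
        }
        int start = m.end();

        // Walk the following <div> and </div> tags, tracking nesting depth,
        // until the tag that closes the element we started in.
        Matcher tags = Pattern.compile("<(/?)div\\b[^>]*>", Pattern.CASE_INSENSITIVE)
                .matcher(rawHtml);
        int depth = 1;
        int pos = start;
        while (tags.find(pos)) {
            depth += tags.group(1).isEmpty() ? 1 : -1;
            if (depth == 0) {
                return rawHtml.substring(start, tags.start());
            }
            pos = tags.end();
        }
        return null; // unbalanced markup: no matching </div>
    }

    public static void main(String[] args) {
        // In real use, the raw source would come from the server response,
        // e.g. page.getWebResponse().getContentAsString() (assumption).
        String raw = "<html><body><div id=\"my-id\"><p>Hi <b>there</b></p>"
                + "<div>inner</div></div></body></html>";
        System.out.println(innerHtmlById(raw, "my-id"));
    }
}
```

Keep in mind that the raw source is the page before any JavaScript has run, so content that scripts add to the div later will not be in it.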

As a side note, you should probably think twice about why you need the HTML code. HtmlUnit lets you extract the data from the page, so there shouldn't be any need to store the source code itself rather than the information contained in it. Just my 2 cents.