I was testing some code that involved parsing XML. As a simple test I requested / on my localhost, and the response was the Apache2 default page.
So far, so good.
The response is XHTML and therefore XML, so I used it as input for my parsing (about 11 KB in size).
my $doc = XML::LibXML->load_xml(string => $response);
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
One possibility is that each time you parse the document, the parser downloads the DTD from the W3C. You can confirm this with strace or a similar tool, depending on your platform.
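A quick way to check this is to trace only the network-related system calls while your script runs; if the parser phones home for the DTD you will see a connect to a W3C address. The script name parse.pl here is just a placeholder for your own script:

```shell
# Trace network syscalls (connect, sendto, ...) made by the perl process
# and any children; DTD fetches show up as connect() calls.
strace -f -e trace=network perl parse.pl 2>&1 | grep -E 'connect|sendto'
```

On macOS or BSD, dtruss or ktrace would play the same role.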
The DTD contains (among other things) the named entity definitions, which map, for example, the string &nbsp; to the character U+00A0 (no-break space). So in order to parse HTML documents the parser does need the DTD; however, fetching it via HTTP on every parse is obviously a bad idea.
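To see why the DTD matters, here is a minimal sketch: &nbsp; is not one of XML's five predefined entities, so a strict XML parse of a document that uses it, without a DTD defining it, is a fatal error. The input string is made up for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

# &nbsp; is undefined in plain XML (only &amp; &lt; &gt; &apos; &quot;
# are predefined), so this parse fails without entity definitions.
my $xml = '<html><body>a&nbsp;b</body></html>';

my $ok  = eval { XML::LibXML->load_xml(string => $xml); 1 } ? 1 : 0;
my $err = $ok ? '' : $@;
print $ok ? "parsed ok\n" : "parse failed\n";
```

With the XHTML DOCTYPE present, the parser has to resolve the DTD to find that definition, which is where the HTTP fetch comes from.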
One approach is to install a copy of the DTD locally and use that. On Debian/Ubuntu systems you can simply install the w3c-dtd-xhtml package, which also sets up the appropriate XML catalog entries so that libxml can find it.
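Something along these lines (the catalog path is the Debian default; adjust for your distribution):

```shell
# Install the local XHTML DTDs plus the XML catalog entries
sudo apt-get install w3c-dtd-xhtml

# Verify that the public identifier now resolves to a local file
# rather than a w3.org URL
xmlcatalog /etc/xml/catalog "-//W3C//DTD XHTML 1.0 Transitional//EN"
```

If the second command prints a file: URI, libxml (and therefore XML::LibXML) will use the local copy and skip the network entirely.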
Another approach is to use XML::LibXML->load_html instead of XML::LibXML->load_xml. In HTML parsing mode the parser is more forgiving of markup errors, and I believe it also always uses a local, built-in copy of the entity definitions rather than fetching the DTD.
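For instance, the same fragment that fails under load_xml parses fine in HTML mode, because libxml's HTML parser knows the standard HTML named entities without consulting any DTD (the input string is made up for illustration):

```perl
use strict;
use warnings;
use XML::LibXML;

# HTML mode resolves &nbsp; from libxml's built-in entity table,
# so no DTD and no network access are needed; recover => 1 keeps
# the parser quiet about minor markup issues.
my $doc = XML::LibXML->load_html(
    string  => '<html><body>a&nbsp;b</body></html>',
    recover => 1,
);
my ($body) = $doc->findnodes('//body');
my $text   = $body->textContent;   # "a\x{A0}b"
```

The resulting document is a normal XML::LibXML::Document, so the rest of your XPath/DOM code is unchanged.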
The parser also provides input callbacks (see XML::LibXML::InputCallback) which allow you to supply your own handler routines for resolving external URIs, such as the DTD reference.
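A rough sketch of that approach, serving the DTD from a local cache instead of the network. The %cache mapping and the local file path are assumptions for illustration; point them at wherever your DTD copy actually lives:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical local cache: remote DTD URI => local file path.
my %cache = (
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
        => '/usr/share/xml/xhtml/schema/dtd/1.0/xhtml1-transitional.dtd',
);

my $cb = XML::LibXML::InputCallback->new();
$cb->register_callbacks([
    sub { exists $cache{ $_[0] } },       # match: do we handle this URI?
    sub {                                 # open: return a buffer "handle"
        open my $fh, '<', $cache{ $_[0] } or die "open: $!";
        local $/;
        my $data = <$fh>;
        \$data;
    },
    sub {                                 # read: consume up to $len bytes
        my ( $buf, $len ) = @_;
        substr $$buf, 0, $len, '';
    },
    sub { },                              # close: nothing to clean up
]);

my $parser = XML::LibXML->new();
$parser->input_callbacks($cb);
# $parser->load_xml(string => $response) now resolves the DTD locally.
```

The callbacks only fire when the parser actually tries to resolve a matching URI, so registering them is cheap.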