Matthew Ruston - 2 months ago
HTML Question

Regular Expression to Extract HTML Body Content

I've been playing around with RegExBuddy for over an hour trying to figure out what I thought would be a trivial RegEx. I am looking for a RegEx that will let me extract just the HTML content between the body tags of an XHTML document.

The XHTML that I need to parse will be very simple files; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.

Below is the expected structure of the HTML file that I have to parse. Since I know exactly what the HTML files I will be working with contain, this snippet pretty much covers my entire use case. If I can get a RegEx to extract the body of this example, I'll be happy.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
</title>
</head>
<body contenteditable="true">
<p>
Example paragraph content
</p>
<p>
&nbsp;
</p>
<p>
<br />
&nbsp;
</p>
<h1>Header 1</h1>
</body>
</html>


Conceptually, I've been trying to build a RegEx string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() function to obtain the body content. I thought the statement ((.|\n)*<body (.)*>)|((</body>(*|\n)*) would do the trick, but it doesn't seem to work at all with my test content in RegExBuddy.

Answer

Would this work?

((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)

Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:

((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)

On second thought, I am not sure why I needed a negative look-ahead. This should also work (for a well-formed XHTML document), provided . is allowed to match newlines (RegexOptions.Singleline in .NET):

(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
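As a rough illustration of the split approach, here is a minimal sketch in Python rather than C#, so treat it as an analogy and not as the .NET code itself. It assumes the simplified pattern above with two adjustments: the groups are made non-capturing, because Python's re.split() also returns captured groups, and the redundant backslash before > is dropped. re.DOTALL plays the role of RegexOptions.Singleline. The names NON_BODY and extract_body are hypothetical.

```python
import re

# Matches everything up to and including the opening body tag, or the
# closing body tag plus everything after it. DOTALL lets '.' cross newlines.
NON_BODY = re.compile(
    r"(?:.*<\s*body[^>]*>)|(?:<\s*/\s*body\s*>.+)",
    re.DOTALL,
)

def extract_body(xhtml: str) -> str:
    """Return the text between the body tags, or '' if no body tag is found."""
    # Splitting on the two non-body matches leaves
    # ['', body content, ''] for a document shaped like the example.
    parts = NON_BODY.split(xhtml)
    return parts[1].strip() if len(parts) >= 2 else ""

if __name__ == "__main__":
    doc = (
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head>\n<title>\n</title>\n</head>\n"
        '<body contenteditable="true">\n'
        "<p>\nExample paragraph content\n</p>\n"
        "<h1>Header 1</h1>\n"
        "</body>\n"
        "</html>\n"
    )
    print(extract_body(doc))
```

Note that the second alternative ends in .+, so the closing tag is only stripped when at least one character (a newline is enough) follows </body>. That holds for the example document, where </html> comes after it.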