Matthew Ruston - 1 year ago
HTML Question

Regular Expression to Extract HTML Body Content

I've been playing around with RegExBuddy for over an hour trying to figure out what I thought would be a trivial RegEx. I am looking for a RegEx that will let me extract just the HTML content between the body tags of an XHTML document.

The XHTML files that I need to parse will be very simple; I do not have to worry about JavaScript content or

tags, for example.

Below is the expected structure of the HTML file that I have to parse. Since I know exactly what the HTML files I have to work with will contain, this snippet pretty much covers my entire use case. If I can get a RegEx to extract the body of this example, I'll be happy.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
<html xmlns="">
<body contenteditable="true">
Example paragraph content
<br />
<h1>Header 1</h1>

Conceptually, I've been trying to build a RegEx string that matches everything BUT the inner body content. With this, I would use the C#
function to obtain the body content. I thought the statement
((.|\n)*<body (.)*>)|((</body>(*|\n)*)
would do the trick, but it doesn't seem to work at all with my test content in RegExBuddy.

Answer Source

Would this work?


Of course, you need to add the necessary \s to account for tags written with spaces, such as < body ...>, as in:
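For instance, here is a minimal sketch of tag patterns that tolerate such whitespace. It is written in Python purely for illustration (the question targets C#, where this pattern syntax should carry over), and the names `OPEN_BODY` and `CLOSE_BODY` are hypothetical, not part of the original answer.

```python
import re

# Hypothetical patterns: \s* allows optional whitespace after "<", around "/",
# and before ">", so "< body ...>" and "< / body >" are both recognised.
# \b stops the opening pattern from matching longer names such as <bodyguard>.
OPEN_BODY = re.compile(r"<\s*body\b[^>]*>", re.IGNORECASE)
CLOSE_BODY = re.compile(r"<\s*/\s*body\s*>", re.IGNORECASE)

for tag in ['<body>', '< body contenteditable="true">', '<BODY >']:
    assert OPEN_BODY.fullmatch(tag)

for tag in ['</body>', '< / body >', '</BODY>']:
    assert CLOSE_BODY.fullmatch(tag)

# An element whose name merely starts with "body" is not matched.
assert OPEN_BODY.fullmatch('<bodyguard>') is None
```

Note that `[^>]*` assumes no attribute value contains a literal `>`, which holds for the simple files described in the question.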


On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed XHTML document):
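A minimal sketch of that look-ahead-free approach follows. It is in Python for illustration only; in C# the same pattern would presumably be passed to `Regex.Replace` with `RegexOptions.Singleline | RegexOptions.IgnoreCase`. The pattern itself and the names `OUTSIDE_BODY` and `extract_body` are assumptions, not necessarily the expression intended above. The pattern matches everything up to and including the opening body tag, or everything from the closing body tag onward, so replacing the matches with an empty string leaves only the inner content.

```python
import re

# Hypothetical pattern. Alternative 1 lazily consumes everything from the start
# of the document up to and including the first opening <body ...> tag.
# Alternative 2 consumes the closing </body> tag and everything after it.
# re.DOTALL lets "." span newlines; without re.MULTILINE, "^" and "$" anchor
# only at the start and end of the whole string.
OUTSIDE_BODY = re.compile(
    r"^.*?<\s*body\b[^>]*>|<\s*/\s*body\s*>.*$",
    re.DOTALL | re.IGNORECASE,
)

def extract_body(xhtml: str) -> str:
    """Remove everything outside the body tags and trim surrounding whitespace."""
    return OUTSIDE_BODY.sub("", xhtml).strip()

# Made-up sample document modelled on the structure in the question.
sample = """<html>
<head></head>
<body contenteditable="true">
<p>Example paragraph content</p>
<h1>Header 1</h1>
</body>
</html>"""

print(extract_body(sample))
```

If the input contains no body tags at all, neither alternative matches and the text comes back unchanged apart from the trimming.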