Aliaksei Kuzmin - 1 month ago
Python Question

How to preserve the mutual arrangement of html <b> and <center> tags when lxml.html.fromstring is used

The lxml.html.fromstring function parses the combination of HTML <b> and <center> tags in a strange way:

lxml.html.tostring(lxml.html.fromstring("<b><center>hello</center></b>"))


gives:

<div><b></b><center>hello</center></div>

Please notice that <center>hello</center> was moved out of the <b></b> tags.

The question is: how can I preserve the layout and span of the pair of <b> tags so that they match the initial text?

FYI, if you swap the nesting order of the tags:

lxml.html.tostring(lxml.html.fromstring("<center><b>hello</b></center>"))


you'll have the correct result:
<center><b>hello</b></center>


I use Python 2.7.9 and lxml 3.4.2.

Answer

This happens because your original markup is not actually valid HTML.

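<center> is a block-level element, and <b> is an inline element. Inline elements cannot contain block-level elements, so lxml does its best to repair the markup into valid HTML: it closes <b> before <center> begins, which is why you get an empty <b></b> followed by <center>hello</center>.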

Note also that <center> has been deprecated since HTML 4; you really shouldn't be using it.
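
If you really need to keep the original nesting, one possible workaround is to avoid the HTML parser and parse the fragment as XML instead, since an XML parser does not apply HTML content rules. This is only a sketch and it assumes your input is well-formed XML (every tag closed, no HTML-only entities such as &nbsp;). The variable names are illustrative. The import fallback is there only so the snippet also runs without lxml installed, because lxml.etree and the standard library's ElementTree share this part of the API.

```python
try:
    from lxml import etree
except ImportError:
    # The standard library exposes the same fromstring/tostring calls.
    import xml.etree.ElementTree as etree

markup = "<b><center>hello</center></b>"

# Parse as XML: no HTML repair step runs, so the nesting is left untouched.
root = etree.fromstring(markup)

# Serialize the tree back to markup.
serialized = etree.tostring(root)

# Inspect the structure: <center> is still a child of <b>.
print(root.tag, root[0].tag, root[0].text)
print(serialized)
```

The trade-off is that real-world HTML is often not well-formed XML, in which case this approach raises a parse error instead of repairing the input.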
