Nishant Nishant - 4 months ago 7
Python Question

lxml is not detecting empty div as expected

For the input below, lxml modifies the div as if it understands that a div can't be inside a p.

Can anyone tell me how to get just the <div></div> for this type of input? I want to correct the input HTML.

Do I need to switch to BeautifulSoup?

from lxml import etree

html_string = """
<html>
<head>
<title></title>
</head>
<body>
<p align="center">
<div></div>
This line should be centered.
</p>
<table>
<tbody>
<tr>
<td>
<div></div>
</td>
</tr>
</tbody>
</table>
</body>
</html>
"""

html_element = etree.fromstring(html_string)

page_break_elements = html_element.xpath("//div")

(Pdb) etree.tostring(html_element[1][0][0])
b'<div/>\n This line should be centered.\n '


I just want the element below on its own, so that I can move it around.

<div></div>
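The text after <div/> in the pdb output above looks like the element's tail, which etree.tostring includes by default. As a minimal sketch on a reduced version of the markup (the names snippet, with_tail, without_tail and html_style are made up for illustration), the element can be serialized alone like this:

```python
from lxml import etree

# Reduced version of the question's markup: a div followed by text inside a p.
snippet = "<p><div></div>\nThis line should be centered.\n</p>"
p = etree.fromstring(snippet)
div = p[0]

# Default serialization appends the element's tail text.
with_tail = etree.tostring(div)

# with_tail=False serializes the element alone.
without_tail = etree.tostring(div, with_tail=False)

# method="html" writes an explicit closing tag instead of a self-closing one.
html_style = etree.tostring(div, method="html", with_tail=False)

print(with_tail)
print(without_tail)
print(html_style)
```

Here the div is still a child of the p in the parsed tree; only the serialized output changes.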


For anyone curious, these are page-break divs used for PDF generation, such as <div style="page-break-after:always"></div>. I get the input from TinyMCE, which doesn't position them correctly, so I am trying to move them up to the body element.

Output Desired

from lxml import etree

html_string = """
<html>
<head>
<title></title>
</head>
<body>
<div></div>
<p align="center">
This line should be centered.
</p>
<div></div>
<table>
<tbody>
<tr>
<td>
</td>
</tr>
</tbody>
</table>
</body>
</html>
"""

Answer

You can use lxml's soupparser (it needs BeautifulSoup installed) and still process the data with XPath and the rest of the lxml API:

from lxml.html.soupparser import fromstring

html_element = fromstring(html_string)

That keeps the <div></div> inside the p.
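With the div kept where it was written, moving it up to body could look like the sketch below. To stay self-contained it parses with etree.fromstring rather than soupparser, which is enough for this sample because the sample is well-formed XML. hoist_to_body is a hypothetical helper, not part of lxml, and it assumes every div it receives is a descendant of body.

```python
from lxml import etree

html_string = """
<html>
<head><title></title></head>
<body>
<p align="center">
<div></div>
This line should be centered.
</p>
<table><tbody><tr><td>
<div></div>
</td></tr></tbody></table>
</body>
</html>
"""


def hoist_to_body(div, body):
    """Move div so it sits directly under body, just before its top-level ancestor."""
    parent = div.getparent()
    if parent is body:
        return  # already a direct child of body

    # lxml moves an element's tail along with the element, so hand the tail
    # text over to the preceding sibling (or the parent) before moving.
    tail = div.tail or ""
    previous = div.getprevious()
    if previous is not None:
        previous.tail = (previous.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    div.tail = None

    # Walk up to the ancestor that is a direct child of body.
    top = parent
    while top.getparent() is not body:
        top = top.getparent()

    # addprevious() re-parents the div as a sibling right before that ancestor.
    top.addprevious(div)


root = etree.fromstring(html_string)
body = root.find("body")

# xpath() returns a plain list, so the tree can be modified while iterating.
for div in root.xpath("//body//div"):
    hoist_to_body(div, body)

print(etree.tostring(root, method="html").decode())
```

The tail handling is the part that is easy to miss: without it, "This line should be centered." would travel out of the p together with the div.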
