Dave Dave - 6 months ago 31
Python Question

Getting article text using XPath while omitting some tags

I'm trying to extract only the (article) text using XPath.

I want all text that is a direct child of a node, plus the text of all its nested descendants, except for text inside the following nodes/tags:

<script>, <ul class="pager pagenav">, <style>

Example HTML to match using XPath:

<section class="entry-content">
want this article text
<script>dont want this</script>
more text i want
<p>want this text too</p>
<any>also this</any>
<style>dont want this either</style>
<ul class="pager pagenav">nope, dont want this <a>Prev Next</a></ul>
</section>

Currently, I have something like:

result = tree.xpath('//section[@class="entry-content"]/*[not(descendant-or-self::script or self::ul[@class="pager pagenav"] or self::style)]/../descendant-or-self::text()')

...but it doesn't quite work.


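In your expression, * matches only element children, and the /.. step climbs back up to the section, so descendant-or-self::text() then returns every text node under it, including the ones inside script and style. Use child::node() instead to match both element children and text-node children: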

child::node() selects all the children of the context node, whatever their node type

The self:: axis then filters out unwanted elements by name:

//section[@class="entry-content"]/child::node()[not(self::script or self::ul[@class="pager pagenav"] or self::style)]/descendant-or-self::text()
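
To try this out, here is a minimal sketch that assumes lxml is the library behind tree.xpath in the question. The names SAMPLE_HTML, CHILD_FILTER_XPATH, ANCESTOR_FILTER_XPATH and extract_text are made up for illustration. The first expression is the child::node() approach, with the ul restricted to the pager class as the question asks. The second is an alternative, not part of the answer above, that filters each text node by its ancestors and so also catches unwanted tags nested deeper than the section's direct children.

```python
from lxml import html  # third-party dependency: pip install lxml

# The sample markup from the question, with the closing </section> added.
SAMPLE_HTML = """<section class="entry-content">
want this article text
<script>dont want this</script>
more text i want
<p>want this text too</p>
<any>also this</any>
<style>dont want this either</style>
<ul class="pager pagenav">nope, dont want this <a>Prev Next</a></ul>
</section>"""

# Filter the section's direct children (elements and text nodes alike),
# then collect the text of whatever survives the filter.
CHILD_FILTER_XPATH = (
    '//section[@class="entry-content"]/child::node()'
    '[not(self::script or self::style or self::ul[@class="pager pagenav"])]'
    '/descendant-or-self::text()'
)

# Assumed alternative: select every text node under the section and drop
# those that have an unwanted element anywhere among their ancestors.
ANCESTOR_FILTER_XPATH = (
    '//section[@class="entry-content"]//text()'
    '[not(ancestor::script or ancestor::style'
    ' or ancestor::ul[@class="pager pagenav"])]'
)


def extract_text(html_source, xpath_expr):
    """Return the non-empty, whitespace-stripped text nodes matched by xpath_expr."""
    tree = html.fromstring(html_source)
    return [text.strip() for text in tree.xpath(xpath_expr) if text.strip()]


if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML, CHILD_FILTER_XPATH))
    print(extract_text(SAMPLE_HTML, ANCESTOR_FILTER_XPATH))
```

Note that the child::node() version only inspects direct children of the section, so a script nested inside, say, a p would still contribute its text. If that can happen in your pages, the ancestor-based expression is the safer choice.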