Mho Mho - 5 days ago 6
HTML Question

Python scrape value between static HTML tags containing static text

This is my first post on this forum, and I hope someone here can answer my basic question.

My requirement consists of two steps.


  1. In the first step, I need to extract the value "Paid Death Notice" from the HTML below, based on the span tags with classes c8 and c2. The "DOCUMENT-TYPE:" text is static and will always be present in my HTML.



<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>


Similarly, for the HTML below, I need to extract the value "Newspaper" based on "PUBLICATION-TYPE:", again using the span tags with classes c8 and c2.

<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>


Solution I have tried:

from bs4 import BeautifulSoup
import re

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>"""


soup = BeautifulSoup(data,'lxml')
doc=soup.find('span',class_='c8')
doctext=re.compile('<SPAN(.*DOCUMENT-TYPE: </SPAN><SPAN.*?)</SPAN>')
print(doctext.match(doc.text))


Result:

None


I expected to get only Paid Death Notice as the result.


  2. Similarly, there could be many HTML tags with the same DOCUMENT-TYPE: field that differ only in value. In that case, how should I loop over them, and on what condition?



<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>


Please help me resolve this issue.

Note: I have searched the web and tried many approaches but could not find the right solution, so I am posting here in the hope of getting one.

Answer
import re

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
           <SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
           <SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
           """
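# Capture the text of the c2 span that directly follows a DOCUMENT-TYPE label.
# The non-greedy (.*?) stops at the first </SPAN>, even if several pairs share a line.
pattern = r'<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">(.*?)</SPAN>'
# findall returns every match, so repeated DOCUMENT-TYPE fields are all collected.
print([a.strip("*") for a in re.findall(pattern, data)])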

Output:

['Paid Death Notice', 'Paid Notice: Deaths THORNTON, ROBERT']
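
As for why the attempt in the question prints None: `doc.text` holds only the text inside the first c8 span ("DOCUMENT-TYPE: "), with no tags in it, so a pattern that starts with `<SPAN` has nothing to match.

If you would rather not depend on the exact tag spelling, here is a rough sketch of the same idea using the standard library's `html.parser` instead of a regex. The names `LabelValueParser` and `extract_values` are made up for this example, and it assumes each c8 label span is followed by its c2 value span, as in your samples.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect the text of each c2 span that follows a c8 span holding `label`."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.values = []
        self._current_class = None  # class of the span being read, if any
        self._label_seen = False    # True right after a matching c8 label
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so SPAN/CLASS work too.
        if tag == "span":
            self._current_class = dict(attrs).get("class")
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside a span with a class attribute.
        if self._current_class is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._current_class is None:
            return
        text = "".join(self._buffer).strip()
        if self._current_class == "c8":
            self._label_seen = (text == self.label)
        elif self._current_class == "c2" and self._label_seen:
            self.values.append(text)
            self._label_seen = False
        self._current_class = None


def extract_values(html, label):
    """Return every value whose preceding c8 span text equals `label`."""
    parser = LabelValueParser(label)
    parser.feed(html)
    parser.close()
    return parser.values


data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>"""

print(extract_values(data, "DOCUMENT-TYPE:"))
print(extract_values(data, "PUBLICATION-TYPE:"))
```

Passing the label as an argument covers both steps: the same function handles PUBLICATION-TYPE:, and the returned list already contains every occurrence, so no extra loop condition is needed.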