fumarat - 1 year ago
HTML Question

Extracting specific information from fetched HTML code using Python

I'm relatively new to Python. I need some advice for a bioinformatics project. It's about converting certain enzyme IDs to others.

What I've already done, and what works, is fetching the HTML code for a list of IDs from the Rhea database:

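import pycurl

url2 = "http://www.rhea-db.org/reaction?id=16952"

# pycurl hands over bytes, so binary mode is the safe choice for the temp file
with open("xml_tempfile2.txt", "wb") as f_xml2:
    fetch2 = pycurl.Curl()
    fetch2.setopt(fetch2.URL, url2)
    fetch2.setopt(fetch2.WRITEDATA, f_xml2)  # write the response body to the file
    fetch2.perform()
    fetch2.close()  # close() must be called to release the handle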


So the HTML code is saved to a temporary txt file (I know, possibly not the most elegant way to do stuff, but it works for me ;).

Now what I am interested in is the following part from the HTML:

<p>
<h3>Same participants, different directions</h3>
<div>
<a href="./reaction?id=16949"><span>RHEA:16949</span></a>
<span class="icon-question">myo-inositol + NAD(+) &lt;?&gt; scyllo-inosose + H(+) + NADH</span>
</div><div>
<a href="./reaction?id=16950"><span>RHEA:16950</span></a>
<span class="icon-arrow-right">myo-inositol + NAD(+) =&gt; scyllo-inosose + H(+) + NADH</span>
</div><div>
<a href="./reaction?id=16951"><span>RHEA:16951</span></a>
<span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH =&gt; myo-inositol + NAD(+)</span>
</div>
</p>


I want to go through the code until the class "icon-arrow-right" is reached (this class name is unique in the HTML). Then I want to extract the "RHEA:XXXXXX" ID from the line above it. So in this example, I want to end up with 16950.

Is there a simple way to do this? I've already experimented with HTMLParser but couldn't get it to look for a certain class and then give me the ID from the line above.

Thank you very much in advance!

Answer

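You can use an HTML parser like BeautifulSoup to do this. Find the span by its class, then step back to its previous sibling, which is the <a> element holding the ID: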

>>> from bs4 import BeautifulSoup
>>> html = """ <p>
...             <h3>Same participants, different directions</h3>
...             <div>
...                 <a href="./reaction?id=16949"><span>RHEA:16949</span></a>
...                 <span class="icon-question">myo-inositol + NAD(+) &lt;?&gt; scyllo-inosose + H(+) + NADH</span>
...             </div><div>
...                 <a href="./reaction?id=16950"><span>RHEA:16950</span></a>
...                 <span class="icon-arrow-right">myo-inositol + NAD(+) =&gt; scyllo-inosose + H(+) + NADH</span>
...             </div><div>
...                 <a href="./reaction?id=16951"><span>RHEA:16951</span></a>
...                 <span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH =&gt; myo-inositol + NAD(+)</span>
...             </div>
...         </p>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('span', class_='icon-arrow-right').find_previous_sibling().get_text()
'RHEA:16950'
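
If only the numeric part is needed, or if installing BeautifulSoup is not an option, the same lookup can probably be done with the standard library's html.parser.HTMLParser, which the question mentions. The following is a minimal sketch, not a definitive implementation. The names RheaDirectionParser and find_rhea_id are made up for illustration, and the code assumes that every reaction link has an href ending in "id=<number>", as in the snippet from the question.

```python
from html.parser import HTMLParser


class RheaDirectionParser(HTMLParser):
    """Hypothetical helper: remember the latest reaction link and
    report its ID once a span with the target class is seen."""

    def __init__(self, target_class="icon-arrow-right"):
        super().__init__()
        self.target_class = target_class
        self.last_id = None   # ID from the most recent <a href="...reaction?id=N">
        self.found_id = None  # ID of the link that preceded the target span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href") or ""
            # Assumption: the ID is whatever follows the last "=" in the href
            if "reaction?id=" in href:
                self.last_id = href.rsplit("=", 1)[1]
        elif tag == "span" and self.found_id is None:
            classes = (attrs.get("class") or "").split()
            if self.target_class in classes:
                self.found_id = self.last_id


def find_rhea_id(html, target_class="icon-arrow-right"):
    """Return the reaction ID (as a string) linked just before the
    span with target_class, or None if no such span exists."""
    parser = RheaDirectionParser(target_class)
    parser.feed(html)
    return parser.found_id


# The fragment from the question, used here as test input
sample = """
<div>
<a href="./reaction?id=16949"><span>RHEA:16949</span></a>
<span class="icon-question">myo-inositol + NAD(+) &lt;?&gt; scyllo-inosose + H(+) + NADH</span>
</div><div>
<a href="./reaction?id=16950"><span>RHEA:16950</span></a>
<span class="icon-arrow-right">myo-inositol + NAD(+) =&gt; scyllo-inosose + H(+) + NADH</span>
</div><div>
<a href="./reaction?id=16951"><span>RHEA:16951</span></a>
<span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH =&gt; myo-inositol + NAD(+)</span>
</div>
"""

print(find_rhea_id(sample))  # prints 16950
```

The same number can also be taken from the BeautifulSoup result above: 'RHEA:16950'.split(':')[1] gives '16950'.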