emma perkins - 5 months ago
Python Question

Python: pulling bold text and the text that follows

Using the HTML below, I would like to pull out two pieces of data and add them to a list in Python. Each bold text is a horse name, and the text that follows it is the comment on that horse.



<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.
<br>
<br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh.
She saw it out well and it´ll be interesting to see how she copes with a rise.
<br>
<br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.
<br>
<br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.
<br>
<br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]
<br>
<br>
<div id="resultRaceReport" class="hide"></div>
</div>





From the above HTML, I would like the output to look like the following:


[LADY MAKFI, showed vastly improved form to shed her maiden tag on
this seasonal debut for a new yard. The filly offered little for Tony
Martin last year, but did show some ability on her debut and is
evidently capable when fresh. She saw it out well and it´ll be
interesting to see how she copes with a rise.]

[Weardiditallgorong, went down fighting over this longer trip and
probably improved again on her last-time-out second at Bath. This was
her best effort yet on the AW.]

[Chauvelin, in second-time blinkers, turned in his most encouraging
effort for some time and is certainly well treated on his best form.]

[Happy Jack, not for the first time travelled easily until making
heavy weather of it when asked for his effort. [David Orton]]


But I'm just not sure how to get the desired output (mostly the logic behind it).

I currently use lxml to scrape content, and I need to match the bold text (the horse's name) against my table so I can add the comments (the text after the bold) to my database.

Answer

Using lxml:

h = """<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.<br><br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it´ll be interesting to see how she copes with a rise.<br><br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.<br><br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.<br><br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]<br><br> <div id="resultRaceReport" class="hide"></div></div>"""

from lxml import html

x = html.fromstring(h)

div = x.xpath("//*[@id='ANALYSIS']")[0]

# find bold tags by class name
for b in div.xpath(".//b[@class='black']"):
    # get bold text
    print(b.text)
    # get the first text node after the bold tag, i.e. the text up to the next tag (xpath returns a list)
    print(b.xpath("./following::text()[1]"))

This would give you:

LADY MAKFI
[u' showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.']
Weardiditallgorong
[' went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.']
Chauvelin
[', in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.']
Happy Jack
[' not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
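
In lxml, the text that sits directly after an element's closing tag is also exposed as that element's tail attribute, so the same pairs can be collected without a second XPath call. The sketch below assumes a shortened stand-in for the page markup (the snippet string and the extract_pairs name are made up for illustration). One difference worth knowing: if a bold tag is immediately followed by another tag, tail is None, whereas following::text()[1] would return some later text node.

```python
from lxml import html

# Shortened stand-in for the real page markup (an assumption for illustration)
snippet = (
    '<div id="ANALYSIS">A weak handicap.<br><br> '
    '<b class="black">LADY MAKFI</b> showed vastly improved form.<br><br> '
    '<b class="black">Chauvelin</b>, in second-time blinkers, ran well.<br><br>'
    '</div>'
)


def extract_pairs(markup):
    """Return a list of (horse name, comment) tuples from the ANALYSIS div."""
    root = html.fromstring(markup)
    pairs = []
    for b in root.xpath("//*[@id='ANALYSIS']//b[@class='black']"):
        # b.tail is the text between </b> and the next tag, or None if there is none
        comment = (b.tail or "").strip()
        pairs.append((b.text, comment))
    return pairs


for name, comment in extract_pairs(snippet):
    print(name, "|", comment)
```

Keeping the name and comment as separate tuple items, rather than joining them into one string, makes it easier to use the name as a lookup key later.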

If you want it all in a single list exactly as posted:

from lxml import html

x = html.fromstring(h)
div = x.xpath("//*[@id='ANALYSIS']")[0]
out = [
    # join name and comment with a comma, dropping a comma the comment may already start with
    b.text + "," + b.xpath("./following::text()[1]")[0].lstrip(",")
    for b in div.xpath(".//b[@class='black']")
]

Which gives you:

[u'LADY MAKFI, showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.',
 'Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.',
 'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.',
 'Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
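
Since the goal is to match each horse name against a database table, a dictionary keyed by name may be more convenient than a flat list of strings. The following is only a sketch under assumptions: the sample markup is a shortened stand-in, and comments_by_horse and table_names are hypothetical names standing in for whatever the real table holds. Names are case-folded because the winner appears in upper case in the markup while the other runners do not.

```python
from lxml import html


def comments_by_horse(markup):
    """Map case-folded horse name -> comment text from the ANALYSIS div."""
    root = html.fromstring(markup)
    comments = {}
    for b in root.xpath("//*[@id='ANALYSIS']//b[@class='black']"):
        name = (b.text or "").strip()
        # Drop leading commas/spaces so every comment starts with a word
        comment = (b.tail or "").lstrip(", ").strip()
        comments[name.casefold()] = comment
    return comments


# Shortened stand-in for the real page markup (an assumption for illustration)
sample = (
    '<div id="ANALYSIS">A weak handicap.<br><br> '
    '<b class="black">LADY MAKFI</b> showed vastly improved form.<br><br> '
    '<b class="black">Chauvelin</b>, in second-time blinkers, ran well.<br><br>'
    '</div>'
)

# Hypothetical names as they might be stored in the database table
table_names = ["Lady Makfi", "Chauvelin", "Happy Jack"]

lookup = comments_by_horse(sample)
for stored in table_names:
    # .get returns None when the page has no comment for that horse
    print(stored, "->", lookup.get(stored.casefold()))
```

If two runners could ever share a name after case-folding, the later one would overwrite the earlier one, so a real import would probably want to key on something more specific than the name alone.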