user7400738 user7400738 -4 years ago 183
Python Question

Parsing HTML form input tags with Beautiful Soup

I am trying to scrape a website. There is no problem when the data sits between a single opening and closing form tag. But when the data on the page is displayed next to a checked checkbox, it ends up in an awkward position in the markup. Has anybody had the same problem?

Here is a basic example of the page markup that contains the data I want:

<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_a:3486" class="forminput" id="ajaxField-76" checked="">
&nbsp;&nbsp;Airport
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_b:3486" checked="" class="forminput" id="ajaxField-77">
&nbsp;&nbsp;Bunkers
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_c:3486" class="forminput" id="ajaxField-78">
&nbsp;&nbsp;Containers
<div class="label"></div>
<input disabled="" type="checkbox" name="t_pow_ports:f_p_l:3486" class="forminput" id="ajaxField-79">
&nbsp;&nbsp;Cruise
<div class="label"></div>
....


I need to fetch the labels (Airport, Bunkers, etc.) whose input tag has the checked="" attribute.
1st problem: making sure I only get the checked values.
2nd problem: fetching the text that sits between the tags, as in:

<div>..</div><input...> data <div>...</div>


By using the following code:

import requests
import bs4
from bs4 import BeautifulSoup
import pandas

r = requests.get("http://directories.lloydslist.com/?p=1635")
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())
all = soup.find_all("div", {"id": "section-1785-body", "class": "sectionbody"})


I get the following format:

<div class="label"></div>
<input checked="" class="forminput" disabled="" id="ajaxField-115" name="t_pow_ports:f_p_a:5779" type="checkbox"/>
Airport
<div class="label"></div>
<input checked="" class="forminput" disabled="" id="ajaxField-116" name="t_pow_ports:f_p_b:5779" type="checkbox"/>
Bunkers
<div class="label"></div>
.....
....
<input checked="" class="forminput" disabled="" id="ajaxField-119" name="t_pow_ports:f_p_y:5779" type="checkbox"/> Dry Bulk
<div class="label"></div></div>


So if I use the following code:

abc = all[0].find_all("input", {"class":"forminput"},"checked")


I don't get the data I need:

<input class="forminput" disabled="" id="ajaxField-20" name="t_pow_ports:f_p_a:595" type="checkbox"/>,
<input class="forminput" disabled="" id="ajaxField-21" name="t_pow_ports:f_p_b:595" type="checkbox"/>,
<input class="forminput" disabled="" id="ajaxField-22" name="t_pow_ports:f_p_c:595" type="checkbox"/>,
....


Does anyone know a way around this problem?

Answer Source

You need to work with the NavigableString that is the next sibling of each checked input.

Try the following:

from bs4 import BeautifulSoup as Soup

html_str = """
<div>
    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_a:3486" class="forminput" id="ajaxField-76" checked=""/>
    &nbsp;&nbsp;Airport

    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_b:3486" checked="" class="forminput" id="ajaxField-77"/>
    &nbsp;&nbsp;Bunkers

    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_c:3486" class="forminput" id="ajaxField-78"/>
    &nbsp;&nbsp;Containers

    <div class="label"></div>
    <input disabled="" type="checkbox" name="t_pow_ports:f_p_l:3486" class="forminput" id="ajaxField-79"/>
    &nbsp;&nbsp;Cruise

    <div class="label"></div>
</div>
"""

soup = Soup(html_str, "html.parser")

forminput = soup.find_all("input", {"class":"forminput"})
for item in forminput:
    if item.get('checked') is not None:
        # next_sibling is the NavigableString after the input; strip() drops the surrounding whitespace and non-breaking spaces
        name = item.next_sibling.strip()
        print(name)

The output of this snippet is:

Airport
Bunkers
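
As a possible variation on the snippet above, the checked filter can be pushed into find_all itself: passing True as an attribute value asks Beautiful Soup to match any tag that has that attribute, whatever its value. The sketch below assumes the same markup as in the question; the helper name checked_labels is hypothetical and only used for illustration. It also checks that the next sibling really is a string before stripping it, in case an input is followed directly by another tag.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def checked_labels(html):
    """Return the text labels that follow checked 'forminput' checkboxes.

    Hypothetical helper for illustration; assumes each label is the text
    node placed directly after its <input> tag, as in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    # "checked": True matches any input that carries a checked attribute,
    # including the empty checked="" form used on this page.
    for box in soup.find_all("input", attrs={"class": "forminput", "checked": True}):
        sibling = box.next_sibling
        # Skip inputs that are followed by another tag or by nothing at all.
        if isinstance(sibling, NavigableString):
            text = sibling.strip()
            if text:
                labels.append(text)
    return labels
```

Calling checked_labels(html_str) with the html_str defined above should give the same two labels as the loop in the answer, returned as a list instead of printed.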