HelloToEarth - 9 months ago
Python Question

Finding and storing children of roots in Beautiful Soup

I'm trying to find and store the children <orgname> from the parent <assignee>. My code so far runs through the XML document and already picks up certain other tags. I have set it up as such:

for xml_string in separated_xml(infile): # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml") # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference") # Beginning parsing at every instance of a publication

    lst = [] # Creating empty list to append into

    with open('./output.csv', 'ab') as f:
        writer = csv.writer(f, dialect = 'excel')

        for info in pub_ref: # Looping over all instances of publication

            # The final loop finds every instance of invention name, patent number, date, and country to print and append

            for inv_name, pat_num, date_num, country, city, state in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), assign.find("orgname"), soup.findAll("date"), soup.findAll("country"), soup.findAll("city"), soup.findAll("state")):

                writer.writerow([inv_name.text, pat_num.text, org_name.text, date_num.text, country.text, city.text, state.text])


I already have this in sequence so that each invention name pairs with its patent number, and I need the assignee organization name along with it. The problem is that <orgname> also appears under other tags, for things like attorneys, and looks like this:

    <agent sequence="01" rep-type="attorney">
      <addressbook>
        <orgname>Sawyer Law Group LLP</orgname>
        <address>
          <country>unknown</country>
        </address>
      </addressbook>
    </agent>
  </agents>
</parties>
<assignees>
  <assignee>
    <addressbook>
      <orgname>International Business Machines Corporation</orgname>
      <role>02</role>
      <address>
        <city>Armonk</city>
        <state>NY</state>
        <country>US</country>
      </address>
    </addressbook>
  </assignee>
</assignees>


I only want the orgname under the <assignee> tag. I've tried:

assign = soup.findAll("assignee")
org_name = assign.findAll("orgname")

But to no avail. It simply shoots out:


"ResultSet object has no attribute '%s'. You're probably treating a
list of items like a single item. Did you call find_all() when you
meant to call find()?" % key

AttributeError: ResultSet object has no attribute 'find'. You're
probably treating a list of items like a single item. Did you call
find_all() when you meant to call find()?


How can I add these tags and find all the orgname under assignee tags?
It seems simple but I can't get it.

Thanks in advance.

Answer Source

assign = soup.findAll("assignee") returns a ResultSet (a list of tags), not a single tag, which is why calling org_name = assign.findAll("orgname") fails. You have to go through each element of assign and call its .findAll("orgname"). Since there seems to be only one <orgname> in each <assignee>, .find is enough and there is no need for .findAll. Apply .find to each element of assign with a list comprehension:

orgnames = [item.find("orgname") for item in assign]

Or, to get their text directly, skipping any <assignee> that has no <orgname>:

orgnames = [item.find("orgname").text for item in assign if item.find("orgname")]
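
For reference, here is a minimal, self-contained sketch of the same idea, run against a trimmed copy of the XML from the question. It is only an illustration under a few assumptions: the helper text_or_blank and the variable names rows and selected are made up for this example, "html.parser" is swapped in for "lxml" so the snippet runs without the lxml package, and find_all is the newer spelling of findAll. Searching inside each <assignee> keeps the attorney's <orgname> and <country> out of the results, which searching the whole soup would not.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question: one agent and one assignee,
# each with its own <orgname>.
xml_string = """
<parties><agents>
<agent sequence="01" rep-type="attorney"><addressbook>
<orgname>Sawyer Law Group LLP</orgname>
<address><country>unknown</country></address>
</addressbook></agent>
</agents></parties>
<assignees><assignee><addressbook>
<orgname>International Business Machines Corporation</orgname>
<role>02</role>
<address><city>Armonk</city><state>NY</state><country>US</country></address>
</addressbook></assignee></assignees>
"""

# Assumption: the built-in parser is used here instead of "lxml".
soup = BeautifulSoup(xml_string, "html.parser")


def text_or_blank(parent, name):
    """Hypothetical helper: text of the first <name> under parent, or ''."""
    tag = parent.find(name)
    return tag.text if tag is not None else ""


# find_all returns a ResultSet, so loop over it and search inside each tag.
rows = []
for assignee in soup.find_all("assignee"):
    rows.append(
        [text_or_blank(assignee, n) for n in ("orgname", "city", "state", "country")]
    )

# A CSS descendant selector is another way to scope the search.
selected = [tag.text for tag in soup.select("assignee orgname")]

print(rows)
# [['International Business Machines Corporation', 'Armonk', 'NY', 'US']]
print(selected)
# ['International Business Machines Corporation']
```

Each entry in rows could then be passed to writer.writerow alongside the other fields, instead of zipping document-wide findAll results that may not line up.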