Penny Penny - 7 months ago 65
Python Question

How to use BeautifulSoup to get parent tag's name value when the parent tag's children has certain attribute values?

To make this question easier to understand, below is an example

<Tag name="Thumbnail" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="AText" Searchable="yes"></Attribute>
</Attribute>
</Tag>

<Tag name="Label" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="AText" Searchable="no"></Attribute>
</Attribute>
</Tag>

<Tag name="Image" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="BText" Searchable="yes">
</Attribute>
</Tag>

<Tag name="Wonder" inline="no" nonsearchable="yes">
<Attribute>
<Attribute name="BText" Searchable="yes"></Attribute>
</Attribute>
</Tag>


Expected Result

enter image description here

So in the excel, the first row should be the the Attribute tag's name value if the Attribute tag's Searchable value is "yes"; Then these "qualified" Attributes tag's parent tag - Tag - name value would be listed underneath.

Currently, I can only find all the Tag's name value if it's children's Searchable value is "yes", but can't categorize them under corresponding Attribute tag's name value. Below is my initial code:

import os, openpyxl
from bs4 import BeautifulSoup

cwd = os.getcwd()

def func(x):
for file in os.listdir(cwd):
if file.endswith('.xml'):
f = open(file, encoding = 'utf=8', mode = 'r+')
soup = BeautifulSoup(f, 'lxml')
AttrYES = soup.find_all(attrs={"Searchable": "yes"})
for items in AttrYES:
tagName = items.parent.parent.get('name')
print (tagName)

x = os.listdir(cwd)
func(x)


I'll try to work this out as well, but to make the process faster, if you have any ideas, please advice. Thank you!!

Answer Source

Your code cannot find anything, if you print AttrYES, it will be [] . The thing is that when you use bs4 with parser lxml, all tag and attr name will convert to lowercase,refer to official doc. If you print the soup, it will give you:

<html><body><tag inline="no" name="Thumbnail" nonsearchable="yes">
<attribute>
<attribute name="AText" searchable="yes"></attribute>
</attribute>
</tag>
<tag inline="no" name="Label" nonsearchable="yes">
<attribute>
<attribute name="AText" searchable="no"></attribute>
</attribute>
</tag>
<tag inline="no" name="Image" nonsearchable="yes">
<attribute>
<attribute name="BText" searchable="yes">
</attribute>
</attribute></tag>
<tag inline="no" name="Wonder" nonsearchable="yes">
<attribute>
<attribute name="BText" searchable="yes"></attribute>
</attribute>
</tag></body></html>

Hence, you could amend your code like this:

import bs4
f = open('test.xml',mode = 'r+')
soup = bs4.BeautifulSoup(f, 'lxml')
AttrYES = soup.findAll(attrs={'searchable': 'yes'})
result = dict()
for items in AttrYES:
    result[items.get('name')] = result.get(items.get('name'),[])+[items.parent.parent.get('name')]    
print(result)

The print will be :

{'AText': ['Thumbnail'], 'BText': ['Image', 'Wonder']}

Then you could write them to your excel file:

import xlsxwriter

workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()

# Write header
worksheet.write(0, 0, result.keys()[0])
worksheet.write(0, 1, result.keys()[1])

# Write data.
worksheet.write_column(1, 0, result.values()[0])
worksheet.write_column(1, 1, result.values()[1])

workbook.close()

The result.xlsx will be:

enter image description here

Update: with openpyxl

from openpyxl import Workbook
wb = Workbook()

# grab the active worksheet
ws = wb.active

# Data can be assigned directly to cells
i,j = 1,1
for keys,values in a.items():
    ws.cell(column=i, row=1, value=keys)
    for row in range(len(values)):
        ws.cell(column=i, row=j+1, value=values[row])
        j+=1
    j=1
    i+=1
wb.save("result.xlsx")