JanM JanM - 6 months ago 14
HTML Question

python - extracting tags and attributes from HTML - the hard way

After a long struggle I have managed to process one long input string into the following form, a single list:

['<', 'p', '>', '<', 'a', 'href', '>', '<', 'a', '>', '<', 'p', '>', '<', 'div', 'class', '>', '<', 'a', 'href', '>', '<', 'a', '>', '<', 'div', '>']


How can I now process that list further, efficiently and in a hard-coded way, to get each HTML tag and the attributes it carries?

In the end I want to confirm that p does not have any attributes, a has href, and div has a class attribute.

Jan Jan
Answer

Just for the sake of academic challenge, you could use the following (slightly adapted from this answer on Stack Overflow):

your_list = ['<', 'p', '>', '<', 'a', 'href', '>', '<', 'a', '>', '<', 'p', '>', '<', 'div', 'class', '>', '<', 'a', 'href', '>', '<', 'a', '>', '<', 'div', '>']

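# Walk the list with a sliding window of (previous, current, next) tokens.
for prev, cur, nxt in zip([None] + your_list[:-1], your_list, your_list[1:] + [None]):
    # A name directly enclosed by '<' and '>' is a tag without attributes.
    if prev == '<' and nxt == '>':
        print("%s is an empty element" % cur)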
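
The loop above only reports names that appear without attributes. To get every tag together with its attributes, as the question asks, one possible sketch is a single pass over the tokens. The function name `attributes_by_tag` is made up for this illustration, and it assumes the list always follows the pattern '<', tag name, zero or more attribute names, '>':

```python
from collections import defaultdict

def attributes_by_tag(tokens):
    """Group a flat token list into {tag name: set of attribute names}."""
    result = defaultdict(set)
    current = None       # tag whose attributes are currently being read
    expect_tag = False   # True right after a '<' token
    for tok in tokens:
        if tok == '<':
            expect_tag = True
        elif tok == '>':
            current = None
        elif expect_tag:
            current = tok
            result[current]  # register the tag even if it has no attributes
            expect_tag = False
        elif current is not None:
            result[current].add(tok)
    return dict(result)

your_list = ['<', 'p', '>', '<', 'a', 'href', '>', '<', 'a', '>', '<', 'p', '>',
             '<', 'div', 'class', '>', '<', 'a', 'href', '>', '<', 'a', '>',
             '<', 'div', '>']

for tag, attrs in sorted(attributes_by_tag(your_list).items()):
    print(tag, sorted(attrs))
```

Because closing tags show up in the list as a bare name between '<' and '>', they add nothing to the set already collected for that tag, so p ends up with no attributes, a with href, and div with class.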

But this is certainly not the best, fastest, or safest way to achieve your goal; it is better to use an appropriate parser like BeautifulSoup in the first place. That being said, see a demo on ideone.com.
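
To illustrate the parser route without extra dependencies, here is a rough sketch that uses `html.parser` from the Python 3 standard library as a stand-in for BeautifulSoup. The original input string is not shown in the question, so `sample_html` is an invented example that merely matches the token list, and the class name `TagAttributeCollector` is also just for illustration:

```python
from html.parser import HTMLParser

class TagAttributeCollector(HTMLParser):
    """Collect the attribute names seen on each tag."""

    def __init__(self):
        super().__init__()
        self.seen = {}  # maps tag name -> set of attribute names

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; only the names are kept
        self.seen.setdefault(tag, set()).update(name for name, _ in attrs)

# Assumed input: NOT the asker's real string, only a plausible equivalent.
sample_html = '<p><a href="#">link</a></p><div class="box"><a href="#">link</a></div>'

collector = TagAttributeCollector()
collector.feed(sample_html)
collector.close()

for tag, attrs in sorted(collector.seen.items()):
    print(tag, sorted(attrs))
```

Working on the real HTML instead of a pre-split list avoids having to tokenize by hand, which is where quoting and malformed markup usually cause trouble.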
