Bluelily - 2 months ago
HTML Question

I want to inherit from the BeautifulSoup class to do the following tasks

I am running Python 3.5.1 with BeautifulSoup 4.

I currently have this code:

from bs4 import BeautifulSoup
import html5lib

class LinkFinder(BeautifulSoup):

    def __init__(self):
        super().__init__()

    def handle_starttag(self, name, attrs):
        print(name)


When I instantiate the class with the following code:
findmylink = LinkFinder()
and then load my HTML with the following code:
findmylink.feed("""<html><head><title>my name is good</title></head><body>hello world</body></html>""",'html5lib')

I got the following error in my console:

'NoneType' object is not callable


What I actually want is to reproduce the following sample code, except that I want to use BeautifulSoup instead of html.parser:

from html.parser import HTMLParser

class LinkFinder(HTMLParser):

    def __init__(self):
        super().__init__()

    def handle_starttag(self, tag, attrs):
        print(tag)


When I instantiate this class with the following code:
findmylink = LinkFinder()
and then load my HTML with the following code:
findmylink.feed("""<html><head><title>my name is good</title></head><body>hello world</body></html>""")
I get the following output:

html
head
title
body


which is the desired output.

Answer

If you want to go this way, change your implementation so that __init__ accepts the markup and passes it on to BeautifulSoup, and so that handle_starttag accepts all the arguments it is called with:

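from bs4 import BeautifulSoup

class LinkFinder(BeautifulSoup):

  def __init__(self, markup):
    # Parsing happens during initialization, so the markup is passed here.
    super().__init__(markup, 'html.parser')

  # Later bs4 releases also pass keyword arguments such as sourceline and
  # sourcepos; **kwargs absorbs them so the override keeps working.
  def handle_starttag(self, name, namespace, nsprefix, attrs, **kwargs):
    print(name)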

Initializing with:

l = LinkFinder("""<html><head><title>my name is good</title></head><body>hello world</body></html>""")

Prints out:

html
head
title
body
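
To see why overriding handle_starttag has any effect, here is a rough, stdlib-only model of the callback flow. It assumes the parser object forwards every start tag to the object that owns it, and that parsing is triggered from __init__. ToySoup, _ToyBuilder and ToyLinkFinder are hypothetical names made up for this sketch; they are not bs4 classes and do not reproduce bs4's internals.

```python
from html.parser import HTMLParser


class _ToyBuilder(HTMLParser):
    """Hypothetical stand-in for a tree builder: forwards events to its owner."""

    def __init__(self, owner):
        super().__init__()
        self.owner = owner

    def handle_starttag(self, tag, attrs):
        # Forward using the four-argument shape from the answer's override.
        self.owner.handle_starttag(tag, None, None, dict(attrs))


class ToySoup:
    """Hypothetical model of an object that parses its markup in __init__."""

    def __init__(self, markup):
        self.markup = markup
        self.builder = _ToyBuilder(self)
        self._feed()

    def _feed(self):
        # Delegate to the builder's feed(); ToySoup has no public feed().
        self.builder.feed(self.markup)
        self.builder.close()

    def handle_starttag(self, name, namespace, nsprefix, attrs):
        pass  # a real implementation would create a tree node here


class ToyLinkFinder(ToySoup):
    def __init__(self, markup):
        self.seen = []  # must exist before ToySoup.__init__ starts parsing
        super().__init__(markup)

    def handle_starttag(self, name, namespace, nsprefix, attrs):
        self.seen.append(name)


finder = ToyLinkFinder(
    "<html><head><title>my name is good</title></head>"
    "<body>hello world</body></html>"
)
print(finder.seen)  # ['html', 'head', 'title', 'body']
```

Note that an override like this replaces the base method entirely, so in this model nothing else is done with the start tag.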

I'm pretty sure the BeautifulSoup class overrides __getattr__ so that an undefined attribute is treated as a search for a child tag of that name, which returns None when there is no such tag instead of raising AttributeError; that's what's causing your error:

print(type(BeautifulSoup().feed))
<class 'NoneType'>
print(type(BeautifulSoup().feedededed))
<class 'NoneType'>

Also, BeautifulSoup doesn't have a feed method as HTMLParser does (it does have a _feed method, which calls the underlying feed of the builder object with self.markup), so findmylink.feed evaluates to None, and calling None raises the error you see.
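
The failure can be reproduced without bs4. The following is a minimal sketch, assuming only that unknown attributes resolve to None; ToyTag is a hypothetical class written for this illustration, not part of any library.

```python
class ToyTag:
    """Hypothetical class: unknown attributes resolve to None, not an error."""

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails.
        if name.startswith("__"):
            raise AttributeError(name)
        return None


tag = ToyTag()
print(type(tag.feed).__name__)  # NoneType

try:
    tag.feed("<html></html>")  # looks like a method call, but calls None
    message = None
except TypeError as exc:
    message = str(exc)

print(message)  # 'NoneType' object is not callable
```

Because the attribute lookup itself succeeds, the mistake only surfaces at the call, which is why the traceback mentions NoneType rather than a missing feed attribute.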
