
BeautifulSoup constructors and their arguments

I've seen here on SO many ways to initialize a BeautifulSoup object. As far as I can see, you can either pass a string or pass some object. For instance, it's common to use urllib:

from bs4 import BeautifulSoup
import urllib.request

url="https://somesite.com"
url_html="<html><body><h1>Some header</h1><p>asdas</p></body></html>"
soup1=BeautifulSoup(url_html, "html.parser") #1st way
print(soup1.find("p").text) #can get the text "asdas"

soup2=BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser") #2nd way

soup3=BeautifulSoup(urllib.request.urlopen(url), "html.parser") #3rd way

print(soup1.prettify())
print(soup2.prettify())
print(soup3.prettify())


But what happens inside the last two ways of initializing the soup? As far as I can see, urllib.request.urlopen(url).read() is the same thing as a pure HTML string like url_html. But what about soup3? Does it work because BeautifulSoup's constructor expects a string and the object returned by urlopen() has some toString-like method, so the object is converted into a string and the 3rd way is really the same as the 2nd?

Are there any other ways of initializing BeautifulSoup? Which is preferable?

Answer

urlopen() returns an open file-like object. The BeautifulSoup constructor uses duck typing to see whether it got a file or a string (to be precise, it checks hasattr(markup, "read")). If it got a file-like object, it simply calls its read() method.
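A minimal sketch of that check, assuming a simplified stand-in for the constructor (the real bs4 code does more around it, but the idea is the same; normalize_markup is a hypothetical helper, not part of the library):

def normalize_markup(markup):
    # If the argument looks like a file (it has a read() method), consume it;
    # otherwise assume it is already a string or bytes of markup.
    if hasattr(markup, "read"):
        markup = markup.read()
    return markup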

This is a common pattern in Python libraries that deal with large amounts of user-provided text data.

In BeautifulSoup's case the difference is non-existent. Other libraries might do something more intelligent with a file object, e.g. read it in chunks instead of loading it into memory all at once.
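A quick way to check this yourself (assuming network access and a reachable URL; example.com is used here purely for illustration): both forms produce the same soup, because BeautifulSoup calls read() on the file-like response internally.

from bs4 import BeautifulSoup
import urllib.request

url = "https://example.com"  # assumed reachable URL, for illustration only

with urllib.request.urlopen(url) as resp:
    soup_from_file = BeautifulSoup(resp, "html.parser")          # pass the response object
with urllib.request.urlopen(url) as resp:
    soup_from_bytes = BeautifulSoup(resp.read(), "html.parser")  # pass its bytes

print(soup_from_file.prettify() == soup_from_bytes.prettify())   # expected: True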
