David542 - 1 month ago
Python Question

res.text or res.content in requests

I occasionally use res.content or res.text, and neither seems to make a difference with requests (at least in the use cases I have had). What is the main difference when parsing HTML? For example:

import requests
from lxml import html
res = requests.get(...)
node = html.fromstring(res.content)


In the above situation, should I be using res.content or res.text? What is a good rule of thumb for when to use each?

Answer

From the documentation:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text. You might want to do this in any situation where you can apply special logic to work out what the encoding of the content will be. For example, HTML and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.
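To make the role of the encoding concrete, here is a minimal sketch that needs no network access. It assumes r.content holds raw bytes and r.text is those bytes decoded with r.encoding; the variable names are only for illustration and are not part of requests.

```python
# Stand-in for r.content: the raw bytes a server might send for "café" in UTF-8.
raw = "café".encode("utf-8")

# Roughly what r.text would give with r.encoding == "utf-8".
as_utf8 = raw.decode("utf-8")

# Roughly what r.text would give if r.encoding were wrongly "ISO-8859-1".
as_latin1 = raw.decode("ISO-8859-1")

print(as_utf8)    # café
print(as_latin1)  # cafÃ©
```

The bytes never change; only the decoding step does, which is why setting r.encoding changes what r.text returns.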

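In other words, r.content is the raw bytes of the response and r.text is those bytes decoded with r.encoding. So for text responses, I think the only case where you need r.content is when the server sends bogus or missing encoding headers and you have to find the correct encoding yourself, for example inside a meta tag.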
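The meta tag lookup could be sketched roughly like this. Both helper names (sniff_meta_charset, text_with_declared_charset) are hypothetical and not part of requests, and the regular expression is a simplification that only handles the common meta charset forms.

```python
import re

# Hypothetical helper, not part of requests: a simplified pattern that finds
# charset=... inside a <meta> tag, with or without quotes.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_meta_charset(raw, limit=2048):
    """Return the charset declared in the first `limit` bytes, or None."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii") if match else None


def text_with_declared_charset(res):
    """Decode a Response-like object using the charset its body declares.

    `res` is assumed to expose .content, .encoding and .text the way
    requests.Response does.
    """
    declared = sniff_meta_charset(res.content)
    if declared:
        res.encoding = declared  # res.text will now decode with this value
    return res.text
```

With a real response you would call text_with_declared_charset(res) where you would otherwise read res.text. If the body declares no charset, the encoding guessed from the headers is left untouched.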
