I am crawling the web using urllib3. Example code:
from urllib3 import PoolManager
pool = PoolManager()
response = pool.request("GET", url)
How can I check how large the response is before downloading the whole body, so I can skip pages that are too big?
If the server supplies a Content-Length header, then you can use that to determine whether you'd like to continue downloading the remainder of the body or not. If the server does not provide the header, then you'll need to stream the response until you decide you no longer want to continue.
To do this, you'll need to make sure that you're not preloading the full response.
from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url, preload_content=False)

# Maximum amount we want to read
max_bytes = 1000000

content_bytes = response.headers.get("Content-Length")
if content_bytes and int(content_bytes) < max_bytes:
    # Expected body is smaller than our maximum, read the whole thing
    data = response.read()
    # Do something with data
    ...
elif content_bytes is None:
    # Alternatively, stream until we hit our limit
    amount_read = 0
    for chunk in response.stream():
        amount_read += len(chunk)
        # Save chunk
        ...
        if amount_read > max_bytes:
            break

# Release the connection back into the pool
response.release_conn()
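If you're crawling many URLs, you may want to wrap this pattern in a small helper so the connection is always released, even if reading fails partway through. Here's a minimal sketch of that idea; the fetch_limited name and the return-None-when-too-large behavior are my own choices for illustration, not part of urllib3:

from urllib3 import PoolManager

pool = PoolManager()
max_bytes = 1000000

def fetch_limited(url, limit=max_bytes):
    # Return up to `limit` bytes of the body, or None if the response is too large.
    response = pool.request("GET", url, preload_content=False)
    try:
        content_length = response.headers.get("Content-Length")
        if content_length is not None and int(content_length) > limit:
            # Declared size exceeds our limit, skip the body entirely
            return None
        chunks = []
        amount_read = 0
        for chunk in response.stream():
            amount_read += len(chunk)
            if amount_read > limit:
                # Body turned out larger than we want, give up
                return None
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        # Always return the connection to the pool
        response.release_conn()

data = fetch_limited("https://example.com/")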