janos - 7 months ago
Python Question

urllib.request.urlopen cannot fetch the primaries page of Stack Overflow elections

I have a little script to summarize and sort the candidate scores in Stack Exchange election primaries. It works for most sites, except for Stack Overflow, where retrieving the URL using request.urlopen of urllib fails with a 403 error (Forbidden). To demonstrate the problem:

from urllib import request

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    request.urlopen(url).read()


In the output, the Math SE and Server Fault URLs work fine, but the Stack Overflow URL fails:


fetching http://math.stackexchange.com/election/5?tab=primary ...
fetching http://serverfault.com/election/5?tab=primary ...
fetching http://stackoverflow.com/election/7?tab=primary ...
Traceback (most recent call last):
  File "examples/t.py", line 11, in <module>
    request.urlopen(url).read()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden



Using curl, all URLs work. So the problem seems to be specific to request.urlopen of urllib. I tried on OS X and Linux, with the same result. What's going on? How can this be explained?

Answer

The cause appears to be the User-Agent header that urllib sends by default. This code works for me:

from urllib import request
from urllib.error import HTTPError

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    try:
        request.urlopen(url).read()
    except HTTPError:
        print('got an exception, changing user-agent to urllib default')
        req = request.Request(url)
        # Same value that urllib sends on its own under Python 3.4
        req.add_header('User-Agent', 'Python-urllib/3.4')
        try:
            request.urlopen(req)
        except HTTPError:
            print('got another exception, changing user-agent to something else')
            # Replaces the header set above with a non-default value
            req.add_header('User-Agent', 'not-Python-urllib/3.4')
            request.urlopen(req)
    print('success with url: {}'.format(url))

And here's the current output (2015-11-16) with blank lines added for readability:

fetching http://math.stackexchange.com/election/5?tab=primary ...
success with url: http://math.stackexchange.com/election/5?tab=primary

fetching http://serverfault.com/election/5?tab=primary ...
success with url: http://serverfault.com/election/5?tab=primary

fetching http://stackoverflow.com/election/7?tab=primary ...
got an exception, changing user-agent to urllib default
got another exception, changing user-agent to something else
success with url: http://stackoverflow.com/election/7?tab=primary
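
If the goal is just to get the script working, the header can be set up front instead of retrying after a failure. This is a minimal sketch, assuming the server only objects to the default Python-urllib value (that is what the run above suggests, not something Stack Overflow has confirmed). The helper names and the user-agent string election-summary-script/1.0 are made up for illustration:

```python
from urllib import request


def default_user_agent():
    """Return the User-Agent value that urllib's default opener sends."""
    opener = request.build_opener()
    # addheaders is a list of (name, value) tuples holding the default headers
    return dict(opener.addheaders).get('User-agent')


def build_request(url, user_agent='election-summary-script/1.0'):
    """Build a Request with a custom User-Agent (hypothetical value)."""
    return request.Request(url, headers={'User-Agent': user_agent})


if __name__ == '__main__':
    print('urllib default:', default_user_agent())
    req = build_request('http://stackoverflow.com/election/7?tab=primary')
    # Request normalizes header names, so look it up as 'User-agent'
    print('custom header:', req.get_header('User-agent'))
    # Uncomment to perform the actual fetch (needs network access):
    # html = request.urlopen(req).read()
```

Printing the default first makes it easy to check which value your Python version sends, since the version suffix differs between releases.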