zerohedge zerohedge - 11 months ago 185
Python Question

python requests weird error on PythonAnywhere

The following code works perfectly when I run it on my local machine, both in PyCharm and from a shell script:

# -*- coding: utf-8 -*-

import requests
from lxml import etree, html
import chardet


def gimme_pairs():
    url = "https://halbidoncom/sha.xml"
    page = requests.get(url).content
    encoding = chardet.detect(page)['encoding']

    if encoding != 'utf-8':
        page = page.decode(encoding, 'replace').encode('utf-8')

    doc = html.fromstring(page, base_url=url)
    print(doc)
    print(page)
    wanted = doc.xpath('//location')

    print(wanted)

    date_list = None
    tashkif_list = None

    for elem in wanted:
        date_list = elem.xpath('locationdata/timeunitdata/date/text()')
        tashkif_list = elem.xpath('locationdata/timeunitdata/element/elementvalue/text()')


But on PythonAnywhere I get this output for doc:


b'\n\n\nChallenge=355121;\nChallengeId=58551073;\nGenericErrorMessageCookies="Cookies
must be enabled in order to view this
page.";\n\n\nfunction test(var1)\n{\n\tvar
var_str=""+Challenge;\n\tvar var_arr=var_str.split("");\n\tvar
LastDig=var_arr.reverse()[0];\n\tvar minDig=var_arr.sort()[0];\n\tvar
subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);\n\tvar
subvar2 = (2 * var_arr[2])+var_arr[1];\n\tvar
my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);\n\tvar
x=(var1*3+subvar1)*1;\n\tvar y=Math.cos(Math.PI*subvar2);\n\tvar
answer=x*y;\n\tanswer-=my_pow*1;\n\tanswer+=(minDig*1)-(LastDig*1);\n\tanswer=answer+subvar2;\n\treturn
answer;\n}\n\n\nclient = null;\nif
(window.XMLHttpRequest)\n{\n\tvar client=new
XMLHttpRequest();\n}\nelse\n{\n\tif
(window.ActiveXObject)\n\t{\n\t\tclient = new
ActiveXObject(\'MSXML2.XMLHTTP.3.0\');\n\t};\n}\nif
(!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))\n{\n\tdocument.write("Not
all needed JavaScript methods are supported.");\n\n}\nelse\n{\n\tclient.onreadystatechange =
function()\n\t{\n\t\tif(client.readyState == 4)\n\t\t{\n\t\t\tvar
MyCookie=client.getResponseHeader("X-AA-Cookie-Value");\n\t\t\tif
((MyCookie == null) || (MyCookie==""))\n\t\t\t{\n\t\t\t\tdocument.write(client.responseText);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t\n\t\t\tvar
cookieName = MyCookie.split(\'=\')[0];\n\t\t\tif
(document.cookie.indexOf(cookieName)==-1)\n\t\t\t{\n\t\t\t\tdocument.write(GenericErrorMessageCookies);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\twindow.location.reload(true);\n\t\t}\n\t};\n\ty=test(Challenge);\n\tclient.open("POST",window.location,true);\n\tclient.setRequestHeader(\'X-AA-Challenge-ID\',
ChallengeId);\n\tclient.setRequestHeader(\'X-AA-Challenge-Result\',y);\n\tclient.setRequestHeader(\'X-AA-Challenge\',Challenge);\n\tclient.setRequestHeader(\'Content-Type\',
\'text/plain\');\n\tclient.send();\n}\n\n\n\nJavaScript must be enabled in order to view this
page.\n\n'


Things I've tried:


  • Swapping requests for urllib.request.urlopen()
  • Adding headers manually
  • Ensuring the same packages are installed
  • Upgrading to a paid PythonAnywhere account



What gives? What strikes me is that requests is supposed to behave the same way on both my machine and theirs.

Answer

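It looks like the servers you're trying to scrape have protection that tries to make sure there's a real browser (and a human) behind the request. If you format that response nicely, you'll see that it is a JavaScript challenge: the script computes a result from the Challenge value at the top, POSTs it back to the same URL with the X-AA-Challenge, X-AA-Challenge-ID and X-AA-Challenge-Result request headers, and reloads the page once the expected cookie is present. requests does not run JavaScript, so all it ever receives is the challenge page itself.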
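To make the arithmetic in that script easier to follow, here is a rough Python 3 port of its test() function. It is only a sketch based on the response pasted in the question: the name solve_challenge is made up for this example, and there is no guarantee the server accepts the result or that the script stays the same over time. The one subtle point is that the JavaScript mixes numbers and strings, so subvar2 and the final return value are built by string concatenation rather than addition.

```python
import math


def solve_challenge(challenge):
    """Hypothetical port of the JavaScript test() function from the question.

    `challenge` is the Challenge value from the page (int or digit string).
    Returns the result as a string, as the JavaScript does.
    """
    digits = list(str(challenge))

    # var_arr.reverse()[0]: the last digit of the challenge.
    last_digit = int(digits[-1])

    # var_arr.sort() sorts in place, so every later index uses sorted digits.
    digits.sort()
    min_digit = int(digits[0])

    # Numeric in JavaScript: (2 * d2) + (d1 * 1).
    subvar1 = 2 * int(digits[2]) + int(digits[1])

    # Number + string in JavaScript concatenates, e.g. 4 + "1" gives "41".
    subvar2 = str(2 * int(digits[2])) + digits[1]

    my_pow = (int(digits[0]) + 2) ** int(digits[1])
    x = int(challenge) * 3 + subvar1

    # cos(pi * n) is +1 or -1 for an integer n; round() removes float noise.
    y = round(math.cos(math.pi * int(subvar2)))

    answer = x * y - my_pow + (min_digit - last_digit)

    # answer + subvar2 is again number + string, so it concatenates.
    return str(answer) + subvar2
```

Calling solve_challenge(355121) with the Challenge value from the question follows the same steps the browser would take before sending the X-AA-Challenge-Result header.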

I assume the server owners have added the IPs/servers that PythonAnywhere uses to a list of addresses whose requests get blocked (maybe someone really spammed them from there in the past?).

Having a look around for the same headers, I found this project, which seems to have solved the same problem: https://github.com/niryariv/opentaba-server/

They check for the challenge: https://github.com/niryariv/opentaba-server/blob/master/lib/mavat_scrape.py#L31 and parse it with this helper: https://github.com/niryariv/opentaba-server/blob/master/lib/helpers.py#L109
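The overall flow could be sketched in Python 3 along these lines. This is not the code from the linked project, just an illustration derived from the script in the question: parse_challenge and fetch_past_challenge are names invented here, solve is a caller-supplied function that turns the Challenge string into the result string, and it is an assumption that the server sets its cookie on the POST response so that a requests.Session carries it into the second GET.

```python
import re

# The two values at the top of the challenge page shown in the question.
CHALLENGE_RE = re.compile(rb"Challenge=(\d+);")
CHALLENGE_ID_RE = re.compile(rb"ChallengeId=(\d+);")


def parse_challenge(body):
    """Return (challenge, challenge_id) as strings, or None if `body`
    (bytes) does not look like the challenge page."""
    challenge = CHALLENGE_RE.search(body)
    challenge_id = CHALLENGE_ID_RE.search(body)
    if challenge is None or challenge_id is None:
        return None
    return (challenge.group(1).decode("ascii"),
            challenge_id.group(1).decode("ascii"))


def fetch_past_challenge(session, url, solve):
    """GET `url`; if the challenge page comes back, answer it and GET again.

    session: a requests.Session, so cookies persist between calls.
    solve:   hypothetical callable mapping the challenge string to the
             result string expected in X-AA-Challenge-Result.
    """
    response = session.get(url)
    parsed = parse_challenge(response.content)
    if parsed is None:
        # No challenge in the body: this is already the real page.
        return response.content

    challenge, challenge_id = parsed
    # Same headers the JavaScript sets before client.send().
    headers = {
        "X-AA-Challenge": challenge,
        "X-AA-Challenge-ID": challenge_id,
        "X-AA-Challenge-Result": solve(challenge),
        "Content-Type": "text/plain",
    }
    # The script POSTs to the page's own URL with an empty body.
    session.post(url, headers=headers)

    # Equivalent of window.location.reload(true) in the script.
    return session.get(url).content
```

With requests installed, this would be called as fetch_past_challenge(requests.Session(), url, solve), and the returned bytes passed on to chardet and lxml as in the question. If the second GET still returns the challenge page, the server probably expects something this sketch does not cover.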

Hope that helps!
