So the following code works perfectly when I run it on my local machine in PyCharm/from shell-script:
# -*- coding: utf-8 -*-
import requests
from lxml import etree, html
import chardet


def gimme_pairs():
    url = "https://halbidoncom/sha.xml"
    page = requests.get(url).content
    # Detect the encoding and normalise the payload to UTF-8 bytes.
    # chardet may report None, so guard before decoding.
    encoding = chardet.detect(page)['encoding']
    if encoding and encoding.lower() != 'utf-8':
        page = page.decode(encoding, 'replace').encode('utf-8')
    doc = html.fromstring(page, base_url=url)
    print(doc)
    print(page)
    wanted = doc.xpath('//location')
    print(wanted)
    date_list = None
    tashkif_list = None
    # Only the values from the last <location> element are kept.
    for elem in wanted:
        date_list = elem.xpath('locationdata/timeunitdata/date/text()')
        tashkif_list = elem.xpath('locationdata/timeunitdata/element/elementvalue/text()')
But on PythonAnywhere, print(page) gives me this instead of the XML:
b'\n\n\nChallenge=355121;\nChallengeId=58551073;\nGenericErrorMessageCookies="Cookies
must be enabled in order to view this page.";\n\n\nfunction test(var1)\n{\n\tvar
var_str=""+Challenge;\n\tvar var_arr=var_str.split("");\n\tvar
LastDig=var_arr.reverse()[0];\n\tvar minDig=var_arr.sort()[0];\n\tvar subvar1 =
(2 * (var_arr[2]))+(var_arr[1]*1);\n\tvar subvar2 = (2 * var_arr[2])+var_arr[1];\n\tvar
my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);\n\tvar
x=(var1*3+subvar1)*1;\n\tvar y=Math.cos(Math.PI*subvar2);\n\tvar
answer=x*y;\n\tanswer-=my_pow*1;\n\tanswer+=(minDig*1)-(LastDig*1);\n\tanswer=answer+subvar2;\n\treturn
answer;\n}\n\n\nclient = null;\nif (window.XMLHttpRequest)\n{\n\tvar client=new
XMLHttpRequest();\n}\nelse\n{\n\tif (window.ActiveXObject)\n\t{\n\t\tclient = new
ActiveXObject(\'MSXML2.XMLHTTP.3.0\');\n\t};\n}\nif
(!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))\n{\n\tdocument.write("Not
all needed JavaScript methods are supported.");\n\n}\nelse\n{\n\tclient.onreadystatechange =
function()\n\t{\n\t\tif(client.readyState == 4)\n\t\t{\n\t\t\tvar
MyCookie=client.getResponseHeader("X-AA-Cookie-Value");\n\t\t\tif ((MyCookie == null) ||
(MyCookie==""))\n\t\t\t{\n\t\t\t\tdocument.write(client.responseText);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t\n\t\t\tvar
cookieName = MyCookie.split(\'= \')[0];\n\t\t\tif
(document.cookie.indexOf(cookieName)==-1)\n\t\t\t{\n\t\t\t\tdocument.write(GenericErrorMessageCookies);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\twindow.location.reload(true);\n\t\t}\n\t};\n\ty=test(Challenge);\n\tclient.open("POST",window.location,true);\n\tclient.setRequestHeader(\'X-AA-Challenge-ID\',
ChallengeId);\n\tclient.setRequestHeader(\'X-AA-Challenge-Result\',y);\n\tclient.setRequestHeader(\'X-AA-Challenge\',Challenge);\n\tclient.setRequestHeader(\'Content-Type\',
\'text/plain\');\n\tclient.send();\n}\n\n\n\nJavaScript must be enabled in order to view this
page.\n\n'
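Looks like the server you're trying to scrape has protection that tries to make sure the request comes from a real browser with a human behind it. If you format that response nicely, you'll see that the script takes the Challenge and ChallengeId values set at the beginning, computes an answer with test(Challenge), and POSTs it back to the same URL in the X-AA-Challenge, X-AA-Challenge-ID and X-AA-Challenge-Result request headers. It then expects a cookie, named in the X-AA-Cookie-Value response header, before it reloads the page.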
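To tell the challenge page apart from the real XML, you could look for those two variables in the response body. This is only a minimal sketch: the function name extract_challenge and the regular expressions are my own, based on the output you pasted, and the real page may differ in small ways.

```python
import re

# The two JavaScript variables at the top of the challenge page.
# \b keeps "Challenge=" from matching inside a longer identifier.
CHALLENGE_RE = re.compile(rb"\bChallenge=(\d+);")
CHALLENGE_ID_RE = re.compile(rb"\bChallengeId=(\d+);")


def extract_challenge(body: bytes):
    """Return (challenge, challenge_id) as strings, or None for a normal page."""
    challenge = CHALLENGE_RE.search(body)
    challenge_id = CHALLENGE_ID_RE.search(body)
    if not (challenge and challenge_id):
        return None
    return (
        challenge.group(1).decode("ascii"),
        challenge_id.group(1).decode("ascii"),
    )
```

In gimme_pairs() you would call extract_challenge(page) right after the GET and only hand the page to lxml when it returns None.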
I assume the server owners have added the IPs/servers that PythonAnywhere uses to a block list (maybe someone really spammed them in the past?), which would explain why the same code works on your local machine.
Having a look around for the same headers, I've found this project which seems to have solved the same problem: https://github.com/niryariv/opentaba-server/
They check for the challenge: https://github.com/niryariv/opentaba-server/blob/master/lib/mavat_scrape.py#L31 and parse them with this helper: https://github.com/niryariv/opentaba-server/blob/master/lib/helpers.py#L109
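If you'd rather see the idea than read their code, here is my own rough Python port of the test() function from the page you pasted. It is not the opentaba helper. I'm assuming that the two operators that look lost in the pasted output (after "(var1*3+subvar1)" and after "Math.PI") are multiplications, and I've replaced the cosine with a parity check, since cos(pi * n) is +1 or -1 for a whole number n. The names solve_challenge and challenge_headers are made up for this sketch.

```python
def solve_challenge(challenge: str) -> str:
    """Sketch of a Python port of the page's JavaScript test() function."""
    digits = list(challenge)
    last_digit = int(digits[-1])   # JS: var_arr.reverse()[0]
    digits.sort()                  # JS then sorts the same array in place
    min_digit = int(digits[0])     # JS: var_arr.sort()[0]

    # From here on, the indices refer to the sorted digits, as in the JS.
    subvar1 = 2 * int(digits[2]) + int(digits[1])   # numeric sum
    subvar2 = str(2 * int(digits[2])) + digits[1]   # number + string concatenates in JS
    my_pow = (int(digits[0]) + 2) ** int(digits[1])

    x = int(challenge) * 3 + subvar1   # assumes the lost operator was "*"
    # JS: Math.cos(Math.PI * subvar2), again assuming "*". For a whole
    # number this is +1 or -1, so parity keeps everything in integers.
    y = 1 if int(subvar2) % 2 == 0 else -1

    answer = x * y - my_pow + (min_digit - last_digit)
    return str(answer) + subvar2       # the final JS "+" is string concatenation


def challenge_headers(challenge: str, challenge_id: str) -> dict:
    """Headers the page's script sets before POSTing back to the same URL."""
    return {
        "X-AA-Challenge": challenge,
        "X-AA-Challenge-ID": challenge_id,
        "X-AA-Challenge-Result": solve_challenge(challenge),
        "Content-Type": "text/plain",
    }
```

With a requests.Session() you would POST to the same URL with headers=challenge_headers(challenge, challenge_id) and then repeat the GET on that session so any cookie the server sets is sent along. I haven't tested this against the site, so treat it as a starting point.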
Hope that helps!