benzineengine benzineengine - 28 days ago 6
Python Question

Python: downloading a file that resists usual techniques

I am trying to write Python code to download and save a file from this URL:
http://obiee.banrep.gov.co/analytics/saw.dll?Download&Format=excel&Extension=.xls&BypassCache=true&lang=es&NQUser=publico&NQPassword=publico&Path=/shared/Consulta%20Series%20Estadisticas%20desde%20Excel/1.%20IPC%20base%202008/1.3.%20Por%20rango%20de%20fechas/1.3.2.%20Por%20grupo%20de%20gasto&ViewState=h09v965dvurdtkj0iuni7m1kbe&ContainerID=o%3ago%7er%3areport&RootViewID=go

The expected result is that the served Excel file is downloaded and saved.

The file is behind some sort of Oracle database. It downloads fine in any browser, and the "Live HTTP headers" Firefox extension tells me it's a GET request. However, I've tried the usual techniques and I always end up downloading "saw.dll", which is a simple XML file and not the expected Excel file.

Here's what I tried:

import urllib, urllib2, shutil

url = 'http://obiee.banrep.gov.co/analytics/saw.dll?Download'
values = {
'Format' : 'excel',
'Extension' : '.xls',
'BypassCache' : 'true',
'lang' : 'es',
'NQUser' : 'publico',
'NQPassword' : 'publico',
'Path' : '/shared/Consulta Series Estadisticas desde Excel/1. IPC base 2008/1.3. Por rango de fechas/1.3.2. Por grupo de gasto',
'ViewState' : 'h09v965dvurdtkj0iuni7m1kbe',
'ContainerID' : 'o%3ago%7er%3areport',
'RootViewID' : 'go',
}

data = urllib.urlencode(values)
req = urllib2.Request(url,data)
response = urllib2.urlopen(req)
myfile = open('test.xls', 'wb')
shutil.copyfileobj(response.fp, myfile)
myfile.close()


Other code I tried:

import requests,shutil

response = requests.get("http://obiee.banrep.gov.co/analytics/saw.dll?Download&Format=excel&Extension=.xls&BypassCache=true&lang=es&NQUser=publico&NQPassword=publico&Path=/shared/Consulta%20Series%20Estadisticas%20desde%20Excel/1.%20IPC%20base%202008/1.3.%20Por%20rango%20de%20fechas/1.3.2.%20Por%20grupo%20de%20gasto&ViewState=h09v965dvurdtkj0iuni7m1kbe&ContainerID=o%3ago%7er%3areport&RootViewID=go",stream=True)

with open('test.xls', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response


I also tried other stuff such as using wget, putting some delay between the request and the saving, etc.

Any ideas?

Thanks, best.

Answer

Did you try changing the user agent?

...
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
requests.get(url=url, stream=True, headers=headers)

Maybe the server returns different responses to different user agents.
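
To make the suggestion concrete, here is a minimal sketch in Python 3 using only the standard library (`urllib.request` in place of `requests`). It makes three assumptions that are mine, not from the question: the parameters are written unencoded (so `ContainerID` is `o:go~r:report`, the decoded form of `o%3ago%7er%3areport`) and percent-encoded exactly once; the request is sent as a GET, as the browser does, since passing `data` to `urllib2.Request` turns it into a POST; and a body that starts with `<` is treated as a hint that the XML page came back. The helper names `build_url`, `looks_like_markup` and `download` are hypothetical, and I have not verified that this server accepts the request.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://obiee.banrep.gov.co/analytics/saw.dll?Download"

# Parameters from the question, unencoded so they are encoded only once.
PARAMS = {
    "Format": "excel",
    "Extension": ".xls",
    "BypassCache": "true",
    "lang": "es",
    "NQUser": "publico",
    "NQPassword": "publico",
    "Path": "/shared/Consulta Series Estadisticas desde Excel/"
            "1. IPC base 2008/1.3. Por rango de fechas/"
            "1.3.2. Por grupo de gasto",
    "ViewState": "h09v965dvurdtkj0iuni7m1kbe",
    "ContainerID": "o:go~r:report",  # assumption: decoded form of the original
    "RootViewID": "go",
}

# Browser-like User-Agent, copied from the answer above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/39.0.2171.95 Safari/537.36"
}


def build_url(base, params):
    """Append params to base, encoding spaces as %20 and keeping '/'."""
    query = urllib.parse.urlencode(
        params, safe="/", quote_via=urllib.parse.quote
    )
    return base + "&" + query


def looks_like_markup(first_bytes):
    """Return True if the payload starts like XML/HTML, not binary data."""
    return first_bytes.lstrip().startswith(b"<")


def download(url, path, headers=HEADERS):
    """GET url with the given headers, save the body, report what came back."""
    request = urllib.request.Request(url, headers=headers)  # no data => GET
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    with open(path, "wb") as out_file:
        out_file.write(body)
    # Only a hint: some servers also serve HTML-based spreadsheets.
    return len(body), looks_like_markup(body[:64])
```

Calling `download(build_url(BASE_URL, PARAMS), "test.xls")` returns the number of bytes saved and a flag; if the flag is true, the server probably still sent the XML page rather than the spreadsheet, and it is worth comparing the remaining request headers (cookies, `Referer`) with what the browser sends.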