Mr. A - 1 month ago
HTML Question

Cookies, JavaScript, Python, Browsing-but-not-really

Before, I think I beat around the bush, because I wasn't clear on the ethics of prancing around someone's website with Python. I saw one answer on Stack Overflow that was close to what I needed, but it was deleted at ticketmaster.com's request. But I'll put those reservations aside.

I want to automatically grab a bunch of prices from a grocery store website. I began my project somewhat new to and rusty with Python. I saved the pages by hand from my browser sessions and ran a bunch of loops to extract the data I wanted (a lot of '.find'). The problem was that, at the time, I was searching (.find()) HTML files I had downloaded manually. When I switched my code over to using "urlopen", I ran into a problem I didn't immediately recognize.

This page, for example, shows two different things depending on the state of your browsing session.

http://www.hannaford.com/thumbnail/Produce/Fruits/pc/28546/46815.uts?displayAll=true


And I suppose it ought to, because in a business like this, products and prices could be very sensitive to geography.

My idea has been to start the 'Python-ing' at this page where I already know the store I want to select:
www.hannaford.com/custserv/store_detail.jsp?viewStoreId=21026

and I have this form in particular:

<form action="/custserv/save_user_store.cmd"
method="post" name="selectThisStoreForm"
onsubmit="return StoreLocator.change.store(this,false,false,21026);"
>
<input type='hidden' name='form_state' value='selectThisStoreForm'/>
<input name="storeId" type="hidden" value="21026"/><p class="browseStoreLink">
<a href="javascript:void(0);"
onclick="this.form.submit();"
class="altLink"
>
<input class="shopNow" type="image" src="/assets/hf/assets/images/buttons/btn_shopNow.gif" border="0" alt="Shop Now"/>
</a>
</p>
</form>


So the onsubmit handler runs a JS function, and the form posts to a page that isn't meant to be seen by humans.

Chrome says I always have 10 cookies when I am in a session with hannaford: 7 from "hannaford.com" and 3 from "www.hannaford.com".

So, just flailing a little bit:

sesh = requests.Session()
Params = {'selectThisStoreForm':''}
url = "http://www.hannaford.com/custserv/save_user_store.cmd"
sesh.post(url,param=Params)

urlopen(urlFRUITS,cookies=sesh.cookies)#??


I am getting cookies out of the Session, but not as many as Chrome says it gets. I am also not able to ".find" the tags I want to find in each of these pages.

Answer

There is no need to use urllib.urlopen; just use sesh.get(url) and the cookies will be sent automatically. You are also not sending the right parameters for the form: post its two hidden fields, form_state and storeId, as form data (data=, not param=). Try:

params = {'form_state': 'selectThisStoreForm', 'storeId': '21026'}
sesh.post('http://www.hannaford.com/custserv/save_user_store.cmd', data=params)
resp = sesh.get(urlFRUITS)
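
The payload keys above come straight from the form's hidden inputs. As a rough sketch of how they could be read out of the page instead of typed by hand, the standard library's html.parser can collect them. The names HiddenInputCollector, hidden_fields and FORM_HTML are made up for this example, the HTML is a trimmed copy of the form in the question, and the sketch assumes the page contains only one form of interest.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input .../> are also routed here by default.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of every hidden input found in the given HTML text."""
    parser = HiddenInputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Trimmed copy of the form from the question, used here as offline sample input.
FORM_HTML = """
<form action="/custserv/save_user_store.cmd" method="post" name="selectThisStoreForm">
<input type='hidden' name='form_state' value='selectThisStoreForm'/>
<input name="storeId" type="hidden" value="21026"/>
<input class="shopNow" type="image" src="/assets/hf/assets/images/buttons/btn_shopNow.gif" alt="Shop Now"/>
</form>
"""

payload = hidden_fields(FORM_HTML)
print(payload)
```

The resulting dict can be passed as data= to the session's post call. The image input is skipped because its type is not hidden.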

Alternatively, you could try the requests library and its Session object, which manages cookies automatically, e.g.:

>>> import requests
>>> s = requests.Session()
>>> r = s.get('http://www.THEWEBSITE.com/custserv/locate_store.cmd')
>>> print(r.status_code)
200
>>> for c in s.cookies:
...     print(c)
...
<Cookie JSESSIONID=<ID> for www.THEWEBSITE.com/>
<Cookie PIPELINE_SESSION_ID=<ID> for www.THEWEBSITE.com/>
>>> payload = {'form_state': 'selectThisStoreForm', 'storeId': '62012'}
>>> r = s.post('http://www.THEWEBSITE.com/custserv/save_user_store.cmd', data=payload)
>>> print(r.status_code)
200
>>> for c in s.cookies:
...     print(c)
...
<Cookie JSESSIONID=<ID> for www.THEWEBSITE.com/>
<Cookie PIPELINE_SESSION_ID=<ID> for www.THEWEBSITE.com/>
<Cookie USER_SESSION_VALIDATE_COOKIE=false for www.THEWEBSITE.com/>

Without knowing exactly what you are doing, I would try the requests.Session object.
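
Putting the two steps together, a minimal sketch of the whole flow might look like this. The function name fetch_page_for_store is made up, and the stub classes exist only so the example runs offline; with the real library you would pass a requests.Session() instead. It assumes the site accepts the form fields shown in the question and needs nothing beyond the cookies the POST sets, which has not been verified here.

```python
SAVE_STORE_URL = "http://www.hannaford.com/custserv/save_user_store.cmd"


def fetch_page_for_store(session, store_id, page_url, save_url=SAVE_STORE_URL):
    """Select a store, then fetch a page using the same session.

    `session` is anything with requests.Session-style post() and get() methods.
    """
    payload = {"form_state": "selectThisStoreForm", "storeId": str(store_id)}
    saved = session.post(save_url, data=payload)  # form body, not query string
    saved.raise_for_status()
    page = session.get(page_url)  # cookies from the POST ride along
    page.raise_for_status()
    return page.text


class StubResponse:
    """Offline stand-in for a response object (illustration only)."""

    def __init__(self, text=""):
        self.text = text

    def raise_for_status(self):
        return None


class StubSession:
    """Offline stand-in for a session that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def post(self, url, data=None):
        self.calls.append(("POST", url, data))
        return StubResponse()

    def get(self, url):
        self.calls.append(("GET", url, None))
        return StubResponse("<html>stub page</html>")


stub = StubSession()
fruits_url = "http://www.hannaford.com/thumbnail/Produce/Fruits/pc/28546/46815.uts?displayAll=true"
html = fetch_page_for_store(stub, 21026, fruits_url)
for call in stub.calls:
    print(call)
```

With requests installed, the equivalent live call would be fetch_page_for_store(requests.Session(), 21026, urlFRUITS), after which the returned text can be searched with .find() as before.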
