Mr. A - 1 year ago
HTML Question

Cookies, JavaScript, Python, Browsing-but-not-really

Before, I think I beat around the bush, because I wasn't clear on the ethics of prancing around someone's website with Python. I saw one answer on Stack Overflow that was close to what I needed, but it was deleted on request. But I'll put those reservations aside.

I want to automatically grab a bunch of prices from a grocery store website. I began my project somewhat new to, and rusty with, Python. I grabbed the HTML files by hand from my browser sessions and ran a bunch of loops to extract the data I wanted (a lot of '.find'). The catch was that, at the time, I was searching (.find()) HTML files that I had downloaded manually. When I switched my code over to using "urlopen", I ran into a problem I didn't immediately recognize.
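For context, the '.find' approach on a saved page might look something like the sketch below. The markup, the `class="price">` marker, and the `extract_prices` helper are all made up for illustration; the real pages will certainly differ.

```python
def extract_prices(html, marker='class="price">'):
    """Collect the text that follows each marker, up to the next '<'."""
    prices = []
    pos = html.find(marker)
    while pos != -1:
        start = pos + len(marker)
        end = html.find('<', start)
        if end == -1:  # unterminated tag: stop rather than guess
            break
        prices.append(html[start:end].strip())
        pos = html.find(marker, end)
    return prices


# Hypothetical snippet standing in for a manually saved product page.
saved_page = (
    '<li><span class="price">$2.99</span></li>'
    '<li><span class="price"> $0.89 </span></li>'
)
print(extract_prices(saved_page))
```

This works on any string of HTML, which is why the same loops can run against a saved file or a fetched response, as long as both contain the same markup.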

This page, for example, shows two different things depending on what your browsing status is.

And I suppose it ought to, because in a business like this, products and prices could be very sensitive to geography.

My idea has been to start the 'Python-ing' at this page where I already know the store I want to select:

and I have this form in particular:

<form action="/custserv/save_user_store.cmd"
method="post" name="selectThisStoreForm"
<input type='hidden' name='form_state' value='selectThisStoreForm'/>
<input name="storeId" type="hidden" value="21026"/><p class="browseStoreLink">
<a href="javascript:void(0);"
<input class="shopNow" type="image" src="/assets/hf/assets/images/buttons/btn_shopNow.gif" border="0" alt="Shop Now"/>

So the form's onsubmit calls a JS function and posts to a page that isn't meant to be seen by humans.
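One way to see which fields the form actually submits is to collect its hidden inputs. The sketch below uses the standard library's html.parser on a tidied copy of the form above (closing tags are added here so the snippet parses cleanly, and the link and image attributes are trimmed). `HiddenInputCollector` is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Gather name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input .../> are routed here as well.
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden' and 'name' in attrs:
            self.fields[attrs['name']] = attrs.get('value', '')


# Tidied copy of the form from the question (an assumption about its full shape).
form_html = '''
<form action="/custserv/save_user_store.cmd" method="post" name="selectThisStoreForm">
<input type='hidden' name='form_state' value='selectThisStoreForm'/>
<input name="storeId" type="hidden" value="21026"/>
<input class="shopNow" type="image" alt="Shop Now"/>
</form>
'''

collector = HiddenInputCollector()
collector.feed(form_html)
print(collector.fields)
```

The name/value pairs collected this way are the ones a browser would include when it submits the form.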

Chrome says I always have 10 cookies when I am in a session with Hannaford: 7 from "" and 3 from "".

So, just flailing a little bit:

sesh = requests.Session()
Params = {'selectThisStoreForm':''}
url = sesh.post("", params=Params)


I am getting cookies out of the Session, but not as many as Chrome says it gets. I am also unable to ".find" the tags I want in each of these pages.
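To compare against what Chrome reports, it can help to tally the session's cookies per domain. This is a minimal sketch, assuming only that the jar yields cookie objects with `.name` and `.domain` attributes (the jar on a requests Session does). `cookies_by_domain` and `cookie_names` are hypothetical helpers, and the demo jar is a stand-in so the code runs offline. Keep in mind that requests does not execute JavaScript, so any cookies a browser sets from scripts would not appear in the session, which may account for part of the difference.

```python
from collections import Counter
from types import SimpleNamespace


def cookies_by_domain(jar):
    """Count cookies per domain for any iterable of cookie objects."""
    return Counter(cookie.domain for cookie in jar)


def cookie_names(jar):
    """Sorted cookie names, handy for diffing against the browser's list."""
    return sorted(cookie.name for cookie in jar)


# Stand-in cookies with placeholder domains; a real call would pass sesh.cookies.
demo_jar = [
    SimpleNamespace(name='JSESSIONID', domain='www.example.com'),
    SimpleNamespace(name='storeId', domain='www.example.com'),
    SimpleNamespace(name='tracker', domain='.example.com'),
]

print(cookies_by_domain(demo_jar))
print(cookie_names(demo_jar))
```

With the session from the question, the call would be `cookies_by_domain(sesh.cookies)`.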

Answer Source

There is no need to use urllib.urlopen; just use sesh.get(url), and the cookies will be sent automatically. You are also not sending the right parameters for the form. Try:

params = {'form_state': 'selectThisStoreForm', 'storeId': '21026'}
# The form uses method="post", so send the fields in the request body.
sesh.post('', data=params)
resp = sesh.get(urlFRUITS)
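The difference between requests' `params=` and `data=` keyword arguments can be checked without touching the network by preparing a request and inspecting it. In this sketch the host example.com is a placeholder, not the store's real address; the path and field names are taken from the form in the question.

```python
import requests

url = 'http://example.com/custserv/save_user_store.cmd'  # placeholder host
fields = {'form_state': 'selectThisStoreForm', 'storeId': '21026'}

# params= appends the fields to the URL as a query string.
as_query = requests.Request('POST', url, params=fields).prepare()

# data= form-encodes the fields into the request body, as a browser
# does when it submits a method="post" form.
as_body = requests.Request('POST', url, data=fields).prepare()

print(as_query.url)
print(as_query.body)
print(as_body.url)
print(as_body.body)
print(as_body.headers['Content-Type'])
```

Whether the server also accepts the fields in the query string is something only a test against the real site can tell, so sending them the way the form does is the safer starting point.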

Alternatively, you could try the requests library and its Session object, which manages cookies automatically, e.g.:

>>> import requests
>>> s = requests.Session()
>>> r = s.get('')
>>> print(r.status_code)
>>> for c in s.cookies:
...     print(c)
...
<Cookie JSESSIONID=<ID> for>
>>> payload = {'form_state': 'selectThisStoreForm', 'storeId': '62012'}
>>> r = s.post('', data=payload)
>>> print(r.status_code)
>>> for c in s.cookies:
...     print(c)
...
<Cookie JSESSIONID=<ID> for>

Without knowing exactly what you are doing, I would try the requests.Session object.
