Dan Savage - 3 months ago
HTML Question

Web scraping: distinguishing between the resources and the elements of a webpage

import mechanize
from bs4 import BeautifulSoup
import urllib2
import cookielib


cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_cookiejar(cj)
br.open("*******")



br.select_form(nr=0)
br.form['ctl00$BodyContent$Username'] = '****'
br.form['ctl00$BodyContent$Password'] = '****'
br.submit()

print br.response().read()


At the moment this scrapes a web page and returns its resources, but not the actual HTML of the page (the content and so on). How do I change it so that I get the HTML instead?

Answer

You're close. Pass the response to BeautifulSoup so that the tags are parsed into a tree you can print or iterate over.

import mechanize
from bs4 import BeautifulSoup
import urllib2 
import cookielib


cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_cookiejar(cj)
br.open("*******")



br.select_form(nr=0)
br.form['ctl00$BodyContent$Username'] = '****'
br.form['ctl00$BodyContent$Password'] = '****'
br.submit()

soup = BeautifulSoup(br.response().read(), "html.parser")

print soup
or, to print each top-level node separately:
for row in soup:
    print row
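
To make the difference between a page's resources and its content concrete, here is a minimal sketch that separates the two. It is not the BeautifulSoup approach above: it uses only the Python 3 standard library (html.parser) so that it runs without extra installs. The names PageSplitter, split_page and RESOURCE_ATTRS, the choice of which tags count as resources, and the sample HTML are all assumptions made for illustration; in practice you would feed it the string returned by br.response().read().

```python
from html.parser import HTMLParser

# Tags whose attribute points at an external resource.
# This mapping is an illustrative assumption, not an exhaustive list.
RESOURCE_ATTRS = {"script": "src", "link": "href", "img": "src"}


class PageSplitter(HTMLParser):
    """Collect resource URLs and visible text in two separate lists."""

    def __init__(self):
        super().__init__()
        self.resources = []
        self.text = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr_name = RESOURCE_ATTRS.get(tag)
        if attr_name:
            url = dict(attrs).get(attr_name)
            if url:
                self.resources.append(url)
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inline script or style code.
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


def split_page(html):
    """Return (resource_urls, text_fragments) for an HTML string."""
    parser = PageSplitter()
    parser.feed(html)
    parser.close()
    return parser.resources, parser.text


# Hypothetical sample page standing in for the real response body.
sample = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
</head><body>
<h1>Welcome</h1>
<p>Your account balance is <b>42</b>.</p>
</body></html>"""

resources, text = split_page(sample)
print(resources)
print(text)
```

With BeautifulSoup, the rough equivalents would be soup.find_all("script") to list the script tags and soup.get_text() to pull out the text.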