Hsun-Yi Hsieh - 4 months ago
Python Question

Parsing specific values in multiple pages

I have the following code, whose purpose is to parse specific information from each of several pages. The URLs of these pages follow a fixed pattern, so I use that pattern to build all the links at once for further parsing.

import urllib
import urlparse
import re
from bs4 import BeautifulSoup

Links = ["http://www.newyorksocialdiary.com/party-pictures?page=" + str(i) for i in range(2,27)]


This gives me a list of URLs. Next, I read each page and make the soups.

Rs = [urllib.urlopen(Link).read() for Link in Links]
soups = [BeautifulSoup(R) for R in Rs]


These produce the soups I want, but I cannot achieve the final goal: parsing tags of the form <a href=""> </a>. For instance,

<a href="/party-pictures/2007/something-for-everyone">Something for Everyone</a>


I am specifically interested in obtaining values like '/party-pictures/2007/something-for-everyone'. However, the code below does not do this.

As = [soup.find_all('a', attr = {"href"}) for soup in soups]


Could someone tell me what went wrong? I would highly appreciate your assistance. Thank you.

Answer

This should work:

As = [soup.find_all(href=True) for soup in soups]

This should give you all tags that have an href attribute.

If you only want 'a' tags that have an href, then the following would work:

As = [soup.find_all('a',href=True) for soup in soups]
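
Note that find_all returns tag objects, not the href strings the question asks for. Also, the keyword attr in the question is not find_all's attrs parameter, so it is most likely treated as a filter on an attribute literally named attr, which would explain why nothing useful came back. Below is a minimal sketch of pulling out the href values themselves. The sample markup, the helper name extract_hrefs, and the prefix filter are assumptions made for illustration, not part of the original code, and the explicit "html.parser" argument is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one downloaded page.
SAMPLE_HTML = """
<div>
  <a href="/party-pictures/2007/something-for-everyone">Something for Everyone</a>
  <a href="/about">About</a>
  <a name="top">No href here</a>
</div>
"""


def extract_hrefs(soup, prefix=""):
    """Return the href strings of <a> tags, optionally limited to a path prefix."""
    return [
        a["href"]                              # read the attribute value from the tag
        for a in soup.find_all("a", href=True)  # only <a> tags that carry an href
        if a["href"].startswith(prefix)
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
all_hrefs = extract_hrefs(soup)
party_hrefs = extract_hrefs(soup, prefix="/party-pictures/")
print(party_hrefs)
```

With the list of soups from the question, the same helper could be applied per page, for example [extract_hrefs(s, "/party-pictures/") for s in soups].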