Guru Guru - 20 days ago 9
Python Question

filter hyperlinks - python

I want to get all the hyperlinks on a website whose URLs contain words like:

product
service
solution
index


So I came up with this:

import requests
from bs4 import BeautifulSoup

site = 'https://www.similarweb.com'
resp = requests.get(site)
# Use the declared charset if the server sent one; otherwise let BeautifulSoup detect it
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(resp.content, from_encoding=encoding)

# Collect every href that contains one of the keywords anywhere in it
contact_links = []
for a in soup.find_all('a', href=True):
    if 'product' in a['href'] or 'service' in a['href'] or 'solution' in a['href'] or 'about' in a['href'] or 'index' in a['href']:
        contact_links.append(a['href'])

# Make relative links absolute by prefixing the site
contact_links2 = []
for i in contact_links:
    if i.startswith('http'):
        contact_links2.append(i)
    else:
        contact_links2.append(site + i)

for i in contact_links2:
    print(i)


When I run this snippet on https://www.similarweb.com, it prints several links, some of which are:

https://www.similarweb.com/apps/top/google/app-index/us/all/top-free
https://www.similarweb.com/corp/solution/travel/
https://www.similarweb.com/corp/about/
http://www.thedailybeast.com/articles/2016/10/17/drudge-limbaugh-fall-for-twitter-joke-about-postal-worker-destroying-trump-ballots.html
https://www.similarweb.com/apps/top/google/app-index/us/all/top-free


From this result, I want only the links in which one of these words
product
service
solution
index
is the last word in the URL, with nothing after it.

Expected output (considering only the 5 links above):

https://www.similarweb.com/corp/about/


How can I do that?

Answer

You should put forward slashes before and after the words you are checking in the if condition, so that only whole path segments match. It should be if '/product/' in a['href'] ... and so on.
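As a rough illustration of the slash-delimited check, here is a minimal sketch run against the sample links from the question rather than a live page. The names `SEGMENTS` and `has_segment` are made up for this example. Note that it no longer matches `app-index` or `joke-about-postal`, but it still lets `/corp/solution/travel/` through, because the keyword is a whole segment that is not the last one.

```python
# Sample hrefs copied from the output shown in the question.
hrefs = [
    'https://www.similarweb.com/apps/top/google/app-index/us/all/top-free',
    'https://www.similarweb.com/corp/solution/travel/',
    'https://www.similarweb.com/corp/about/',
    'http://www.thedailybeast.com/articles/2016/10/17/drudge-limbaugh-fall-for-twitter-joke-about-postal-worker-destroying-trump-ballots.html',
]

# Hypothetical name: the keywords wrapped in slashes so only whole path segments match.
SEGMENTS = ('/product/', '/service/', '/solution/', '/about/', '/index/')


def has_segment(href):
    """Return True if any slash-delimited keyword occurs anywhere in href."""
    return any(segment in href for segment in SEGMENTS)


segment_matches = [h for h in hrefs if has_segment(h)]
for h in segment_matches:
    print(h)
```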

As mentioned in the comments, the word should be the last one in the URL, so it is better to check a['href'].endswith('/product/'). Since endswith can take a tuple as its parameter, you can write:

if a['href'].endswith(('/product/', '/index/', '/about/', '/solution/', '/service/')):

This condition evaluates to true for every URL that ends with any of the strings in the tuple.
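To see the tuple form of endswith in action, here is a small sketch applied to the five links listed in the question instead of a live request. `ENDINGS` and `ends_with_keyword` are illustrative names, not part of the original code.

```python
# The five links shown in the question's output.
links = [
    'https://www.similarweb.com/apps/top/google/app-index/us/all/top-free',
    'https://www.similarweb.com/corp/solution/travel/',
    'https://www.similarweb.com/corp/about/',
    'http://www.thedailybeast.com/articles/2016/10/17/drudge-limbaugh-fall-for-twitter-joke-about-postal-worker-destroying-trump-ballots.html',
    'https://www.similarweb.com/apps/top/google/app-index/us/all/top-free',
]

# Hypothetical name: suffixes a link must end with to be kept.
ENDINGS = ('/product/', '/service/', '/solution/', '/about/', '/index/')


def ends_with_keyword(href):
    """Return True if href ends with any of the suffixes in ENDINGS."""
    # str.endswith accepts a tuple and succeeds if any one suffix matches.
    return href.endswith(ENDINGS)


filtered = [h for h in links if ends_with_keyword(h)]
print(filtered)  # ['https://www.similarweb.com/corp/about/']
```

A link written without the trailing slash, such as .../corp/about, would not match these suffixes. If that matters, one option is to normalize first with href.rstrip('/') + '/' before calling endswith.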
