Florian Schramm - 5 months ago
Python Question

Flexible Web Crawler

I am stuck with my web crawler at the moment.
The code so far is:

import requests
from bs4 import BeautifulSoup

def search_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.thenewboston.com/search.php?type=1&sort=pop&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'user-name'}):
            href = "https://www.thenewboston.com/" + link.get('href')
            print(href)
        page += 1

search_spider(1)


This is an example from a YouTube tutorial. Does anyone know how I have to alter the code when I don't have page endings like 1, 2, 3, ... but various identifiers like 021587, 0874519, NI875121? The base domain is always the same, but the ending is not sequential as in this example. What I need to know is how to replace str(page) with a variable that gets the page identifiers either from a .txt file on my computer (a few hundred of them) or from a list that I copy and paste into my code. Of course, Python should stop when the end of the list is reached.

As I am pretty new to Python, I don't know how to solve this issue at the moment. If you need further information, let me know. I appreciate your responses!

Flo

Answer

Well, if you have a list of pages that you want to visit rather than a range of numbers, you could do something like:

pages = ['021587', '0874519', 'NI875121']

for page in pages:
    url = 'http://example.com/some-path/' + str(page)

To read in from a file:

with open('filename.txt') as f:
    contents = f.read()

Assuming that your pages are separated by whitespace, you can then run

pages = contents.split()

Check out the documentation for str.split()
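Putting the two pieces together, here is a sketch of the full crawler. It assumes the IDs sit whitespace-separated in a file (the name page_ids.txt is a placeholder) and that the URL pattern from your question still applies:

```python
import requests
from bs4 import BeautifulSoup

def load_pages(filename):
    # Read the whole file and split on any whitespace, so the IDs
    # can be separated by spaces, tabs, or newlines.
    with open(filename) as f:
        return f.read().split()

def search_spider(pages):
    # A for loop over a list stops by itself after the last element,
    # so no extra end-of-list check is needed.
    for page in pages:
        url = 'https://www.thenewboston.com/search.php?type=1&sort=pop&page=' + page
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for link in soup.findAll('a', {'class': 'user-name'}):
            print('https://www.thenewboston.com/' + link.get('href'))

# search_spider(load_pages('page_ids.txt'))
```

Note that the IDs are kept as strings rather than converted with str(page), since values like NI875121 and the leading zero in 021587 would be lost as integers.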
