I am trying to scrape data from a few thousand pages. The code I have works fine for about 100 pages, but then slows down dramatically. I am pretty sure that my Tarzan-like code could be improved so that the speed of the web-scraping process increases. Any help would be appreciated. TIA!
Here is the simplified code:
import csv
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    raw_html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(raw_html, "lxml")

    #### scrape the page for the desired info

    # Zip the data
    output_data = zip(variable_1, variable_2, variable_3, ..., variable_10)

    # Write the observations to the CSV file
    writer = csv.writer(open('test.csv', 'a', newline='', encoding='cp850', errors='replace'))
    writer.writerows(output_data)

    url_test = base2 + url_part2
    if url_test is not None:
        url = url_test
The main bottleneck is that your code is synchronous (blocking). You don't proceed to the next URL until you finish processing the current one.
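One common fix is to overlap the downloads, for example with a thread pool, and keep the CSV writing in one place. The sketch below is only a rough illustration under a few assumptions: fetch_and_parse is a hypothetical helper standing in for your real scraping logic (here it just grabs the page title), list_url is your existing list of URLs, and max_workers=10 is an arbitrary starting point you would tune.

import csv
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup

def fetch_and_parse(url):
    # Placeholder for your real scraping: download one page and
    # return the fields you want as a tuple (one CSV row).
    raw_html = urllib.request.urlopen(url, timeout=30).read()
    soup = BeautifulSoup(raw_html, "lxml")
    title = soup.title.string if soup.title else ""
    return (url, title)

with open('test.csv', 'w', newline='', encoding='cp850', errors='replace') as csvfile:
    writer = csv.writer(csvfile)
    # Submit every download to a small pool of worker threads so the
    # network waits overlap instead of happening one after another.
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = {pool.submit(fetch_and_parse, url): url for url in list_url}
        for future in as_completed(futures):
            try:
                writer.writerow(future.result())
            except Exception as exc:
                print("Failed:", futures[future], exc)

Because every row is written from the main thread as results complete, the threads never share the csv writer. For a few thousand pages a thread pool like this is usually enough; if you want to go further, an asyncio/aiohttp rewrite is another option.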