I am trying to scrape data from a few thousand pages. The code I have works fine for about 100 pages, but then slows down dramatically. I am pretty sure that my Tarzan-like code could be improved, so that the web scraping process speeds up. Any help would be appreciated. TIA!
Here is the simplified code:
import csv
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)
list_url = ["http://www.randomsite.com"]
i = 1

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")
        #### scrape the page for the desired info
        i = i + 1
        n = str(i)
        # Zip the data
        output_data = zip(variable_1, variable_2, variable_3, ..., variable_10)
        # Write the observations to the CSV file
        writer = csv.writer(open('test.csv', 'a', newline='', encoding='cp850', errors='replace'))
        writer.writerows(output_data)
        csvfile.flush()
        # Build the URL of the next page
        base = "http://www.randomsite.com/page"
        base2 = base + n
        url_part2 = "/otherstuff"
        url_test = base2 + url_part2
        try:
            if url_test != None:
                url = url_test
                print(url)
            else:
                break
        except:
            break

csvfile.close()
The main bottleneck is that your code is synchronous (blocking). You don't proceed to the next URL until you finish processing the current one.
You need to fetch the pages asynchronously, either by switching to Scrapy, which solves this problem out of the box, or by building something yourself with, for example, grequests.
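For the build-it-yourself route, here is a minimal sketch with grequests. It assumes the page URLs can be generated up front from the pattern built in your loop; the page count, concurrency limit, and column extraction are placeholders:

import csv
import grequests  # uses gevent to send many requests concurrently
from bs4 import BeautifulSoup

# Assumed page range -- replace 1000 with the real number of pages.
urls = ["http://www.randomsite.com/page%d/otherstuff" % n for n in range(1, 1001)]

# Build the request objects first, then send them concurrently;
# `size` caps the number of simultaneous connections.
pending = (grequests.get(u) for u in urls)
responses = grequests.map(pending, size=20)

with open('test.csv', 'w', newline='', encoding='cp850', errors='replace') as csvfile:
    writer = csv.writer(csvfile)
    for response in responses:
        if response is None:  # the request failed
            continue
        soup = BeautifulSoup(response.text, "lxml")
        # scrape variable_1 ... variable_10 from soup here, then:
        # writer.writerows(zip(variable_1, variable_2, ..., variable_10))

With Scrapy, the same crawl would be a spider roughly along these lines (the spider name, selector, and page count are made up for illustration):

import scrapy

class RandomSiteSpider(scrapy.Spider):
    name = "randomsite"
    start_urls = [
        "http://www.randomsite.com/page%d/otherstuff" % n for n in range(1, 1001)
    ]

    def parse(self, response):
        # Scrapy downloads the start_urls concurrently and calls parse()
        # once for each fetched page.
        yield {
            "variable_1": response.css("h1::text").get(),  # placeholder selector
            # ... variable_2 through variable_10 ...
        }

You can run the spider with scrapy runspider spider.py -o test.csv, and Scrapy takes care of concurrency, retries, and the CSV export for you.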