anne_t - 4 months ago
Python Question

Optimization of this Python code - webscraping and output results to CSV file

I am trying to scrape data from a few thousand pages. The code I have works fine for about 100 pages, but then it slows down dramatically. I am pretty sure that my Tarzan-like code could be improved, so that the speed of the web-scraping process increases. Any help would be appreciated. TIA!

Here is the simplified code:

import csv
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

list_url = ["http://www.randomsite.com"]
i = 1

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")

        #### scrape the page for the desired info

        i = i + 1
        n = str(i)

        # Zip the data
        output_data = zip(variable_1, variable_2, variable_3, ..., variable_10)

        # Write the observations to the CSV file
        writer = csv.writer(open('test.csv', 'a', newline='', encoding='cp850', errors='replace'))
        writer.writerows(output_data)
        csvfile.flush()

        # Build the URL of the next page from the page counter
        base = "http://www.randomsite.com/page"
        base2 = base + n
        url_part2 = "/otherstuff"
        url_test = base2 + url_part2

        try:
            if url_test != None:
                url = url_test
                print(url)
            else:
                break
        except:
            break

csvfile.close()


EDIT: Thanks for all the answers, I learned quite a lot from them. I am (slowly!) learning my way around Scrapy. However, I found that the pages are available via bulk download, which will be an even better way to solve the performance issue.

Answer

The main bottleneck is that your code is synchronous (blocking). You don't proceed to the next URL until you finish processing the current one.

You need to make things asynchronous, either by switching to Scrapy, which solves this problem out of the box, or by building something yourself with, for example, grequests.
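If you want to stay close to your current script, a rough sketch with grequests could look like the following. The URL pattern, the page range, and the batch size of 20 are assumptions on my part, and the parsing/CSV part stays the placeholder it was in your question:

import csv
import grequests                      # pip install grequests
from bs4 import BeautifulSoup

# Assumed URL pattern, built from base + page number + "/otherstuff" as in the question
urls = ["http://www.randomsite.com/page{}/otherstuff".format(n) for n in range(1, 1001)]

# Prepare the requests up front, then send them concurrently, 20 at a time
pending = (grequests.get(u) for u in urls)
responses = grequests.map(pending, size=20)

with open('test.csv', 'w', newline='', encoding='cp850', errors='replace') as csvfile:
    writer = csv.writer(csvfile)
    for response in responses:
        if response is None:          # the request failed, skip this page
            continue
        soup = BeautifulSoup(response.text, "lxml")
        # ... scrape the page for the desired info, as in the question ...
        # writer.writerows(output_data)

A minimal Scrapy spider along the same lines is even shorter, because Scrapy schedules the requests concurrently and can export the scraped items to CSV for you (again, the URL range and field names are placeholders):

import scrapy

class RandomSiteSpider(scrapy.Spider):
    name = "randomsite"
    # Assumed URL pattern from the question
    start_urls = ["http://www.randomsite.com/page{}/otherstuff".format(n)
                  for n in range(1, 1001)]

    def parse(self, response):
        # ... extract variable_1 ... variable_10 with response.css() / response.xpath() ...
        yield {
            "variable_1": None,       # placeholder for the real extraction
        }

You would run it with something like scrapy runspider myspider.py -o test.csv, and Scrapy takes care of the concurrency and the CSV export.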