
Speed up web-scraping

I have a project where I have to scrape all the ratings of 50 actors/actresses, which means I have to access and scrape around 3500 web pages. This takes way longer than I expected and I'm looking for a way to speed things up. I know there are frameworks like Scrapy, but I'd like to work without any other modules. Is there a fast and easy way to rewrite my code, or would this take too much time?
My code is as follows:

import requests
import pandas as pd
from bs4 import BeautifulSoup

def getMovieRatingDf(movie_links):
    movie_name = []
    movie_rating = []
    movie_year = []

    for movie in movie_links.tolist()[0]:
        # the base URL was lost from the original post; '' is kept as posted
        request = requests.get('' + movie)
        film_soup = BeautifulSoup(request.text, 'html.parser')

        title_wrapper = film_soup.find('div', {'class': 'title_wrapper'})

        # scrape the name and year of the current film
        # (the original name-scraping line was lost; an <h1> lookup is a
        # plausible reconstruction for this page layout)
        movie_name.append(title_wrapper.find('h1').text.strip())

        year_text = title_wrapper.find('a').text
        # pad with None when no year is found so the three lists stay the same length
        movie_year.append(int(year_text) if year_text.isdigit() else None)

        try:
            movie_rating.append(float(film_soup.find('span', {'itemprop': 'ratingValue'}).text))
        except AttributeError:
            movie_rating.append(None)

    rating_df = pd.DataFrame(data={"movie name": movie_name, "movie rating": movie_rating, "movie year": movie_year})
    rating_df = rating_df.sort_values(['movie rating'], ascending=False)

    return rating_df


The main bottleneck is easy to spot by just looking at the code: it is blocking in nature. The next page is not downloaded and parsed until the current one has been fully processed.

If you want to speed things up, do it asynchronously in a non-blocking manner. This is what Scrapy offers out-of-the-box:

Here you notice one of the main advantages about Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed, it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.
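Since you'd rather avoid extra modules, a similar overlap of waiting time can be had with only the standard library via concurrent.futures.ThreadPoolExecutor. The sketch below is illustrative, not from the original answer: the fetch helper simulates network latency with a sleep so it runs offline; in the real scraper its body would be requests.get(url).text.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for `requests.get(url).text`: a short sleep
    # simulates network latency so the sketch runs without a connection.
    time.sleep(0.1)
    return f"<html>page for {url}</html>"

# hypothetical URLs, just for the demonstration
urls = [f"https://example.com/title/{i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # the 20 fetches run 10 at a time instead of one after another
    pages = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

# A serial loop would spend 20 * 0.1 s = 2 s just waiting;
# with 10 workers the waits overlap and wall time drops sharply.
```

Threads work well here because the scraper is I/O-bound: each worker spends almost all its time waiting on the network, so the GIL is not a limiting factor.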

Another option would be to switch from requests to grequests; sample code can be found here:

We can also improve a couple of things at the HTML-parsing stage:

  • switch to lxml from html.parser (requires lxml to be installed):

    film_soup = BeautifulSoup(request.text, 'lxml')
  • use SoupStrainer to parse only the relevant part of the document
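For example, a SoupStrainer restricted to the rating element discards everything else while parsing rather than after. The HTML snippet below is a made-up stand-in for a real film page:

```python
from bs4 import BeautifulSoup, SoupStrainer

# minimal stand-in for a scraped film page
html = """
<html><body>
  <div class="title_wrapper"><h1>Some Film</h1></div>
  <span itemprop="ratingValue">8.1</span>
  <div class="plot">lots of markup the scraper never looks at</div>
</body></html>
"""

# Only <span itemprop="ratingValue"> elements survive parsing;
# the surrounding divs are never added to the tree.
only_rating = SoupStrainer('span', attrs={'itemprop': 'ratingValue'})
soup = BeautifulSoup(html, 'html.parser', parse_only=only_rating)

rating = float(soup.find('span', {'itemprop': 'ratingValue'}).text)
```

On large pages this cuts both parse time and memory, since the tree holds only the elements you actually query.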