
Memory allocation error after running Python script for a long time

I have this code which scrapes usernames:

import concurrent.futures
import itertools
import requests
from bs4 import BeautifulSoup

# USERNAME_PATTERN is a compiled regex, defined elsewhere in the script,
# that matches the href of profile links.

def fetch_and_parse_names(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml")
    return (a.string for a in soup.findAll(href=USERNAME_PATTERN))

def get_names(urls):
    # Create a concurrent executor
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        # Apply the fetch-and-parse function concurrently with executor.map,
        # and join the results together
        return itertools.chain.from_iterable(executor.map(fetch_and_parse_names, urls))

def get_url(region, page):
    return 'http://lolprofile.net/leaderboards/%s/%d' % (region, page)


Then it starts putting all the names in a list like this:

urls = [get_url(region, i) for i in range(start, end + 1)]
names = (name.lower() for name in get_names(urls) if is_valid_name(name))


After an hour of running I get memory allocation errors. Obviously I know why this happens, but how can I fix it? I was thinking of getting the usernames from a single page, writing them to a file immediately, clearing the list, and repeating, but I didn't know how to implement this.

Answer

The code you use keeps all the downloaded documents in memory for two reasons:

  • you return a.string, which is not just a str but a bs4.element.NavigableString, and as such keeps a reference to its parent and ultimately to the whole document tree (see the snippet after this list).
  • you return a generator expression, which keeps its local context (in this case the soup) alive until it is fully consumed.
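
You can see the first point in a quick session; this is a minimal illustration using a tiny hypothetical document rather than one of the real pages:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/summoner/euw/foo">foo</a>', "lxml")
s = soup.a.string
print(type(s))   # <class 'bs4.element.NavigableString'>
print(s.parent)  # the <a> tag: the string still points back into the tree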

One way to fix this would be to use:

return [str(a.string) for a in soup.findAll(href=USERNAME_PATTERN)]

This way no references to the soup objects are kept, the expression is executed immediately, and a list of plain str objects is returned.
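
With that change, the write-as-you-go approach you described becomes straightforward: iterate over the names lazily and write each one to the output file immediately, so no full list of names is ever built. A minimal sketch reusing your get_names, is_valid_name, and urls (the output filename is an illustrative choice):

# Stream each validated, lowercased name straight to disk instead of
# collecting everything in a list first.
with open("names.txt", "w") as out:
    for name in get_names(urls):
        if is_valid_name(name):
            out.write(name.lower() + "\n")

Note that executor.map still collects every page's (now small) list of strings before get_names returns, so memory use is bounded by the total number of names rather than by the parsed documents; if even that is too much, process urls in smaller batches.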