Vladimir Vargas - 2 months ago
Python Question

Selenium iterating over paginator: optimization

I'm scraping a website that has a paginator. Each page shows 32 links; I grab each of them and store it in a separate file in a folder. I'm using the Firefox driver from Selenium in Python.

The procedure is basically:

get the 32 elements
for element in elements:
    open a new file and save the element
repeat


I am monitoring the time each cycle takes. It started at 4 seconds, then went up to 8 seconds (when I had saved 10000 links), and now it's taking 10 seconds, with about 13000 links saved.

Before, I was opening the same file and appending the links. That also slowed down the cycles; I guess that as the file grew, opening it and appending to it took longer on each cycle.

But right now I don't know what could be slowing the cycles down. Going to the next page always takes 3-4 seconds, so that is not the source of the problem. What could be slowing down the cycles?

This is the cycle:

while True:
    propiedades = driver.find_elements_by_xpath("//*[@class='hlisting']")
    info_propiedades = [propiedad.find_element_by_xpath(".//*[@class='propertyInfo item']")
                        for propiedad in propiedades]

    for propiedad in info_propiedades:
        try:
            link = [l.get_attribute("href") for l in propiedad.find_elements_by_xpath(".//a")]
            thelink = link[0]
            id_ = thelink.split("id-")[-1]
            with open(os.path.join(linkspath, id_), "w") as f:
                f.write(link[0])
            numlinks += 1
        except:
            print("link not found")

    siguiente = driver.find_element_by_id("paginador_pagina_{0}".format(paginador))
    siguiente.click()  # goes to the next page
    while new_active_page == old_active_page:  # checks if page has loaded completely
        try:
            new_active_page = driver.find_element_by_class_name("pagina_activa").text
        except:
            new_active_page = old_active_page
        time.sleep(0.3)
    old_active_page = new_active_page
    paginador += 1

Answer

A few suggestions...

  1. You have a lot of nested .find_elements_* calls at the start. You should be able to craft a single find that gets the elements you are looking for. From the site and your code, it looks like you are getting codes that look like "MC1595226". If you grab one of these MC codes and do a search in the HTML, you will find that code all over that particular listing. It's in the URL, it's part of the ids of a bunch of the elements, and so on. A faster way to find this code is to use the CSS selector "a[id^='btnContactResultados_']". It matches A tags whose id starts with "btnContactResultados_". The rest of that id is the MC number, e.g.

    <a id="btnContactResultados_MC1595226" ...>
    

    So with that CSS selector we find the desired elements, grab each element's id, split it on "_", and take the last part. NOTE: This is more of a code-efficiency improvement. I don't think it will make your script go super fast, but it should speed up the search portion somewhat.

  2. I would recommend writing one log file per page and writing to it only once per page. So basically you process the codes for the page and append them to a list; once all the codes for the page are processed, you write that list to the log. Writing to disk is slow... you should do it as little as possible. In the end you can write a little script that opens all those files and appends them to get the final product in one file (a sketch of such a merge script is below, after the combined example). You can also find a middle ground where you still write once per page but keep, say, 100 pages' worth in the same file before closing it and starting a new one. You'll have to play with these settings to see where you get the best performance.

If we combine the logic for these two, we get something like this...

while True:
    links = driver.find_elements_by_css_selector("a[id^='btnContactResultados_']")

    codes = []
    for link in links:
        codes.append(link.get_attribute("id").split("_")[-1])

    # one file per page, written once per page
    with open(os.path.join(linkspath, str(paginador)), "w") as f:
        f.write("\n".join(codes))
    driver.find_element_by_link_text("Siguiente »").click()  # this should work

    while new_active_page == old_active_page:  # checks if page has loaded completely
        try:
            new_active_page = driver.find_element_by_class_name("pagina_activa").text
        except:
            new_active_page = old_active_page
        time.sleep(0.3)
    old_active_page = new_active_page
    paginador += 1
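
As mentioned in point 2, once the run is done a small script can stitch the per-page files back together. Here is a minimal sketch, assuming the per-page files are plain text files sitting in linkspath with one code per line (the folder and output file names are just placeholders):

import os

linkspath = "links"        # folder the scraper wrote to (placeholder)
outpath = "all_codes.txt"  # combined output file (placeholder)

with open(outpath, "w") as out:
    # order doesn't matter much here; we just need every code in one place
    for name in sorted(os.listdir(linkspath)):
        with open(os.path.join(linkspath, name)) as f:
            out.write(f.read().rstrip("\n") + "\n")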

NOTE: Python is not my native language... I'm more of a Java/C# guy, so you may find errors, inefficiencies, or non-Pythonic code in here. You have been warned... :)
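
One more thought on the page-load check: the manual polling loop could also be expressed with Selenium's built-in WebDriverWait, which keeps retrying a condition until it becomes true (and by default ignores NoSuchElementException, much like the try/except does). A rough sketch, assuming driver and old_active_page are set up as in the code above (wait_for_page_change is just a name I made up):

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_page_change(driver, old_active_page, timeout=10):
    # Block until the active-page label differs from the previous one,
    # then return the new label so the caller can store it.
    WebDriverWait(driver, timeout).until(
        lambda d: d.find_element_by_class_name("pagina_activa").text != old_active_page
    )
    return driver.find_element_by_class_name("pagina_activa").text

With that in place, the inner while loop becomes old_active_page = wait_for_page_change(driver, old_active_page).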
