Nema Ga Nema Ga - 1 year ago 110
Python Question

Python / Celery / Selenium continuous task (avoid reopening browser)

Biggest issue I have with selenium is long re-opening time of browser(using it to scrape every few minutes). I am also using proxies and running multiple browsers with python's threading - All starting/stopping every few minutes(when new job comes)

Threading also means only 1 CPU is used and performance suffers.

I've been thinking about starting to use celery(out-of-box multi-core support) and make workers(different proxy/browser) run indefinitely(while loop) with open instances of selenium browsers waiting to get exact URLs to scrape - feed via something like redis.

Is it a good idea to be running continuous tasks like this with celery? Is there any better way to do it?

Answer Source

Its never a good idea to hold open instances of selenium indefinitely, best practice is to reopen with each task.

so for you question, in my opinion its not a good idea.

let me offer you another architecture instead.

use Docker to run your selenium machines,
basically create selenium-grid (first result in google link) using Docker

once everything is setup correctly the task will become easy, with multiprocessing send to your selenium hub all the jobs in parallel,
and they will run simultaneously on as many containers as you need.
once the job is done, you can destroy the containers and start fresh ones with the next cycle

Using docker will also allow you to scale you operation very easily