I have about a million rows of data with lat and lon attached, and more to come. Even now reading the data from SQLite file (I read it with pandas, then create a point for each row) takes a lot of time.
Now, I need to make a spatial joint over those points to get a zip code to each one, and I really want to optimise this process.
So I wonder: if there is any relatively easy way to parallelize those computations?
As it turned out, the most convenient solution in my case is to use pandas.read_SQL function with specific chunksize parameter. In this case, it returns a generator of data chunks, which can be effectively feed to the mp.Pool().map() along with the job; In this (my) case job consists of 1) reading geoboundaries, 2) spatial joint of the chunk 3) writing the chunk to the database.