Philipp_Kats - 1 month ago
Python Question

Fastest approach for geopandas (reading and spatialJoin)

I have about a million rows of data with lat and lon attached, and more to come. Even now, reading the data from an SQLite file (I read it with pandas, then create a point for each row) takes a lot of time.

Now I need to do a spatial join over those points to assign a zip code to each one, and I really want to optimise this process.

So I wonder: is there any relatively easy way to parallelize these computations?

Answer

As it turned out, the most convenient solution in my case was to use the pandas.read_sql function with a specific chunksize parameter. With chunksize set, it returns a generator of data chunks, which can be fed directly to mp.Pool().map() along with the job. In my case the job consists of 1) reading the geoboundaries, 2) spatially joining the chunk, and 3) writing the chunk to the database.