I have a shapely polygon representing the boundaries of the city of Los Angeles. I also have a set of ~1 million lat-long points in a geopandas GeoDataFrame, all of which fall within that polygon's minimum bounding box. Some of these points lie within the polygon itself, but others do not. I want to retain only those points within Los Angeles's boundaries, and due to Los Angeles's irregular shape, only approximately 1/3 of the points within its minimum bounding box are within the polygon itself.
Using Python, what is the fastest way to identify which of these points lie within the polygon, given that the points and the polygon have the same minimum bounding box?
I tried using geopandas and its r-tree spatial index:
sindex = gdf['geometry'].sindex
possible_matches_index = list(sindex.intersection(polygon.bounds))
possible_matches = gdf.iloc[possible_matches_index]
points_in_polygon = possible_matches[possible_matches.intersects(polygon)]
To summarize the problem: when the polygon's bounding box is the same as the set of points', r-tree identifies every point as a possible match, thus offering no speed up. When coupled with lots of points and a polygon with lots of vertices, the intersection process is extremely slow.
Solution: from this geopandas r-tree spatial index tutorial, use a quadrat routine to divide the polygon into sub-polygons. Then, for each sub-polygon, intersect it first with the points' r-tree index to get a small set of possible matches, then intersect those possible matches with the sub-polygon to get the set of precise matches. This offers speed-ups of approx 100x.