RedHack RedHack - 7 months ago 32
SQL Question

Performing Multiple Avg Queries in PostgreSQL

I am having some trouble with the following operation: I have a database table called entries which (for all intents and purposes) has 3 columns in addition to the primary key:

value
,
gps_lat
,
gps_long
all of which are doubles.

My ultimate goal is to be able to define a grid, say 100x100 with an interval and bounded by a given latitude and longitude value and for each square of the grid I want to compute the average value of all the points in that grid square. I am having a lot of trouble doing this efficiently however.

Part of the problem is that I want to set this up either as a stored procedure or as a query that I can generate with a piece of code and reuse later because every time I run the query the grid will not be the same (so caching is pretty much out the question).

My first attempt at doing this was to define the following function:

CREATE OR REPLACE FUNCTION gridSquareAverageValue (double precision
, double precision, double precision, double precision)
RETURNS double precision as $avgValue$
declare
avgValue double precision;
BEGIN
SELECT AVG(value) into avgValue FROM entries
WHERE gps_lat BETWEEN $1 AND $2 AND gps_long BETWEEN $3 AND $4;
RETURN avgValue;
END;
$avgValue$ LANGUAGE plpgsql;


This function works very well and does exactly what I need it to do, except that it does it for only one grid square. Running the function for a 100x100 grid involves 10,000 individual queries and is therefore inordinately slow.

The next attempt was this:

WITH Grid(lat_offset,long_offset) AS
(SELECT *
FROM generate_series(1,10) lat_offset
CROSS JOIN generate_series(1,10) long_offset)
SELECT AVG(value)
FROM Grid
JOIN entries
ON entries.gps_lat BETWEEN 41.79604807005128 + (0.000247908106797 * Grid.lat_offset)
AND 41.82083888073101 + (0.002479081067973 * (Grid.lat_offset + 1))
AND entries.gps_long BETWEEN -72.2759199142456 + (0.000527858734131 * Grid.long_offset)
AND -72.22313404083252 + (0.005278587341308 * (Grid.long_offset + 1))
GROUP BY lat_offset,long_offset;


This somehow turned out to be even worse. I attempted to generate a sequence of offsets and then join it with the table of entries forcing each entry into a box that is calculated with the math you can see above. This is impossibly slow. I tried to get it to just output the values without computing averages and it took even longer than running 10k individual queries.

The above is also probably the most promising approach because all I really want to do after generating a cartesian join of two series is to use them in a simple function, but I cannot figure out any decent way to do that except what you see above =/

Finally I tried this:

# $1 height $2 width $3 lat start $4 lat interval $5 long start $6 long interval
CREATE OR REPLACE FUNCTION gridAverageValue (integer, integer, double precision, double precision, double precision, double precision)
RETURNS TABLE (avg double precision) as $restbl$
BEGIN
SELECT * INTO $restbl$ FROM entries WHERE 1 = 2;
FOR lat_offset IN 0..$1 LOOP
FOR long_offset IN 0..$2 LOOP
INSERT INTO restbl
SELECT AVG(value)
FROM entries
WHERE gps_lat
BETWEEN $3 + ($4 * lat_offset) AND $3 + ($4 * (lat_offset + 1))
AND gps_long
BETWEEN $5 + ($6 * long_offset) AND $5 + ($6 * (long_offset + 1));
END LOOP;
END LOOP;
RETURN QUERY SELECT * FROM restbl;
END;
$restbl$ LANGUAGE plpgsql;


This final attempt is getting a bunch of syntax errors and I honestly do not know where it is coming from. The general idea is to generate a bunch of queries that ultimately compute the values I care about.

If anyone has a suggestion about how to fix any of the approaches above, that would be greatly appreciated.

Answer

Only populated cells

Use the built-in function width_bucket() to get only grid cells with one or more matching rows in entries:

For a grid of 100 x 100 cells in the outer frame of box(point(_lat_start, _long_start), point(_lat_end, _long_end)):

SELECT width_bucket(gps_lat , _lat_start , _lat_end , 100) AS grid_lat
     , width_bucket(gps_long, _long_start, _long_end, 100) AS grid_long
     , avg(value) AS avg_val
FROM   entries
WHERE  point(gps_lat, gps_long) <@ box(point(_lat_start, _long_start)
                                     , point(_lat_end  , _long_end))
GROUP  BY 1,2
ORDER  BY 1,2;

<@ is the "contained in" operator for geometric types.

It's easy to wrap this into a function and parameterize the outer box and the number of grid cells.

A multicolumn GiST expression index will help performance if only a small fraction of rows lies within the outer box. You'll need to install the btree_gist module first, once per database:

Then:

CREATE INDEX entries_point_idx ON entries
USING gist (point(gps_lat, gps_long), value);

Adding value to the index only makes sense if you can get an index-only scan out of this in Postgres 9.2+.

If you are reading large parts of the table anyway, you don't need an index and it might be cheaper to run simple a between x and y checks in the WHERE clause.

This is assuming a flat earth (which may be good enough for your purpose). If you want to be precise, you will have to dig deeper into PostGIS.

All cells in the grid

To get all cells use LEFT JOIN to a pre-generated grid like you already tried:

SELECT grid_lat, grid_long, g.avg_val  -- or use COALESCE
FROM        generate_series(1,100) grid_lat
CROSS  JOIN generate_series(1,100) grid_long
LEFT   JOIN (<query from above>) g USING (grid_lat, grid_long)

Related: