Matthew Gunn Matthew Gunn - 6 months ago 10
SQL Question

Create index to speedup join on (1) ids match and (2) dates within a range

What index should I create for Table B to optimize a join on (1) ids matching and (2) date falls within some range.

Update: the issue is I have hundreds of thousands of dates but only 25 portfolios... condition (2) does all the work in cutting down the size of the resulting join...

More specifically, imagine I have a Table A:

portfolio id | begin_date | end_date
1 | 20150101 | 20150130

and a Table B:

portfolio_id | date | daily_ret
1 20150102 .00001
1 20150103 -.01023

The query I would like to not totally suck is:

FROM tablea a
LEFT JOIN tableb b on (a.portfolio_id = b.portfolio_id) and
(a.begin_date < and
( <= a.end_date)

I think I want some kinda index like:
CREATE INDEX tableb_idx ON tableb (portfolio_id, date)

But I'm a total n00b with SQL indexes... I imagine there's an issue with the
(a.begin_date < and ( <= a.end_date)

Update with more specific info:

Column | Type | Modifiers
yyyymmdd | integer | not null
size_index | integer | not null
beme_index | integer | not null
ret | double precision |
"portfolio_25_size_beme_pkey" PRIMARY KEY, btree (yyyymmdd, size_index, beme_index)

Exact query:

explain analyze
SELECT exp(sum(log(1 + f_sbm.ret))) - 1 as size_beme_ret, count(*) T,
permno, period_start, period_end
FROM eb2.deleteme x
JOIN fama_french.portfolio_25_size_beme f_sbm
ON f_sbm.size_index = x.me_idx5 AND f_sbm.beme_index = x.beme_idx5 AND < f_sbm.yyyymmdd and f_sbm.yyyymmdd <=
GROUP BY permno, period_start, period_end

Results (note I limited eb2.deleteme to 1000 rows because all 180000 wouldn't finish...)

HashAggregate (cost=83599.59..83602.59 rows=200 width=24) (actual time=11209.030..11209.455 rows=949 loops=1)
Group Key: x.permno, x.period_start, x.period_end
-> Merge Join (cost=78736.24..83566.19 rows=1909 width=24) (actual time=1183.677..11184.172 rows=19492 loops=1)
Merge Cond: ((f_sbm.size_index = x.me_idx5) AND (f_sbm.beme_index = x.beme_idx5))
Join Filter: (( < f_sbm.yyyymmdd) AND (f_sbm.yyyymmdd <=
Rows Removed by Join Filter: 22659494
-> Sort (cost=78655.60..80136.72 rows=592450 width=20) (actual time=1182.834..1647.155 rows=545055 loops=1)
Sort Key: f_sbm.size_index, f_sbm.beme_index
Sort Method: external merge Disk: 17328kB
-> Seq Scan on portfolio_25_size_beme f_sbm (cost=0.00..9698.50 rows=592450 width=20) (actual time=0.014..155.770 rows=592450 loops=1)
-> Sort (cost=80.64..83.54 rows=1160 width=32) (actual time=0.797..1618.708 rows=22678987 loops=1)
Sort Key: x.me_idx5, x.beme_idx5
Sort Method: quicksort Memory: 102kB
-> Seq Scan on deleteme x (cost=0.00..21.60 rows=1160 width=32) (actual time=0.020..0.357 rows=1000 loops=1)
Planning time: 0.574 ms
Execution time: 11228.745 ms

Maybe this is a hack, but forcing the join on DATE first then pruning the portfolio mismatches later runs 11x faster...

with stage1 as(
FROM eb2.deleteme x
JOIN fama_french.portfolio_25_size_beme f_sbm
ON < f_sbm.yyyymmdd and f_sbm.yyyymmdd <=
select exp(sum(log(1+ret))) - 1 as size_beme_ret, count(*) T,
permno, period_start, period_end
FROM stage1 f_sbm
WHERE size_index = me_idx5 and beme_index = beme_idx5
GROUP BY permno, period_start, period_end


Always create indexes that are intended to being compared through equality first, and through greater/lower than (or between) last (and just that final one, since including any further columns will generally not assist with any query of this nature).

Adding the following index will speed-up your query considerably:

CREATE INDEX ON fama_french.portfolio_25_size_beme (size_index, beme_index, yyyymmdd);