wbaeckelmans wbaeckelmans - 21 days ago 7
SQL Question

Performance tuning of a query processing millions of rows

I've got a query processing 2 tables with over 61 million records each.


  • WB_YH_BCUPDATE_FULL_BASE: table containing customers and all the months they were active. (from 2014 till now)




CUSTOMERNUMBER | CAR MONTH




99999 | 201401

99999 | 201402

99999 | 201403

....



  • WB_YH_BCUPDATE_MATCH_MONTH: table containing customers and all the months that were active with + a fictive extra field containing the CAR_MONTH + 6 months.




CUSTOMERNUMBER | CAR_MONTH | MATCH_MONTH_6




99999 | 201401 | 201407

99999 | 201402 | 201408

99999 | 201403 | 201409

...


Now I want to check for all customers and all of their corresponding CAR_MONTHS if they were still active (= they appear in the table) after 6 months. For this I need to use the field I created, being MATCH_MONTH_6.

I'm using the following query:

select distinct a.CUSTOMERNUMBER
, a.CAR_MONTH
, b.MATCH_MONTH_6
, CASE WHEN b.CUSTOMERNUMBER is null then 0
ELSE 1
END FL_MATCH_6
from WB_YH_BCUPDATE_FULL_BASE a left join WB_YH_BCUPDATE_MATCH_MONTH b
on a.CUSTOMERNUMBER = b.CUSTOMERNUMBER
and a.CAR_MONTH = b.CAR_MONTH
and b.MATCH_MONTH_6 in (
select CAR_MONTH
from WB_YH_BCUPDATE_FULL_BASE
where customernumber = a.customernumber
);


The performance of my query is really poor as you can see from the following Execution Plan:

Plan Hash Value : 3376431373

-----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 25897713 | 673340538 | 371846479 | 02:56:04 |
| 1 | HASH UNIQUE | | 25897713 | 673340538 | 371846479 | 02:56:04 |
| 2 | NESTED LOOPS OUTER | | 61874441 | 1608735466 | 371674345 | 02:55:59 |
| 3 | TABLE ACCESS STORAGE FULL | WB_YH_BCUPDATE_FULL_BASE | 61874441 | 742493292 | 3225 | 00:00:01 |
| 4 | VIEW | | 1 | 14 | 6 | 00:00:01 |
| 5 | NESTED LOOPS | | 1 | 31 | 6 | 00:00:01 |
| 6 | NESTED LOOPS | | 24 | 31 | 6 | 00:00:01 |
| * 7 | TABLE ACCESS BY INDEX ROWID | WB_YH_BCUPDATE_MATCH_MONTH | 1 | 19 | 3 | 00:00:01 |
| * 8 | INDEX RANGE SCAN | WB_YH_BCUPDATE_MATCH_MONTH_IND | 24 | | 2 | 00:00:01 |
| * 9 | INDEX RANGE SCAN | WB_YH_BCUPDATE_FULL_BASE_IND | 24 | | 2 | 00:00:01 |
| * 10 | TABLE ACCESS BY INDEX ROWID | WB_YH_BCUPDATE_FULL_BASE | 1 | 12 | 3 | 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 7 - filter("A"."CAR_MONTH"="B"."CAR_MONTH")
* 8 - access("A"."CUSTOMERNUMBER"="B"."CUSTOMERNUMBER")
* 9 - access("CUSTOMERNUMBER"="A"."CUSTOMERNUMBER")
* 10 - filter("CAR_MONTH"=TO_NUMBER("B"."MATCH_MONTH_6"))


Do you guys have any idea on how I can optimize this query or how I can rewrite this query to be more performant?

Kind regards,

SQB SQB
Answer
SELECT
   a.customernumber,
   a.car_month,
   b.car_month AS match_month_6,
   CASE
      WHEN b.customernumber IS NULL
      THEN 0
      END 1
   END AS fl_match_6
FROM WB_YH_BCUPDATE_MATCH_MONTH a
LEFT JOIN WB_YH_BCUPDATE_MATCH_MONTH b
   ON (a.customernumber = b.Customernumber AND a.match_month_6 = b.car_month);

Since you say that WB_YH_BCUPDATE_MATCH_MONTH contains the same data as WB_YH_BCUPDATE_FULL_BASE, but with one extra column, we can use the former and ignore the latter.

We now left join it with itself. Of course on the customer number, but also, we join the date + 6 months on the date. If the customer was active 6 months later, we will find an entry; if not, we won't.

To completely duplicate the results of your query, we select get our data for match_month_6 from the left joined table, since it was NULL if we couldn't get a match in your original query as well.

You should put indexes on both month fields as well, since we join on those too.


Note that this doesn't guarantee that the customer was active in the months in between. I a customer was active in January and in July, they will be returned by this query.