Sam Sam - 2 months ago 5
SQL Question

How to improve massive UPDATE from SELECT performance?

I want update a table with data from another table. My current approach looks like this:

UPDATE items t1
SET (name, manufacturer_id, price) =
(SELECT
t2.item_name,
t2.item_manufacturer_id,
t2.item_price
FROM staged_items t2
WHERE t2.upgrade_version = 1234
AND t2.operation = 'modification'
AND t1.id = t2.item_id)
WHERE EXISTS (
SELECT 1
FROM staged_items t2
WHERE t2.upgrade_version = 1234
AND t2.operation = 'modification'
AND t1.id = t2.item_id)


The problem is that it's taking too long and I don't know how to get feedback or an estimation about how long will it take. I have been waiting for about 3 hours without results and there are not pending operations on
v$session_longops
.

I would like to know how to improve the performance of the update or maybe another approach that is more efficient. Also, I'd like to know how to check the status of the running query.




Some remarks




  • The
    staged_items
    tables is expected to contain at least 300 million entries and grow to billions of entries in the long run.


    • Only "few entries" (from 0 to ~1 million) are expected to match the conditions (
      t2.upgrade_version = somenumber
      and
      t2.operation = 'modification'
      ) every time I run the
      UPDATE
      .

    • Currently I'm testing with a
      staged_items
      tables of 1 million entries, all of them matching the conditions.


  • The
    items
    table is expected to have about 20 million entries and stay in that order of magnitude in the long run.


    • Currently I'm testing with about ~1 million entries, most of them matched by the
      UPDATE
      .


  • Currently I have no indexes but I'm thinking of creating them on
    items.id
    ,
    staged_items.item_id
    ,
    staged_items.upgrade_version
    and
    staged_items.operation



    • I'm not quite sure of which indexes would be really needed and useful


  • The
    UPDATE
    is not expected to modify any indexed column, anyway this could change in the future so I'd appreaciate comments about how that scenario would have an impact on the proposed solution.





Edit: Finally, instead of checking the status of the running query, I opted for paginating the query into multiple queries using rownum. This allows me to wait for every (smaller) query and check and approximate percentage of completeness.

Considering that, my original query would look something like this:

UPDATE items t1
SET (name, manufacturer_id, price) =
(
SELECT
t2.item_name,
t2.item_manufacturer_id,
t2.item_price
FROM
(
SELECT /*+ FIRST_ROWS(n) */
a.*,
ROWNUM rnum
FROM
(
SELECT *
FROM staged_items t2
WHERE t2.upgrade_version = 1234
AND t2.operation = 'modification'
AND t1.id = t2.item_id
ORDER BY t2.id
) a
WHERE ROWNUM <= MAX_ROW_TO_FETCH
)
WHERE rnum >= :MIN_ROW_TO_FETCH
)
WHERE EXISTS (
SELECT 1
FROM staged_items t2
WHERE t2.upgrade_version = 1234
AND t2.operation = 'modification'
AND t1.id = t2.item_id)


(Based on the Pagination with ROWNUM section of this link)

Anyway, for the outer
WHERE
, I used Gordon Linoff's solution.

Answer

For your query, you want an index on:

staged_items(item_id, upgrade_version, operation)

I am also thinking that you could rewrite the outer where clause as:

WHERE t1.id IN (SELECT t2.item_id
                FROM staged_items t2
                WHERE t2.upgrade_version = 1234 AND t2.operation = 'modification'
               )

Then, you want indexes on staged_items(upgrade_version, operation, item_id) and items(id). Note that the order of the keys in the index is important and you still want the first index for the correlated subquery to get the values.