Heathcliff Heathcliff - 5 months ago 13
SQL Question

Performance issues with UNION of large tables

I have seven large tables, that can be storing between 100 to 1 million rows at any time. I'll call them

LargeTable1
,
LargeTable2
,
LargeTable3
,
LargeTable4
...
LargeTable7
. These tables are mostly static: there are no updates nor new inserts. They change only once every two weeks or once a month, when they are truncated and a new batch of registers are inserted in each.

All these tables have three fields in common:
Headquarter
,
Country
and
File
.
Headquarter
and
Country
are numbers in the format '000', though in two of these tables they are parsed as
int
due to some other system necessities.

I have another, much smaller table called
Headquarters
with the information of each headquarter. This table has very few entries. At most 1000, actually.

Now, I need to create a stored procedure that returns all those headquarters that appear in the large tables but are either absent in the
Headquarters
table or have been deleted (this table is deleted logically: it has a
DeletionDate
field to check this).

This is the query I've tried:

CREATE PROCEDURE deletedHeadquarters
AS
BEGIN
DECLARE @headquartersFiles TABLE
(
hq int,
countryFile varchar(MAX)
);

SET NOCOUNT ON

INSERT INTO @headquartersFiles
SELECT headquarter, CONCAT(country, ' (', file, ')')
FROM
(
SELECT DISTINCT CONVERT(int, headquarter) as headquarter,
CONVERT(int, country) as country,
file
FROM LargeTable1
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable2
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable3
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable4
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable5
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable6
UNION
SELECT DISTINCT headquarter,
country,
file
FROM LargeTable7
) TC

SELECT RIGHT('000' + CAST(st.headquarter AS VARCHAR(3)), 3) as headquarter,
MAX(s.deletionDate) as deletionDate,
STUFF
(
(SELECT DISTINCT ', ' + st2.countryFile
FROM @headquartersFiles st2
WHERE st2.headquarter = st.headquarter
FOR XML PATH('')),
1,
1,
''
) countryFile
FROM @headquartersFiles as st
LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
WHERE s.headquarter IS NULL
OR s.deletionDate IS NOT NULL
GROUP BY st.headquarter

END


This sp's performance isn't good enough for our application. It currently takes around 50 seconds to complete, with the following total rows for each table (just to give you an idea about the sizes):


  • LargeTable1: 1516666 rows

  • LargeTable2: 645740 rows

  • LargeTable3: 1950121 rows

  • LargeTable4: 779336 rows

  • LargeTable5: 1100999 rows

  • LargeTable6: 16499 rows

  • LargeTable7: 24454 rows



What can I do to improve performance? I've tried to do the following, with no much difference:


  • Inserting into the local table by batches, excluding those headquarters I've already inserted and then updating the countryFile field for those that are repeated

  • Creating a view for that UNION query

  • Creating indexes for the LargeTables for the headquarter field



I've also thought about inserting these missing headquarters in a permanent table after the
LargeTables
change, but the
Headquarters
table can change more often, and I would like not having to change its module to keep these things tidy and updated. But if it's the best possible alternative, I'd go for it.

Thanks

Answer

Do the filtering at each step. But first, modify the headquarters table so it has the right type for what you need . . . along with an index:

alter table headquarters add headquarter_int as (cast(headquarter as int));
create index idx_headquarters_int on headquarters(headquarters_int);

SELECT DISTINCT headquarter, country, file
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = lt5.headquarter and s.deletiondate is not null
                 );

Then, you want an index on LargeTable5(headquarter, country, file).

This should take less than 5 seconds to run. If so, then construct the full query, being sure that the types in the correlated subquery match and that you have the right index on the full table. Use union to remove duplicates between the tables.

Comments