rahularyansharma rahularyansharma - 7 months ago 12
SQL Question

How to find duplicate between staging and main table with millions records

I am using

mysql
and I want to check the duplicate rows between two tables. I used
join
but it's taking too much time as there are millions of records (for example staging table has 800k records while the main table has around 100 million records).

The query I am using is as follow :

INSERT INTO
tblspduplicate
SELECT
T2.SP,T1.FileImportedDate,T2.XYZFileName
FROM
tblspmaster T1
INNER JOIN
tblstaging T2
ON
T1.SP=T2.SP;

CREATE TABLE `tblspmaster` (
`CSN` bigint(20) NOT NULL AUTO_INCREMENT,
`SP` varchar(50) NOT NULL,
`FileImportedDate` date NOT NULL,
`XYZFileName` varchar(50) NOT NULL,
`XYZBatch` varchar(50) NOT NULL,
`BatchProcessedDate` date NOT NULL,
`ExpiryDate` date NOT NULL,
`Region` varchar(50) NOT NULL,
`FCCity` varchar(50) NOT NULL,
`VendorID` int(11) NOT NULL,
`LocationID` int(11) NOT NULL,
PRIMARY KEY (`CSN`)
) ENGINE=InnoDB AUTO_INCREMENT=7484570 DEFAULT CHARSET=latin1;


CREATE TABLE `tblstaging` (
`CSN` bigint(20) NOT NULL AUTO_INCREMENT,
`SP` varchar(50) NOT NULL,
`FileImportedDate` date NOT NULL,
`XYZFileName` varchar(50) NOT NULL,
`XYZBatch` varchar(50) NOT NULL,
`BatchProcessedDate` date NOT NULL,
`ExpiryDate` date NOT NULL,
`Region` varchar(50) NOT NULL,
`FCCity` varchar(50) NOT NULL,
`VendorID` int(11) NOT NULL,
`LocationID` int(11) NOT NULL,
PRIMARY KEY (`CSN`),
KEY `ind_staging` (`SP`)
) ENGINE=InnoDB AUTO_INCREMENT=851956 DEFAULT CHARSET=latin1;

Answer

Do you have an index on tblspmaster.SP? That would be the most important thing. With such an index, you query should be ok. First, though, test the query on using just the select.

Another problem that you could have are duplicate matches. This might be significantly multiplying the data that you have. You can test for this by doing:

select sp, count(*) as cnt
from tblmaster
group by sp
having cnt > 1
order by cnt desc;

and

select sp, count(*) as cnt
from tblstaging
having cnt > 1
order by cnt desc;

EDIT:

Based on the table structure, I repeat the suggestion to create an index to tblMaster(SP). You may also need to drop the index on tblStaging(SP). Alternatively, you can force the use of the master index instead of the staging index by using an index hint (the simple syntax is described here).

Also, I suggest that you run the above counts to see the risk of getting an unexpectedly large number of rows because of multiplicity in the SP values.