Guus Guus - 1 month ago 11
MySQL Question

How to optimize MySQL queries with Joins and subqueries for large datasets (millions of rows)

I am attempting to join four large tables (35-200 million rows) of the international patent database (PATSTAT) into a top 15 of most cited patents matching a few requirements.


A first table (

t9
) lists citations from one group (family) of applications to another.
Another table (
t1
) basically links everything together as is contains both family and application id's, and filing years
Tables
t2
and
tls209_appln_ipc
have are used to identify the
appln_id
's to include.


The code that I have finally come to is the following:

SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9
LEFT JOIN
(SELECT
t1.appln_id, t1.docdb_family_id from tls201_appln t1
LEFT JOIN tls204_appln_prior t2 on t1.appln_id=t2.appln_id
WHERE
t1.appln_filing_year BETWEEN 2010 AND 2015
AND
t2.appln_id IS NULL
AND
t1.appln_id IN (SELECT distinct appln_id from tls209_appln_ipc where ipc_subclass_symbol in ("A61K", "C07K", "A61P", "Cl2N", "C07D", "Cl2P", "C07H", "C12Q", "C07J"))) t3 ON t9.cited_docdb_family_id=t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15


The problem is that the query ran in the online web-based interface of PATSTAT does not converge before my session times out. Are there ways to improve the efficiency of this query?

-Edit-

The
tls_209_appln_ipc
contains 195 million rows of
appln_id
coupled with
ipc_subclass_symbol
.
appln_id
's may occur zero or more times in this table. In my query I only need the
docdb_family_id
s if any of their linked
appln_id
s is linked to any of the
ipc_subclass_symbol
s I listed.

Answer Source

With help from the earlier answers, final code that gave the result I was looking for:

SELECT t9.cited_docdb_family_id, t99.cited AS cited, t1.appln_id, t1.appln_nr_epodoc
        FROM docdb_family_citation t9 
INNER JOIN (SELECT cited_docdb_family_id, count(cited_docdb_family_id) as cited FROM docdb_family_citation GROUP BY cited_docdb_family_id) t99 
ON t9.cited_docdb_family_id = t99.cited_docdb_family_id
LEFT JOIN 
     tls201_appln t1
     on t9.cited_docdb_family_id = t1.docdb_family_id 
     WHERE
        t1.appln_filing_year BETWEEN 2010 AND 2015 and
        exists (select 1 from tls209_appln_ipc t209
                where t209.appln_id = t1.appln_id
                  and    t209.ipc_subclass_symbol in ("A61K", "C07K", "A61P", "Cl2N", "C07D", "Cl2P", "C07H", "C12Q", "C07J")
               ) and
        not exists (select 1 from tls204_appln_prior t2
                    where t1.appln_id = t2.appln_id 
                   )
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;`

note that the join with subquery t99 is used to get the correct cited count