Gusman Gusman - 6 months ago 15
SQL Question

What is faster, subselect or distinct (MySQL)?

Let me describe my doubt. I have system where I have three entities, Doctor, Patient and Appointment. An appointment has the doctor's id and patient Id.

I need now to retrieve all the patients which have an appointment with a concrete doctor, and I'm not sure what will be faster, a distinct or a subselect for the id's, these are the queries:

using distinct->

SELECT DISTINCT patient.id, patient.name, patient.surname FROM
appointment INNER JOIN patient ON patient.id = appointment.patientid WHERE
appointment.doctorid = @id;


using subselect->

SELECT patient.id, patient.name, patient.surname FROM patient
WHERE patient.id IN (select appointment.patientid FROM appointment
WHERE appointment.doctorid = @id);


Not sure it this will affect, the system will run on a MariaDB cluster.

Answer

As with any performance question, you should test on your data and your hardware. The suspect problem in the first version the DISTINCT after the JOIN; this can require a lot of extra processing.

You can write the second as:

SELECT p.id, p.name, p.surname
FROM patient p
WHERE p.id IN (select a.patientid FROM appointment a WHERE a.doctorid = @id);

For this, you want an index on appointment(doctorid, patientid).

You might consider this version as well:

select p.id, p.name, p.surname
from patient p join
     (select distinct appointment.patientid
      from appointment
      where appointment.doctorid = @id
     ) a
     on p.id = a.patientid;

This specifically wants the same index. This pushes the distinct so it is only operating on a single table, meaning that MySQL may be able to use the index for that operation.

And this one:

SELECT p.id, p.name, p.surname
FROM patient p
WHERE EXISTS (select 1
              from appointment a
              where a.doctorid = @id and a.patientid = p.id
             );

This query wants an index on appointment(patientid, doctorid). It requires a full table scan of patient with a fast index lookup on each row. That could often be the fastest approach, depending on the data.

Note: which query performs better may also depends on the size and distribution of the data.