Gusman Gusman - 10 months ago 42
SQL Question

What is faster, subselect or distinct (MySQL)?

Let me describe my doubt. I have system where I have three entities, Doctor, Patient and Appointment. An appointment has the doctor's id and patient Id.

I need now to retrieve all the patients which have an appointment with a concrete doctor, and I'm not sure what will be faster, a distinct or a subselect for the id's, these are the queries:

using distinct->

SELECT DISTINCT,, patient.surname FROM
appointment INNER JOIN patient ON = appointment.patientid WHERE
appointment.doctorid = @id;

using subselect->

SELECT,, patient.surname FROM patient
WHERE IN (select appointment.patientid FROM appointment
WHERE appointment.doctorid = @id);

Not sure it this will affect, the system will run on a MariaDB cluster.


As with any performance question, you should test on your data and your hardware. The suspect problem in the first version the DISTINCT after the JOIN; this can require a lot of extra processing.

You can write the second as:

SELECT,, p.surname
FROM patient p
WHERE IN (select a.patientid FROM appointment a WHERE a.doctorid = @id);

For this, you want an index on appointment(doctorid, patientid).

You might consider this version as well:

select,, p.surname
from patient p join
     (select distinct appointment.patientid
      from appointment
      where appointment.doctorid = @id
     ) a
     on = a.patientid;

This specifically wants the same index. This pushes the distinct so it is only operating on a single table, meaning that MySQL may be able to use the index for that operation.

And this one:

SELECT,, p.surname
FROM patient p
WHERE EXISTS (select 1
              from appointment a
              where a.doctorid = @id and a.patientid =

This query wants an index on appointment(patientid, doctorid). It requires a full table scan of patient with a fast index lookup on each row. That could often be the fastest approach, depending on the data.

Note: which query performs better may also depends on the size and distribution of the data.