user3186023 user3186023 - 7 months ago 29
SQL Question

Evaluating a correlated subquery in SQL

I'm having trouble getting my head around evaluating correlated subqueries. An example is using a correlated subquery in SELECT so that GROUP BY isn't needed:

Consider the relations:

Movies : Title, Director Length
Schedule : Theatre, Title


I have the following query

SELECT S.Theater, MAX(M.Length)
FROM Movies M JOIN Schedule S ON M.Title=S.Title
GROUP BY S.Theater


Which gets the longest film that every theatre is playing. This is the same query without using GROUP BY:

SELECT DISTINCT S.theater,
(SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title=S.Title)
FROM Schedule S


but I don't understand how it quite works.

I'd appreciate if anybody could give me an example of how correlated subqueries are evaluated.

Thanks :)

Answer

From a conceptual standpoint, imagine that the database is going through each row of the result without the subquery:

SELECT DISTINCT S.Theater, S.Title
FROM Schedule S

And then, for each one of those, running the subquery for you:

SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title = (whatever S.Title was)

And placing that in as the value. Really, it's not (conceptually) that different from using a function:

SELECT DISTINCT S.Theater, SUBSTRING(S.Title, 1, 5)
FROM Schedule S

It's just that this function performs a query against another table, instead.

I do say conceptually, though. The database may be optimizing the correlated query into something more like a join. Whatever it does internally matters for performance, but doesn't matter as much for understanding the concept.

But, it may not return the results you're expecting. Consider the following data (sorry sqlfiddle seems to be erroring atm):

CREATE TABLE Movies (
  Title varchar(255),
  Length int(10) unsigned,
  PRIMARY KEY (Title)
);

CREATE TABLE Schedule (
  Title varchar(255),
  Theater varchar(255),
  PRIMARY KEY (Theater, Title)
);

INSERT INTO Movies
VALUES ('Star Wars', 121);
INSERT INTO Movies
VALUES ('Minions', 91);
INSERT INTO Movies
VALUES ('Up', 96);

INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Minions', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Up', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 6');

And then this query:

SELECT DISTINCT
  S.Theater,
  (
    SELECT MAX(M.Length)
    FROM Movies M
    WHERE M.Title = S.Title
  ) AS MaxLength
FROM Schedule S;

You'll get this result:

+----------+-----------+
| Theater  | MaxLength |
+----------+-----------+
| Cinema 6 |       121 |
| Cinema 8 |        91 |
| Cinema 8 |       121 |
| Cinema 8 |        96 |
+----------+-----------+

As you can see, it's not a replacement for GROUP BY (and you can still use GROUP BY), it's just running the subquery for each row. DISTINCT will only remove duplicates from the result. It's not giving the "greatest length" per theater anymore, it's just giving each unique movie length associated with the theater name.

PS: You might likely use an ID column of some sort to identify movies, rather than using the Title in the join. This way, if by chance the name of the movie has to be amended, it only needs to change in one place, not all over Schedule too. Plus, it's faster to join on an ID number than a string.