Scott Scott - 27 days ago 8
SQL Question

How can I add additional columns to a SELECT query after an aggregation without including them in GROUP BY?

Here is the table I'm querying:

EmployeeId, Salary, Date
8, 500, 2016-11-02
8, 500, 2016-09-21
8, 500, 2016-10-18
9, 500, 2016-10-18
9, 500, 2016-09-21
9, 500, 2016-11-02
10, 1000, 2016-11-02
10, 700, 2016-09-21


I want to select a list of employees and what their salaries were at the latest date available for that particular employee. In the case of the sample data this is the same date (11-02) but that is NOT necessarily the case. My query:

SELECT EmployeeId, Salary, MAX(Date)
FROM table
GROUP BY EmployeeId. Salary
ORDER BY EmployeeId


Which is returning:

8, 500, 2016-11-02
9, 500, 2016-11-02
10, 700, 2016-09-21
10, 1000, 2016-11-02


My expected output is:

8, 500, 2016-11-02
9, 500, 2016-11-02
10, 1000, 2016-11-02


If I aggregate without the salary then I get the expected results, but I need the salary to be visible. Is there a way to include it post-aggregation?

Answer

Before window functions (because not all RDBMS versions support them) you would do this with an inline view.

First consider your needs:

  • You need a data set which consists of the employee's max date and the employee ID.
  • you need the raw data off the table which you couldn't get by grouping...

To achieve the first, we do generate an inline view (aliased "B" below). Then we join back to the base set to retrieve the additional desired information; allowing the inner join to eliminate the undesired records; and negating the need for the group by on the outer query entirely.

SELECT A.EmployeeId, A.Salary, A.Date
FROM table A
INNER JOIN (SELECT max(date) mDate, EmployeeID 
            FROM table
            GROUP BY EmployeeID) B
  on A.EmployeeID = B.EmployeeID
 and A.Date = B.MDate
ORDER BY EmployeeId

When working with RDBMS think of data in terms of sets and how those sets could be filtered and joined back together to achieve the desired results. Set based processing on most (not all) RDBMS activities will be the most efficient. With the introduction of Window Functions (Aka Analytical Functions) generations of sets can be done without the subquery; making them powerful features in data analysis; albeit challenging to get your head around initially.

So what does: ROW_NUMBER() OVER (PARTITION BY EmployeeId ORDER BY Date DESC) RN exactly do?

It assigns a row number starting at 1 for EACH new employee encountered (partition) in the order of date descending (so the newest date for each employee will always have a RN of 1!) However, the 1 isn't available to the query in the where clause as the SQL order of operations generates the select values LAST (so the where clause has already executed). This means to make the window function achieve the desired results and limit by RN=1, you have to wrap it in a subquery and then you can access the RN = 1 thereby only returning records for employees with the newest date. Since the RDBMS is able to execute the generation of the table independently of the analytical function it can process them both simultaneously and provide a response very quickly.