HtmHell HtmHell - 5 months ago 13
SQL Question

Group by day from timestamp with multiple tables

I have two tables with

timestamp
columns.

I want to group the result by days. For example: from
2014/06/10
to
2014/06/13
. Doesn't matter if there are records between those dates, I want it to group it by days.

My tables:

profits
:

enter image description here

profits_referrals
:

enter image description here

The result I want:

╔═════════════╦══════════════╦══════╗
║ date ║ ......... ║ .. ║
╠═════════════╬══════════════╬══════╣
║ 2014/06/10 ║ ......... ║ .. ║
║ 2014/06/11 ║ ......... ║ .. ║
║ 2014/06/12 ║ ......... ║ .. ║
║ 2014/06/13 ║ ......... ║ .. ║
╚═════════════╩══════════════╩══════╝


A note: also if there are no records from some date, I still want it to show that date, and the
amount
will be
0
.

What I did so far:

SELECT SUM(`profits`.`amount`) AS `profAmount`,
COUNT(`profits`.`amount`) AS `profCount`,

SUM(`profits_referrals`.`amount`) AS `refAmount`,
COUNT(`profits_referrals`.`amount`) AS `refCount`,

DATE(FROM_UNIXTIME(`profits`.`date`)) AS `profDate`,
DATE(FROM_UNIXTIME(`profits_referrals`.`date`)) AS `refDate`
FROM `profits`
JOIN `profits_referrals`
ON `profits`.`userid` = `profits_referrals`.`referral`
WHERE `profits`.`userid` = " . (int)$user->id . "
GROUP BY DATE(FROM_UNIXTIME(`profits`.`date`)), DATE(FROM_UNIXTIME(`profits_referrals`.`date`))
ORDER BY `profDate`
DESC


Result: (I did some PHP code to display it)

Date Sales Referrals Total
2014-04-28 2 / $7.35 USD 2 / $1.4 USD $8.75 USD
2014-04-28 2 / $7.35 USD 2 / $1.4 USD $8.75 USD
2014-03-27 1 / $2.10 USD 1 / $0.7 USD $2.80 USD
2014-03-27 1 / $2.10 USD 1 / $0.7 USD $2.80 USD
2014-03-25 3 / $6.30 USD 3 / $2.0 USD $8.40 USD
2014-03-25 3 / $6.30 USD 3 / $2.0 USD $8.40 USD

Answer Source

First off, you're joining the two tables by the id, but not by the date, which is generating a temp result set that looks like this:

t1.date        t1.amt   t2.date        t2.amt
'2014-04-28'   1        '2014-03-27'   5
'2014-04-28'   1        '2014-03-25'   6

This is because joins actually say "hey, for every row that matches this condition, put the left and right sides together". The more times rows on the right side match one row on the left side, the more times the left side is repeated. It should be pretty obvious that accurate results will be getting thrown out the window. In order to have (at most) a 1-to-1 relationship, we need to do the aggregate before the join, usually by the use of a subquery;

SELECT ....
FROM {base_table} b
JOIN (SELECT {joinColumn}, {AGGREGATE_FUNCTION}
      FROM {other_table}
      GROUP BY {joinColumn}) o
  ON o.{joinColumn} = b.{joinColumn}

Unfortunately, your dataset doesn't have a canonical "base table" - you aren't guaranteed rows in either table, so something like FULL OUTER JOIN (or the MySQL equivalent) isn't going to work (ie, you would be missing dates if neither table had them). We need to create our own base table.

You need to create what's known as a Calendar Table (this particular one is for SQL Server, but would be adaptable). These are one of the most useful dimension/analysis tables you can make or use. The actual contents are up to you, but for this type of query it fulfills the role of the {base_table}. It's also going to help us get index-access (potentially) for the grouping.

First, the revised subquery:

SELECT Calendar.calendar_date, 
       COUNT(Profits) AS profCount, COALESCE(SUM(Profits.amt), 0) AS profAmount
FROM Calendar
LEFT JOIN Profits
       ON Profits.userId = {desiredUserId}
          AND Profits.date >= UNIX_TIMESTAMP(Calendar.calendar_date)
          AND Profits.date < UNIX_TIMESTAMP(Calendar.calendar_date + INTERVAL 1 DAY)
WHERE Calendar.calendar_date >= {rangeStart}
      AND Calendar.calendar_date < {rangeEnd}

So.
Some things to note here:

  • I've put descriptions for the parameter values. In reality, you should be using parameterized queries, or you risk SQL Injection. Your current query was safe because of the cast to int, but it's better to not have to worry about it.
  • Always query positive continuous-range types (everything but an integer count) with an inclusive lower-bound, >=, and an exclusive upper-bound, < (the post is written for SQL Server and timestamps therein, but the problem applies everywhere. Remember that the MySQL DATETIME/TIMESTAMP types have a user-specifiable number of fractional seconds!). For negative ranges, reverse the conditions.
  • The use of the functions on Calendar.calendar_date (assume this is just a standard DATE type) will prevent the use of indices on the join... from the Calendar side. From the Profits side, it'll have nice singe values to search against. Presumably there are multiple rows in Profits for each calendar day, meaning that's the slow side of the join.

In any case, this will input a temp result set that looks like this:

cal_date       Count   Amount
'2014-06-10'   1       5
'2014-06-11'   0       0
'2014-06-12'   1       -9.5
'2014-06-13'   99      99999999.1

Success; single row per day, pre-aggregated amounts. We can now combine this with the query for the other table (Profits_Referrals), and get our results:

SELECT Profits.Calendar_date,
       Profits.profAmount, Profits.profCount,
       Referrals.refAmount, Referrals.refCount
FROM (SELECT Calendar.calendar_date,
             COUNT(Profits) AS profCount, COALESCE(SUM(Profits.amt), 0) AS profAmount
             FROM Calendar
             LEFT JOIN Profits
                    ON Profits.userId = ?
                       AND Profits.date >= UNIX_TIMESTAMP(Calendar.calendar_date)
                       AND Profits.date < UNIX_TIMESTAMP(Calendar.calendar_date + INTERVAL 1 DAY)
             WHERE Calendar.calendar_date >= ?
                   AND Calendar.calendar_date < ?) Profits
JOIN (SELECT Calendar.calendar_date,
             COUNT(Refferals) AS refCount, COALESCE(SUM(Refferals.amt), 0) AS refAmount
             FROM Calendar
             LEFT JOIN Profits_Referrals Refferals
                    ON Refferals.userId = ?
                       AND Refferals.date >= UNIX_TIMESTAMP(Calendar.calendar_date)
                       AND Refferals.date < UNIX_TIMESTAMP(Calendar.calendar_date + INTERVAL 1 DAY)
             WHERE Calendar.calendar_date >= ?
                   AND Calendar.calendar_date < ?) Refferals
  ON Referrals.calendar_date = Profits.calendar_date
ORDER BY Profits.Calendar_Date

(Remember that the individual subqueries are outputting a row for every date, and everything is already aggregated by that date - we can just join based on the date. This also means we don't need a separate {base_table} here)