I have a database of a service that helps people sell things. If they fail a delivery of a sale, they get penalised. I am trying to extract the number of active listings each user had when a particular penalty was applied.
I have the equivalent to the following tables(and relevant fields):
SELECT penalty.id AS 'penalty_id',
listing_history.id AS 'record_id',
JOIN penalty ON penalty.user_id = user.id
JOIN transaction ON transaction.id = penalty.transaction_id
JOIN listing_history ON listing_history.listing_id = listing.id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
GROUP BY s1.penalty_id
UPDATE - As the comments on this answer indicate that changing the table structure isn't an option, here are more details on some queries you could use with the existing structure.
Note that I made a couple changes to the query before even modifying the logic.
listing.id; I've replaced it.
So the original query is rewritten as
SELECT penalty.id AS 'penalty_id' , COUNT(listing_history.id) 'active_listings' FROM user JOIN penalty ON penalty.user_id = user.id JOIN transaction ON transaction.id = penalty.transaction_id JOIN listing_history ON listing_history.listing_id = transaction.listing_id WHERE listing_history.date_created < penalty.date_created AND listing_history.status = 0 GROUP BY penalty.id
Now the most natural way, in my opinion, to write the corrected timeline constraint is with a
NOT EXISTS condition that filters out all but the most recent
listing_history record for a given
id. This does require thinking about some edge cases:
created_date is really a timestamp, then this may not matter much (if at all); if it's really a date, it might be a bigger issue. Since your original query required that the listing history be created before the penalty, I'll continue in that style; but it's still ambiguous how to handle the case where two history records with matching status have the same date. You may need to adjust the date comparisons to get the desired behavior.
SELECT penalty.id AS 'penalty_id' , COUNT(DISTINCT listing_history.id) 'active_listings' FROM user JOIN penalty ON penalty.user_id = user.id JOIN transaction ON transaction.id = penalty.transaction_id JOIN listing_history ON listing_history.listing_id = transaction.listing_id WHERE listing_history.date_created < penalty.date_created AND listing_history.status = 0 AND NOT EXISTS (SELECT 1 FROM listing_history h2 WHERE listing_history.date_created < h2.date_created AND h2.date_created < penalty.date_created AND h2.id = listing_history.id) GROUP BY penalty.id
Note that I switched from
COUNT(DISTINCT ...); this helps with some edge cases where two active records for the same listing might be counted.
If you change the date comparisons to use
<= instead of
< - or, equivalently, if you use
BETWEEN to combine the date comparisons - then you'd want to add
AND h2.status != 0 (or
AND h2.status <> 0, depending on your database) to the subquery so that two concurrent ACTIVE records don't cancel each other out.
There are several equivalent ways to write this, and unfortunately its the kind of query that doesn't always cooperate with a database query optimizer so some trial and error may be necessary to make it run well with large data volumes. Hopefully that gives enough insight into the intended logic that you could work out some equivalents if need be. You could consider using
NOT IN instead of
NOT EXISTS; or you could use an outer join to a second instance of
LISTING_HISTORY... There are probably others I'm not thinking of off hand.
I don't know that we're in a position to sign off on a general statement that the query is, or is not, "correct". If there's a specific question about whether a query will include/exclude a record in a specific situation (or why it does/doesn't, or how to modify it so it won't/will), those might get more complete answers.
I can say that there are a couple likely issues:
The only glaring logic issue has to do with timeline management, which is something that causes a lot of trouble with SQL. The issue is, while your query demonstrates that the listing was active at some point before the penalty creation date, it doesn't demonstrate that the listing was still active on the penalty creation date. Consider
PENALTY id transaction date 1 10 2016-02-01 TRANSACTION id listing_id 10 100 LISTING_HISTORY listing_id status date 100 0 2016-01-01 100 1 2016-01-15
The joins would create a single record, and the count for penalty 1 would include listing 100 even though its status had changed to something other than 0 before the penalty was created.
This is hard - but not impossible - to fix with your existing table structure. You could add a
NOT EXISTS condition looking for another LISTING_HISTORY record matching the ID with a date between the first LISTING_HISTORY date and the PENALTY date, for one.
It would be more efficient to add an end date to the LISTING_HISTORY date, but that may not be so easy depending on how the data is maintained.
The second potential issue is the
COUNT(RECORD_ID). This may not do what you mean - what
COUNT(x) may intuitively seem like it should do, is what
COUNT(DISTINCT RECORD_ID) actually does. As written, if the join produces two matches with the same LISTING_HISTORY.ID value - i.e. the listing became active at two different times before the penalty - the listing would be counted twice.