SQL Question

Need help wrapping head around joins

I have a database of a service that helps people sell things. If they fail a delivery of a sale, they get penalised. I am trying to extract the number of active listings each user had when a particular penalty was applied.

I have the equivalent to the following tables(and relevant fields):


  1. user (id)

  2. listing (id, user_id, status)

  3. transaction (listing_id, seller_id)

  4. listing_history (id, listing_status, date_created)

  5. penalty (id, transaction_id, user_id, date_created)



The listing_history table saves an entry every time a listing is modified, saving a record of what the new state of the listing is.

My goal is to end with a result table with the field: penalty_id, and number of active listings the penalised user had when the penalty was applied.

So far I have the following:

SELECT s1.penalty_id,
COUNT(s1.record_id) 'active_listings'
FROM (
SELECT penalty.id AS 'penalty_id',
listing_history.id AS 'record_id',
FROM user
JOIN penalty ON penalty.user_id = user.id
JOIN transaction ON transaction.id = penalty.transaction_id
JOIN listing_history ON listing_history.listing_id = listing.id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
) s1
GROUP BY s1.penalty_id


Status = 0 means that the listing is active (or that the listing was active at the time the record was created). I got results similar to what I expected, but I fear I may be missing something or may be doing the JOINs wrong. Would this have your approval? (apart from the obvious non-use of aliases, for clarity problems).

Answer

UPDATE - As the comments on this answer indicate that changing the table structure isn't an option, here are more details on some queries you could use with the existing structure.

Note that I made a couple changes to the query before even modifying the logic.

  • As viki888 pointed out, there was a problem reference to listing.id; I've replaced it.
  • There was no real need for a subquery in the original query; I've simplified it out.

So the original query is rewritten as

SELECT penalty.id AS 'penalty_id'
     , COUNT(listing_history.id) 'active_listings'
  FROM      user
       JOIN penalty
         ON penalty.user_id = user.id
       JOIN transaction
         ON transaction.id = penalty.transaction_id
       JOIN listing_history 
         ON listing_history.listing_id = transaction.listing_id
 WHERE listing_history.date_created < penalty.date_created
   AND listing_history.status = 0
 GROUP BY penalty.id

Now the most natural way, in my opinion, to write the corrected timeline constraint is with a NOT EXISTS condition that filters out all but the most recent listing_history record for a given id. This does require thinking about some edge cases:

  • Could two listing history records have the same create date? If so, how do you decide which happened first?
  • If a listing history record is created on the same day as the penalty, which is treated as happening first?

If the created_date is really a timestamp, then this may not matter much (if at all); if it's really a date, it might be a bigger issue. Since your original query required that the listing history be created before the penalty, I'll continue in that style; but it's still ambiguous how to handle the case where two history records with matching status have the same date. You may need to adjust the date comparisons to get the desired behavior.

SELECT penalty.id AS 'penalty_id'
     , COUNT(DISTINCT listing_history.id) 'active_listings'
  FROM      user
       JOIN penalty
         ON penalty.user_id = user.id
       JOIN transaction
         ON transaction.id = penalty.transaction_id
       JOIN listing_history 
         ON listing_history.listing_id = transaction.listing_id
 WHERE listing_history.date_created < penalty.date_created
   AND listing_history.status = 0
   AND NOT EXISTS (SELECT 1
                     FROM listing_history h2
                    WHERE listing_history.date_created < h2.date_created
                      AND h2.date_created < penalty.date_created
                      AND h2.id = listing_history.id)
 GROUP BY penalty.id

Note that I switched from COUNT(...) to COUNT(DISTINCT ...); this helps with some edge cases where two active records for the same listing might be counted.

If you change the date comparisons to use <= instead of < - or, equivalently, if you use BETWEEN to combine the date comparisons - then you'd want to add AND h2.status != 0 (or AND h2.status <> 0, depending on your database) to the subquery so that two concurrent ACTIVE records don't cancel each other out.

There are several equivalent ways to write this, and unfortunately its the kind of query that doesn't always cooperate with a database query optimizer so some trial and error may be necessary to make it run well with large data volumes. Hopefully that gives enough insight into the intended logic that you could work out some equivalents if need be. You could consider using NOT IN instead of NOT EXISTS; or you could use an outer join to a second instance of LISTING_HISTORY... There are probably others I'm not thinking of off hand.


I don't know that we're in a position to sign off on a general statement that the query is, or is not, "correct". If there's a specific question about whether a query will include/exclude a record in a specific situation (or why it does/doesn't, or how to modify it so it won't/will), those might get more complete answers.

I can say that there are a couple likely issues:

The only glaring logic issue has to do with timeline management, which is something that causes a lot of trouble with SQL. The issue is, while your query demonstrates that the listing was active at some point before the penalty creation date, it doesn't demonstrate that the listing was still active on the penalty creation date. Consider

PENALTY
id             transaction    date
1              10             2016-02-01

TRANSACTION
id             listing_id
10             100

LISTING_HISTORY
listing_id     status         date
100            0              2016-01-01
100            1              2016-01-15

The joins would create a single record, and the count for penalty 1 would include listing 100 even though its status had changed to something other than 0 before the penalty was created.

This is hard - but not impossible - to fix with your existing table structure. You could add a NOT EXISTS condition looking for another LISTING_HISTORY record matching the ID with a date between the first LISTING_HISTORY date and the PENALTY date, for one.

It would be more efficient to add an end date to the LISTING_HISTORY date, but that may not be so easy depending on how the data is maintained.

The second potential issue is the COUNT(RECORD_ID). This may not do what you mean - what COUNT(x) may intuitively seem like it should do, is what COUNT(DISTINCT RECORD_ID) actually does. As written, if the join produces two matches with the same LISTING_HISTORY.ID value - i.e. the listing became active at two different times before the penalty - the listing would be counted twice.