gh0st gh0st - 21 days ago 6
SQL Question

Filtering results based on most common occurrences of values in other columns (excluding outlying values)

Consider the following table and pseudo query:

Distinct Customers
WHERE
Most common PaymentMethod = 'CreditCard'
AND
Most common DeliveryService = '24hr'

Customer TransID PaymentMethod DeliveryService
-----------------------------------------------------
Susan 1 CreditCard 24hr
Susan 2 CreditCard 24hr
Susan 3 Cash 24hr
John 4 CreditCard 48hr
John 5 CreditCard 48hr
Diane 6 CreditCard 24hr
Steve 7 Paypal 24hr
Steve 8 CreditCard 48hr
Steve 9 Paypal 24hr


Should return (2) records:

Customer
---------
Susan
Diane


Another way to look at it is that I want to exclude minority cases, i.e.:
I don't want to return 'Steve', because although he used a creditcard once, he doesn't generally do so, I only care about the majority behaviour, across multiple columns.

In reality, there are more columns (10s) that need the same principle applied so I'm after a technique that will scale at least that far searching 100ks of records.

Answer

One method uses window functions and aggregation:

with cp as (
     select customerid, paymentmethod, count(*) as cnt,
            rank() over (partition by customerid order by count(*) desc) as seqnum
     from t
     group by customerid, paymentmethod
    ),
    cd as (
     select customerid, deliveryservice, count(*) as cnt
            rank() over (partition by customerid over by count(*) desc) as seqnum
     from t
     group by customerid, deliveryservice
    )
select cp.customerid
from cp join
     cd
     on cp.customerid = cd.customerid
where (cp.seqnum = 1 and cp.PaymentMethod = 'CreditCard') and
      (cd.seqnum = 1 and cd.DeliveryService = '24hr');

Because you need the ranks along two different dimensions, I think you need two subqueries (or the equivalent).