Daniel Richter Daniel Richter - 9 days ago 5
SQL Question

calculating percentages for repeat rate cohorts

I wrote a repeat rate query that gives me cohort repeat rate data in the following format:

cohort_join_day | repeat_day | repeat_users
11/15/16 | 0 | 10000
11/15/16 | 1 | 6000
11/15/16 | 2 | 3000


repeat_day 0 represents the total cohort size for that day

I'm trying to skip an excel step and add a forth column with daily repeat rate percentages like so:

cohort_join_day | repeat_day | repeat_users | repeat_percentage
11/15/16 | 0 | 10000 | 100%
11/15/16 | 1 | 6000 | 60%
11/15/16 | 2 | 3000 | 30%


The calculation for this row should be pretty simple e.g.:
day 1 cohort repeat rate on day 6 = (day 1 cohort repeat rate on day 6) / (day 1 cohort repeat rate on day 0)

(day 1 cohort repeat rate on day 0) represents the total size of the cohort

What's the best way to accomplish this?

Here's the daily repeat rate query I wrote:

SELECT
to_char(cohort_join_day, 'YYYY-MM-DD') AS cohort_join_day,
EXTRACT(DAY FROM (current_day - cohort_join_day)) AS repeat_day,
COUNT(DISTINCT unique_id) AS repeat_users
FROM
(
SELECT
auu.unique_id,
date_trunc('day', auu.ds) AS current_day,
date_trunc('day', fsb.ds) AS cohort_join_day
FROM rust.a_unique_users AS auu
JOIN mobile.first_seen_byos AS fsb
ON fsb.unique_id = auu.unique_id
WHERE
auu.os_type = 'iphone_native_app'
AND fsb.ds >= '2016-11-01'
) AS uniques_by_day
WHERE
cohort_join_day <= current_day
GROUP BY
cohort_join_day,
repeat_day;

Answer
SELECT
    *
    ,(repeat_users * 100.0) /
       MAX(CASE WHEN repeat_day = 0 THEN repeat_users END) OVER () as repeat_percentage
FROM
     Table

Conditional Aggregation and Window Functions makes this much easier

And if you are trying to do this calucation for every day then PARTITION the window function by cohor_join_day:

SELECT
    *
    ,(repeat_users * 100.0) /
       MAX(CASE WHEN repeat_day = 0 THEN repeat_users END) OVER (PARTITION BY cohort_join_day) as repeat_percentage
FROM
    Table

MAX(column) OVER () would simply provide the MAX value in the column accross the entire data set.

MAX(column) OVER (PARTITION BY column2) will provide the MAX value in that column for the matching column2 value. You can think of PARTITION BY similar to GROUP BY.

replacing column with a case expression allows you to do conditional aggregation. So for example when you only want the repeat_users when repeat_day = 0 a case expression saying that it will mean it will only return 1 value per partition and ignore the other values because they will be null.

So if you wanted to do the same thing in a straight query without the window function you would do something like this:

SELECT
    t.*
    ,(t.repeat_users * 100.0) / (SELECT t2.repeat_users
             FROM
                Table t2
             WHERE
                t.cohort_join_day = t2.cohort_join_day
                AND t2.repeat_day = 0)     as repeat_percentage
FROM
    Table t

And to show you how to do it with Juan Carlo's method when you have multiple days involved you could do it like so:

WITH cte AS (
    SELECT
       cohort_join_day
       ,repeat_users
    FROM
       @Table
    WHERE
       repeat_day = 0
)

SELECT
    t.*
    ,(t.repeat_users * 100.0) / c.repeat_users as repeat_percentage
FROM
    Table t
    CROSS JOIN cte c
WHERE
    t.cohort_join_day = c.cohort_join_day

If you ever want a running total try something like

SUM(column) OVER (PARTITION BY column2 ORDER BY column3)

definitely get familiar with window functions they are life savers these days.