Daniel Richter - 1 year ago 54
SQL Question

# calculating percentages for repeat rate cohorts

I wrote a repeat rate query that gives me cohort repeat rate data in the following format:

``````cohort_join_day | repeat_day | repeat_users
11/15/16        |      0     | 10000
11/15/16        |      1     | 6000
11/15/16        |      2     | 3000
``````

repeat_day 0 represents the total cohort size for that day

I'm trying to skip an excel step and add a forth column with daily repeat rate percentages like so:

``````cohort_join_day | repeat_day | repeat_users | repeat_percentage
11/15/16        |      0     | 10000        | 100%
11/15/16        |      1     | 6000         |  60%
11/15/16        |      2     | 3000         |  30%
``````

The calculation for this row should be pretty simple e.g.:
day 1 cohort repeat rate on day 6 = (day 1 cohort repeat rate on day 6) / (day 1 cohort repeat rate on day 0)

(day 1 cohort repeat rate on day 0) represents the total size of the cohort

What's the best way to accomplish this?

Here's the daily repeat rate query I wrote:

``````    SELECT
to_char(cohort_join_day, 'YYYY-MM-DD')            AS cohort_join_day,
EXTRACT(DAY FROM (current_day - cohort_join_day)) AS repeat_day,
COUNT(DISTINCT unique_id)                         AS repeat_users
FROM
(
SELECT
auu.unique_id,
date_trunc('day', auu.ds) AS current_day,
date_trunc('day', fsb.ds) AS cohort_join_day
FROM rust.a_unique_users AS auu
JOIN mobile.first_seen_byos AS fsb
ON fsb.unique_id = auu.unique_id
WHERE
auu.os_type = 'iphone_native_app'
AND fsb.ds >= '2016-11-01'
) AS uniques_by_day
WHERE
cohort_join_day <= current_day
GROUP BY
cohort_join_day,
repeat_day;
``````

``````SELECT
*
,(repeat_users * 100.0) /
MAX(CASE WHEN repeat_day = 0 THEN repeat_users END) OVER () as repeat_percentage
FROM
Table
``````

Conditional Aggregation and Window Functions makes this much easier

And if you are trying to do this calucation for every day then PARTITION the window function by cohor_join_day:

``````SELECT
*
,(repeat_users * 100.0) /
MAX(CASE WHEN repeat_day = 0 THEN repeat_users END) OVER (PARTITION BY cohort_join_day) as repeat_percentage
FROM
Table
``````

`MAX(column) OVER ()` would simply provide the `MAX` value in the `column` accross the entire data set.

`MAX(column) OVER (PARTITION BY column2)` will provide the `MAX` value in that column for the matching `column2` value. You can think of `PARTITION BY` similar to `GROUP BY`.

replacing `column` with a case expression allows you to do conditional aggregation. So for example when you only want the `repeat_users when repeat_day = 0` a case expression saying that it will mean it will only return 1 value per partition and ignore the other values because they will be null.

So if you wanted to do the same thing in a straight query without the window function you would do something like this:

``````SELECT
t.*
,(t.repeat_users * 100.0) / (SELECT t2.repeat_users
FROM
Table t2
WHERE
t.cohort_join_day = t2.cohort_join_day
AND t2.repeat_day = 0)     as repeat_percentage
FROM
Table t
``````

And to show you how to do it with Juan Carlo's method when you have multiple days involved you could do it like so:

``````WITH cte AS (
SELECT
cohort_join_day
,repeat_users
FROM
@Table
WHERE
repeat_day = 0
)

SELECT
t.*
,(t.repeat_users * 100.0) / c.repeat_users as repeat_percentage
FROM
Table t
CROSS JOIN cte c
WHERE
t.cohort_join_day = c.cohort_join_day
``````

If you ever want a running total try something like

``````SUM(column) OVER (PARTITION BY column2 ORDER BY column3)
``````

definitely get familiar with window functions they are life savers these days.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download