idanshmu idanshmu - 2 months ago 8
SQL Question

Consider a single record, per id, in a group by

Background



I have an SQL table with 4 columns:


  • id - varchar(50)
  • g1 - varchar(50)
  • g2 - varchar(50)
  • datetime - timestamp



I have this query:

SELECT g1,
COUNT(DISTINCT id),
SUM(COUNT(DISTINCT id)) OVER () AS total,
(CAST(COUNT(DISTINCT id) AS float) / SUM(COUNT(DISTINCT id)) OVER ()) AS share
FROM my_table
WHERE g2 = 'start'
GROUP BY 1
order by share desc


This query was built to answer: what is the distribution of g1 values across users?


Problem



Each id may have multiple records in the table. I want to consider only the earliest one, that is, the record with the minimum datetime value.

Example



Table



id  g1  g2     datetime
x1  a   start  2016-01-19 21:01:22
x1  c   start  2016-01-19 21:01:21
x2  b   start  2016-01-19 09:03:42
x1  a   start  2016-01-18 13:56:45


Actual query results



g1  count  total  share
a   2      4      0.5
b   1      4      0.25
c   1      4      0.25


We have 4 records, but I only want to consider two of them:

x2 b start 2016-01-19 09:03:42
x1 a start 2016-01-18 13:56:45


which are the earliest records per id.

Expected query results



g1  count  total  share
a   1      2      0.5
b   1      2      0.5


Question



How do I consider only the earliest record per id in the GROUP BY?

Answer

You are querying all the data from my_table, although you only want the earliest record for each id. Since one id can have several rows, I assume id is not a unique key in the table.

I suggest you define a view (or an inline view) that returns only the earliest record for each id, and run your query against that view instead of against my_table.

The view could be defined like this; it contains only the earliest row for each id:

select * from my_table a
where a.datetime = (select min(z.datetime) from my_table z where a.id = z.id)
  and a.g2 = 'start'
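To check what this filter keeps, here is a minimal sketch that loads the four example rows from the question into an in-memory SQLite database and runs the same correlated subquery. SQLite, the TEXT column types, and the Python harness are my assumptions for the sake of a runnable test, not something from the question; your database may behave differently with real timestamp columns.

```python
import sqlite3

# Assumption: in-memory SQLite with TEXT columns stands in for the real table.
# 'YYYY-MM-DD HH:MM:SS' strings sort the same way as the timestamps they represent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id TEXT, g1 TEXT, g2 TEXT, datetime TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        ("x1", "a", "start", "2016-01-19 21:01:22"),
        ("x1", "c", "start", "2016-01-19 21:01:21"),
        ("x2", "b", "start", "2016-01-19 09:03:42"),
        ("x1", "a", "start", "2016-01-18 13:56:45"),
    ],
)

# Keep a row only if its datetime equals the minimum datetime for the same id.
earliest_rows = conn.execute(
    """
    SELECT a.id, a.g1, a.g2, a.datetime
    FROM my_table a
    WHERE a.datetime = (SELECT MIN(z.datetime) FROM my_table z WHERE a.id = z.id)
      AND a.g2 = 'start'
    ORDER BY a.id
    """
).fetchall()

for row in earliest_rows:
    print(row)
```

On the example data this leaves one row for x1 and one for x2. Note that the minimum is taken over all rows of an id, so if an id's earliest row had a g2 other than 'start', that id would drop out entirely; adding z.g2 = 'start' inside the subquery would change that, if that is what you need.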

You can define that as a view or use it directly inline as in:

SELECT g1,
       COUNT(DISTINCT id),
       SUM(COUNT(DISTINCT id)) OVER () AS total,
       (CAST(COUNT(DISTINCT id) AS float) / SUM(COUNT(DISTINCT id)) OVER ()) AS share
FROM (SELECT a.id, a.g1, a.g2, a.datetime
      FROM my_table a
      WHERE a.datetime = (SELECT MIN(z.datetime) FROM my_table z WHERE a.id = z.id)
        AND a.g2 = 'start') AS earliest
GROUP BY 1
order by share desc
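As a sanity check against the expected results in the question, here is a rough end-to-end sketch on the example data. It assumes an in-memory SQLite database (version 3.25 or later, for window functions) driven from Python, and it computes the per-group count in a subquery (the aliases earliest and per_group are just names I picked) before applying the window function, so it is a slightly restructured variant of the query above rather than a literal copy.

```python
import sqlite3

# Assumption: in-memory SQLite (3.25+ for window functions) with TEXT columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id TEXT, g1 TEXT, g2 TEXT, datetime TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        ("x1", "a", "start", "2016-01-19 21:01:22"),
        ("x1", "c", "start", "2016-01-19 21:01:21"),
        ("x2", "b", "start", "2016-01-19 09:03:42"),
        ("x1", "a", "start", "2016-01-18 13:56:45"),
    ],
)

shares = conn.execute(
    """
    SELECT g1,
           n AS cnt,
           SUM(n) OVER () AS total,
           CAST(n AS REAL) / SUM(n) OVER () AS share
    FROM (
        -- one count per g1, computed over the earliest row of each id only
        SELECT earliest.g1 AS g1, COUNT(DISTINCT earliest.id) AS n
        FROM (
            SELECT a.id, a.g1
            FROM my_table a
            WHERE a.datetime = (SELECT MIN(z.datetime) FROM my_table z WHERE a.id = z.id)
              AND a.g2 = 'start'
        ) AS earliest
        GROUP BY earliest.g1
    ) AS per_group
    ORDER BY share DESC, g1
    """
).fetchall()

for row in shares:
    print(row)
```

With the four example rows this gives a and b one id each out of a total of 2, which matches the expected table in the question. The extra g1 in the ORDER BY is only there to make the order of tied shares deterministic.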