Dick McManus - 3 days ago
SQL Question

running multiple counts and joining the results

I am trying to do some spot checks on my data by counting the rows in each partition, the number of 'users' I see for each day, and the number of values I see for each day.

An earlier version of the query below worked, but I must have changed something without realizing it:

src as
(
select partition_date_column, count(*) as src_row_count
from database.table
where partition_date_column > '2016-01-01'
group by partition_date_column
)

,
pst as
(
select timestamp_pst as datevalue, count(*) as timestamp_row_count
from database.table
where partition_date_column > '2016-01-01'
and timestamp_pst between '2016-01-01' and '2017-07-01'
group by timestamp_pst
),

users as
(
select timestamp_pst as user_datevalue, count(*) as user_count
from database.table
where partition_date_column > '2016-01-01'
and timestamp_pst between '2016-01-01' and '2017-07-01'
and filter_column in ('filterA', 'filterB')
group by timestamp_pst
)

select datevalue as dayval, src_row_count, timestamp_row_count, user_count
from pst
left join src
on datevalue = partition_date_column
left join users
on datevalue = user_datevalue
order by dayval;


I can't tell what mistake I made that causes Hive to reject this query. I also suspect there is a better way to get these three counts, even though one of them is grouped on a different column.

Answer

I figured it out. I was missing the "WITH" keyword at the beginning of the query. Without it, Hive does not recognize the named subqueries (src, pst, users) that the final select statement joins together.

with src as
(
  select partition_date_column, count(*) as src_row_count
  from database.table
  where partition_date_column > '2016-01-01'
  group by partition_date_column
),

pst as
(
  select timestamp_pst as datevalue, count(*) as timestamp_row_count
  from database.table
  where partition_date_column > '2016-01-01'
  and timestamp_pst between '2016-01-01' and '2017-07-01'
  group by timestamp_pst
),

users as
(
  select timestamp_pst as user_datevalue, count(*) as user_count
  from database.table
  where partition_date_column > '2016-01-01'
  and timestamp_pst between '2016-01-01' and '2017-07-01'
  and filter_column in ('filterA', 'filterB')
  group by timestamp_pst
)

select datevalue as dayval, src_row_count, timestamp_row_count, user_count
from pst
left join src
on datevalue = partition_date_column
left join users
on datevalue = user_datevalue
order by dayval;
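
On the second part of the question (a better way to get the three counts): the pst and users subqueries read the same rows and group on the same column, so they can probably be merged into one pass with a conditional aggregate. This is only a sketch under the question's placeholders. `database.table` and the column names come from the question, `day_counts` is a name I made up for illustration, and I have not run it against the real data.

```sql
-- Sketch only: database.table and the column names are the placeholders
-- from the question; day_counts is a hypothetical CTE name.
with src as
(
  select partition_date_column, count(*) as src_row_count
  from database.table
  where partition_date_column > '2016-01-01'
  group by partition_date_column
),

day_counts as
(
  -- One scan produces both per-day counts:
  -- all rows, and only the rows matching the filter.
  select timestamp_pst as datevalue,
         count(*) as timestamp_row_count,
         sum(case when filter_column in ('filterA', 'filterB') then 1 else 0 end) as user_count
  from database.table
  where partition_date_column > '2016-01-01'
  and timestamp_pst between '2016-01-01' and '2017-07-01'
  group by timestamp_pst
)

select datevalue as dayval, src_row_count, timestamp_row_count, user_count
from day_counts
left join src
on datevalue = partition_date_column
order by dayval;
```

One difference from the three-subquery version: a day with no 'filterA' or 'filterB' rows shows a user_count of 0 here, where the left join to users would have returned NULL. The src count still needs its own subquery because it is grouped on partition_date_column rather than timestamp_pst.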