Joaquin Joaquin - 1 month ago 16
SQL Question

Aggregate one entry into multiple groups

I am using Google's BigQuery to extract data from the GDELT database, to extract the average tone of the news containing a given country. I have a working SQL query

SELECT date,
Avg(Float(tone)) tone
FROM (
SELECT integer(regexp_replace(String(date), r'\d{6}$', '')) date,
regexp_replace(v2tone, r',.*', '') tone,
FROM [gdelt-bq:gdeltv2.gkg_partitioned]
WHERE _partition_load_time BETWEEN timestamp('2016-07-06') AND timestamp('2016-07-07')
AND (
v2locations LIKE '%Spain%'))
GROUP BY date,
country
ORDER BY country,
date


But I need to do these queries for a lot of different countries, so I thought maybe I would be able to download all my data on one single query, and I think I am almost there. For an example with two countries:

SELECT date,
country,
Avg(Float(tone)) tone
FROM (
SELECT integer(regexp_replace(String(date), r'\d{6}$', '')) date,
regexp_replace(v2tone, r',.*', '') tone,
regexp_extract(v2locations, r'(Spain|Chile)') country
FROM [gdelt-bq:gdeltv2.gkg_partitioned]
WHERE _partition_load_time BETWEEN timestamp('2016-07-06') AND timestamp('2016-07-07')
AND (
v2locations LIKE '%Spain%'
OR v2locations LIKE '%Chile%'))
GROUP BY date,
country
ORDER BY country,
date


Now, the problem here is that for the entries that have both Chile and Spain need to be averaged in both the groups for Spain and Chile. As the code is now, I get the correct results for Chile, since it comes first alphabetically, but obviously the results for Spain are wrong, since the entries with both countries have been averaged into the Chile group.

My question is: How can I group the entries that has both the terms Spain and Chile in the V2Locations column in the both groups? Is it possible to do that?

EDIT:
While the answers below do answer my question, they do also lead to pretty high costs. Then what I am doing now is making the aggregations locally, to be able to take advantage of the partition featrues of the GDELT database. That is, I exctract the average tone for the different intersections of the countries, as well as the number of observations. This allows for computing the actual country means locally (not on BigQuery). As the amount of data exctrated increases, this takes a long time to compute, but it saves quite a bit on costs, and allows for the extraction of data of one extra country at zero extra cost.

SELECT date,
concat(IF(regexp_match(country,'Cataluna'),'Cataluna',''),
IF(regexp_match(country,'Chile'),'Chile',''),'') country,
AVG(FLOAT(tone)) Tone,
count(tone) num,
FROM (
SELECT INTEGER(REGEXP_REPLACE(STRING(DATE), r'\d{6}$', '')) date,
REGEXP_REPLACE(V2Tone, r',.*', '') tone,
V2Locations country,
FROM [gdelt-bq:gdeltv2.gkg_partitioned]
WHERE _PARTITION_LOAD_TIME BETWEEN TIMESTAMP('2016-05-01')
AND TIMESTAMP('2016-10-23')
AND (V2Locations like '%Cataluna%'
OR V2Locations like '%Chile%'))
GROUP BY date, country
ORDER BY country, date

Answer

Try below (BigQuery Legacy SQL mode)

SELECT DATE, country, AVG(FLOAT(tone)) Tone
FROM (
  SELECT INTEGER(REGEXP_REPLACE(STRING(DATE), r'\d{6}$', '')) AS DATE,
    REGEXP_REPLACE(V2Tone, r',.*', '') tone,
    country 
  FROM [gdelt-bq:gdeltv2.gkg_partitioned] gkg 
  CROSS JOIN (
    SELECT country FROM 
      (SELECT 'Chile' AS country),
      (SELECT 'Spain' AS country)
  ) c
  WHERE _PARTITION_LOAD_TIME BETWEEN TIMESTAMP('2016-07-07') AND TIMESTAMP('2016-07-07') 
  AND gkg.V2Locations LIKE CONCAT('%', country, '%') 
) x
GROUP BY DATE, country 
ORDER BY country, DATE
Comments