EsPEEEE TV EsPEEEE TV - 26 days ago 7
SQL Question

BQ SQL solution solution for comparing rows based on variance

I'm trying to compare scraped retail item price data in BigQuery (~2-3B rows depending on the time period and retailers included); with the intent to identify meaningful price differences. For example $1.99 vs $2.00 isn't meaningful, but $1.99 vs $2.50 is meaningful. Meaningful is quantified as a 2% difference between prices.

Example dataset for one item looks like this:

ITEM Price($) Meaningful (This is the column I'm trying to flag)
Apple $1.99 Y (lowest price would always be flagged)
Apple $2.00 N ($1.99 v $2.00)
Apple $2.01 N ($1.99 v $2.01) Still using $1.99 for comparison
Apple $2.50 Y ($1.99 v $2.50) Still using $1.99 for comparison
Apple $2.56 Y ($2.50 v $2.56) Now using $2.50 as new comp. price
Apple $2.62 Y ($2.55 v $2.62) Now using $2.56 as new comp. price


I was hoping to solve the problem just using SQL Window functions (lead, lag, partition over, etc..) comparing the current row's price to the next following row. However, that doesn't work correctly when I get to a non-meaningful price because I always want the next value to be compared to the most recent meaningful price (see $2.50 row example above that's compared to $2.00 and NOT $2.01 in the prior row)

My Questions:




  • Is it possible to solve this with SQL alone in BigQuery? (e.g. What creative SQL logic solution am I overlooking, like bucketing based on the variance amounts?)

  • What programmatic options do I have since I can't use stored procedures with BQ? Python/Dataframes in GCP Datalab? BQ UDFs?


Answer Source

Below is for BigQuery Standard SQL

#standardSQL
CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
LANGUAGE js AS """
  var result = [];
  var last = 0;
  var flag = '';
  for (i = 0; i < prices.length; i++){
    if (i == 0) {
      last = prices[i];
      flag = 'Y'
    } else {
      if ((prices[i] - last)/last > 0.02) {
        last = prices[i];
        flag = 'Y'
      } else {flag = 'N'}
    }
    var rec = [];
    rec.price = prices[i];
    rec.flag = flag;
    result.push(rec); 
  } 
  return result;
""";
SELECT item, rec.* 
FROM (
  SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
  FROM `yourTable`
  GROUP BY item
), UNNEST(x(prices) ) AS rec
-- ORDER BY item, price  

You can play with / test it with below dummy data from your question

#standardSQL
CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
LANGUAGE js AS """
  var result = [];
  var last = 0;
  var flag = '';
  for (i = 0; i < prices.length; i++){
    if (i == 0) {
      last = prices[i];
      flag = 'Y'
    } else {
      if ((prices[i] - last)/last > 0.02) {
        last = prices[i];
        flag = 'Y'
      } else {flag = 'N'}
    }
    var rec = [];
    rec.price = prices[i];
    rec.flag = flag;
    result.push(rec); 
  } 
  return result;
""";
WITH `yourTable` AS (
  SELECT 'Apple' AS item, 1.99 AS price UNION ALL
  SELECT 'Apple', 2.00 UNION ALL
  SELECT 'Apple', 2.01 UNION ALL
  SELECT 'Apple', 2.50 UNION ALL
  SELECT 'Apple', 2.56 UNION ALL
  SELECT 'Apple', 2.62 
)
SELECT item, rec.* 
FROM (
  SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
  FROM `yourTable`
  GROUP BY item
), UNNEST(x(prices) ) AS rec
ORDER BY item, price    

Result is as below

item    price   flag     
----    -----   ----
Apple   1.99    Y    
Apple   2.0     N    
Apple   2.01    N    
Apple   2.5     Y    
Apple   2.56    Y    
Apple   2.62    Y