euge1220 euge1220 - 1 month ago 14
SQL Question

Finding Covariance using SQL

# dt---------indx_nm1-----indx_val1-------indx_nm2------indx_val2
2009-06-08----ABQI------1001.2------------ACNACTR----------300.05
2009-06-09----ABQI------1002.12 ----------ACNACTR----------341.19
2009-06-10----ABQI------1011.4------------ACNACTR----------382.93
2009-06-11----ABQI------1015.43 ----------ACNACTR----------362.63


I have a table that looks like ^ (but with hundreds of rows that dates from 2009 to 2013). Is there a way that I could calculate the covariance : [(
indx_val1
- avg(
indx_val1
)) * (
indx_val2
- avg(
indx_val2
)] divided by total number of rows for each value of
indx_val1
and
indx_val2
(loop through the entire table) and return just a simple value for cov(
ABQI
,
ACNACTR
)

Answer

Since you have aggregates operating over two different groups, you will need two different queries. The main one groups by dt to get your row values per date. The other query has to perform AVG() and COUNT() aggregates across the whole rowset.

To use them both at the same time, you need to JOIN them together. But since there's no actual relation between the two queries, it is a cartesian join on the expression 1 = 1 in the ON. Effectively, that joins every row of the main query with the single row retrieved by the aggregate query. You can then perform the arithmetic in the SELECT list, using values from both:

So, building on the query from your earlier question:

SELECT 
 indxs.*,
 ((indx_val2 - indx_val2_avg) * (indx_val1 - indx_val1_avg)) / total_rows AS cv
FROM (
    SELECT 
      dt,
      MAX(CASE WHEN indx_nm = 'ABQI' THEN indx_nm ELSE NULL END) AS indx_nm1,
      MAX(CASE WHEN indx_nm = 'ABQI' THEN indx_val ELSE NULL END) AS indx_val1,
      MAX(CASE WHEN indx_nm = 'ACNACTR' THEN indx_nm ELSE NULL END) AS indx_nm2,
      MAX(CASE WHEN indx_nm = 'ACNACTR' THEN indx_val ELSE NULL END) AS indx_val2
    FROM table1 a
    GROUP BY dt
  ) indxs 
  INNER JOIN (
    /* Join against a query returning the AVG() and COUNT() across all rows */
    SELECT
      'ABQI' AS indx_nm1_aname,
      AVG(CASE WHEN indx_nm = 'ABQI' THEN indx_val ELSE NULL END) AS indx_val1_avg,
      'ACNACTR' AS indx_nm2_aname,
      AVG(CASE WHEN indx_nm = 'ACNACTR' THEN indx_val ELSE NULL END) AS indx_val2_avg,
      COUNT(*) AS total_rows
    FROM table1 b
    WHERE indx_nm IN ('ABQI','ACNACTR')
    /* And it is a cartesian product */
  ) aggs ON 1=1
WHERE
  indx_nm1 IS NOT NULL
  AND indx_nm2 IS NOT NULL
ORDER BY dt

Here's a demo, building on your earlier one: http://sqlfiddle.com/#!6/2ec65/14

Comments