JudyJiang - 8 months ago 36

SQL Question

I'm trying to calculate an exponential moving average (EMA) in Hive.

The EMA formula is

EMA = (K * (C - P)) + P, where K is the smoothing factor (say 0.5), C is the current value, and P is the previous EMA. Suppose the table looks like this:

| ID | Value | Date |
|----|-------|------------|
| 1 | 10 | 2010-05-03 |
| 2 | 15 | 2010-05-06 |
| 3 | 17 | 2010-05-13 |

And the EMA should be:

| ID | EMA | Date |
|----|-----|------------|
| 1 | `10` | 2010-05-03 |
| 2 | `0.5*(15 - 10) + 10 = 12.5` | 2010-05-06 |
| 3 | `0.5*(17 - 12.5) + 12.5 = 14.75` | 2010-05-13 |
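To pin down the expected output, the recurrence can be sketched outside Hive. This is plain Python, not Hive SQL, and the function name `ema` is only illustrative; it assumes the first EMA is simply the first value, as in the table above.

```python
def ema(values, k=0.5):
    """Return the running EMA of `values`, seeded with the first value."""
    result = []
    prev = None
    for current in values:
        # EMA = K * (C - P) + P; the first row has no previous EMA, so use C itself
        prev = current if prev is None else k * (current - prev) + prev
        result.append(prev)
    return result


print(ema([10, 15, 17]))  # [10, 12.5, 14.75]
```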

Instead of implementing a UDF in Java, I'm wondering whether I can get the same results using only Hive SQL's built-in functions. I think the LAG function should be applied here, but I'm not very experienced with databases. Am I heading in the right direction? Is there a Hive SQL way to do this?

Thanks a lot!!

Answer

This is a little complicated, because the coefficients for the first two numbers are always the same, as you have described the problem. I would be inclined to do this:

```
-- n is the zero-based row position; each row is weighted by 2^n
select v.*,
       sum(power(2, n) * val) over (order by id) /
       sum(power(2, n)) over (order by id) as weighted_avg
from (select v.*, row_number() over (order by id) - 1 as n
      from vals v
     ) v
```

However, this gives 10, 13.33, and 15.43 for the sample data. Relative to what you want, it underweights the first value. This is readily fixed by adding the first value in once more:

```
select v.*,
       -- the first value is counted one extra time (no "else", so negative values also work)
       (max(case when n = 0 then val end) over (order by id) +
        sum(power(2, n) * val) over (order by id)
       ) / (1 + sum(power(2, n)) over (order by id)) as ema
from (select v.*, row_number() over (order by id) - 1 as n
      from vals v
     ) v
```
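One way to see why the adjusted query reproduces the recurrence exactly is to unroll it. This is only a derivation sketch for K = 0.5; the symbols are introduced here and are not from the question: C_i is the value in the row with n = i, and E_n is the EMA at that row.

```latex
\begin{aligned}
E_0 &= C_0, \qquad
E_n = \tfrac{1}{2}\,(C_n - E_{n-1}) + E_{n-1} = \tfrac{1}{2}\,(E_{n-1} + C_n) \\
E_n &= \frac{C_0}{2^{n}} + \sum_{i=1}^{n} \frac{C_i}{2^{\,n-i+1}}
     = \frac{2\,C_0 + \sum_{i=1}^{n} 2^{i} C_i}{2^{\,n+1}} \\
    &= \frac{C_0 + \sum_{i=0}^{n} 2^{i} C_i}{1 + \sum_{i=0}^{n} 2^{i}},
     \qquad \text{since } \sum_{i=0}^{n} 2^{i} = 2^{\,n+1} - 1 .
\end{aligned}
```

The second line shows that C_0 and C_1 share the coefficient 1/2^n, which is the "first two numbers" observation above. The last line is the expression the query computes: the extra C_0 in the numerator and the extra 1 in the denominator. For a smoothing factor other than 0.5, the weights would no longer be plain powers of 2.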

Here is a SQL Fiddle using Oracle that demonstrates the code. I'm not 100% sure the numeric functions have the same names in Hive, but they should be similar. Also, if your sequences are long, you run the risk of numeric overflow with this particular code, because power(2, n) grows exponentially with the row number.
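The arithmetic of both queries can be checked outside the database. The following is a minimal Python sketch, not Hive code: it mimics the running window sums with `itertools.accumulate` on the three sample values, and the variable names are illustrative only.

```python
from itertools import accumulate

vals = [10, 15, 17]                           # sample values, ordered by id
weights = [2 ** n for n in range(len(vals))]  # power(2, n), n = row_number - 1

# Running sums, mirroring sum(...) over (order by id)
run_num = list(accumulate(w * v for w, v in zip(weights, vals)))
run_den = list(accumulate(weights))

# First query: plain weighted running average
first_query = [num / den for num, den in zip(run_num, run_den)]

# Second query: the first value is added once more, and 1 is added to the weight total
second_query = [(vals[0] + num) / (1 + den) for num, den in zip(run_num, run_den)]

print([round(x, 2) for x in first_query])  # [10.0, 13.33, 15.43]
print(second_query)                        # [10.0, 12.5, 14.75]
```

Python integers do not overflow, so this sketch will not show the overflow problem. In a database where `power` returns a double-precision value (an assumption worth checking for your Hive version), 2^n exceeds the double range once n reaches 1024, and precision degrades well before that.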