kee kee - 1 month ago 20
Python Question

How to write a Python UDF for User Defined Aggregate Function in Hive

I would like to do some aggregation work on an aggregate column (after GROUP BY) in Hive using Python. I found there is UDAF for this purpose. All I can find is a Java example. Is there an example on writing in Python?

Or for python between UDF and UDAF, there is no difference? For UDAF, I just need to write it like a reducer? Please advise.

Answer

You can make use of Hive's streaming UDF functionality (TRANSFORM) to use a Python UDF which reads from stdin and outputs to stdout. You haven't found any Python "UDAF" examples because UDAF refers to the Hive Java class you extend so it would only be in Java.

When using a streaming UDF, Hive will choose whether to launch or a map or reduce job so there is no need to specify (for more on this functionality see this link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform).

Basically, your implementation would be to write a python script which reads from stdin, calculates some aggregate number and outputs it to stdout. To implement in Hive do the following:

1) First add your python script to your resource library in Hive so that it gets distributed across your cluster:

add file script.py;

2) Then call your transform function and input the columns you want to aggregate. Here is an example:

select transform(input cols)
using 'python script.py' as (output cols)
from table
;

Depending on what you need to do, you may need a separate mapper and reducer script. If you need to aggregate based on column value, remember to use Hive's CLUSTER BY/DISTRIBUTE BY syntax in your mapper stage so that partitioned data gets sent to the reducer.

Let me know if this helps.