arv100kri - 1 year ago
Java Question

Get fields by name in Pig?

Currently I have a simple Pig script which reads from a file on a Hadoop filesystem:

my_input = LOAD 'input_file' AS (A, B, C);

Then I have another line of code which needs to manipulate the fields, for instance converting them to uppercase (as in the Pig UDF tutorial).

I do something like:

manipulated = FOREACH my_input GENERATE myudf.Upper(A, B, C);

Now, in my UDF, I know that I can get the values of A, B and C like this (assuming they are all Strings):

public String exec(Tuple input) throws IOException {
    // yada yada yada
    // Fields arrive in the order they were passed to the UDF.
    String A = (String) input.get(0);
    String B = (String) input.get(1);
    String C = (String) input.get(2);
    // yada yada yada
}

Is there any way I can get the value of a field by its name? For instance, if I need to get 10 fields, is there no other way than to call input.get(i) for i from 0 to 9?

I am new to Pig, so I am interested in knowing why this is the case. Is there something like tuple.getByFieldName('Field Name')?

Answer Source

This is not possible, nor would it be very good design to allow it. Pig field names are like variable names. They allow you to give a memorable name to something that gives you insight into what it means. If you use those names in your UDF, you are forcing every Pig script which uses the UDF to adhere to the same naming scheme. If you decide later that you want to think of your variables a little differently, you can't reflect that in their names because the UDF would not function anymore.

The code that reads data from the input tuple in your UDF is like a function declaration. It establishes how to treat each argument to the function.

If you really want to be able to do this, you can build a map easily enough using the TOMAP builtin function, and have your UDF read from the map. This greatly hurts the reusability of your UDF for the reasons mentioned above, but it is nevertheless a fairly simple workaround.
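As a rough sketch of that workaround: the script could call the UDF as myudf.Upper(TOMAP('A', A, 'B', B, 'C', C)), and exec would then cast input.get(0) to a Map<String, Object> and look fields up by key. The class and method names below (MapFieldReader, upperByName) are made up for illustration and are not part of Pig. The code is plain Java with no Pig dependency, so it only shows the map-reading logic that would sit inside exec, under the assumption that every value has a sensible toString().

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Pig. It holds the logic a UDF's exec()
// could run after casting input.get(0) to a Map built by TOMAP.
final class MapFieldReader {

    private MapFieldReader() {
        // Static helper only; no instances.
    }

    // Looks up each requested field by name and returns its upper-cased value.
    // A field that is missing from the map (or null) is reported as null.
    static Map<String, String> upperByName(Map<String, Object> fields, String... names) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : names) {
            Object value = fields.get(name);
            result.put(name, value == null ? null : value.toString().toUpperCase(Locale.ROOT));
        }
        return result;
    }
}
```

Note that the keys 'A', 'B' and 'C' are now part of the UDF's contract, which is exactly the coupling described above: any script calling it has to build the map with those key names.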