user1956609 - 8 months ago
Java Question

Java/Hive regex interpretation

Straightforward question, it's just difficult to google regex syntax...

I'm going through the HortonWorks Hive tutorials (Hive uses the same regex syntax as Java), and the following SELECT statement uses regex to pull fields from what looks like comma-separated data...

INSERT OVERWRITE TABLE batting
SELECT
regexp_extract(col_value,'^(?:([^,]*)\.?){1}',1) player_id,
regexp_extract(col_value,'^(?:([^,]*)\.?){2}',1) year,
regexp_extract(col_value,'^(?:([^,]*)\.?){9}',1) run
FROM temp_batting;


The data looks like this:

PlayerID,yearID,stint,teamID,lgID,G,G_batting,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,G_old
aardsda01,2004,1,SFN,NL,11,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,11
aardsda01,2006,1,CHN,NL,45,43,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,45
aardsda01,2007,1,CHA,AL,25,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2

So PlayerID is in column 1, year is in column 2, and R (runs) is in column 9. How is regexp_extract successfully pulling this data?

I'm new to non-capturing groups, but it looks to me like the entire thing is a non-capturing group. Also, I'm used to seeing {1}, {2}, or {9} in the form [0-9]{9} meaning it matches a 9-digit number. In this case it looks like it's pointing to the 9th match of something, what is this syntax called?

Answer

First break apart the regex:

^(?:([^,]*)\.?){n}
  • ^ is the start of a String
  • (?:...){n} is a non-capturing group repeated n times
  • ([^,]*) is a capturing group around a negated character class; it matches any character other than "," zero or more times
  • \.? is an optional (literal) .

So, how does this work?

The non-capturing group is solely there for the numeric quantifier, i.e. it makes the entire pattern in the group repeat n times.

The text actually being captured comes from the capturing group ([^,]*). I'm not sure why the optional . is there: none of the fields in your sample data end with a ., but I assume some inputs do.

What happens is that the capturing group matches n times, but only the last capture is kept, and it is stored in the first group, i.e. group 1. Group 1 is the default for regexp_extract, and the query also asks for it explicitly with the third argument.
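A minimal Java sketch of the "only the last capture is stored" behavior. The class name, the helper lastCapture, the pattern and the input string are all made up for illustration; none of them come from the tutorial.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    // Hypothetical helper: returns group 1 of the first match,
    // or null when the pattern does not match at all.
    static String lastCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The inner group matches "a", then "b", then "c" as the outer
        // non-capturing group repeats three times; only "c" is kept.
        System.out.println(lastCapture("^(?:([a-z]);?){3}", "a;b;c")); // c
    }
}
```

Each pass through the repeated group overwrites what group 1 held before, which is why asking for group 1 after n repetitions gives the n-th piece.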

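So when the pattern repeats once, as in the first case, we capture the first element of the comma-separated row. For the repetition to step on to the second or ninth element, each pass also has to consume the comma after the field, so the optional character needs to be the separator (,? rather than \.?). As written with \.?, nothing matches the comma, so the second pass matches an empty string at the first comma and group 1 ends up empty. With ,? in place, two repetitions capture the second element and nine repetitions capture the ninth.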
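To check this walkthrough, here is a small Java sketch run against the second sample row. The helpers extract and nthField are hypothetical names made up for this example, and extract only approximates what regexp_extract does with index 1. One assumption is worth flagging: the sketch mainly uses an optional comma (,?) after each field so that every repetition steps past the separator. With the \.? shown in the query nothing consumes the comma, and the last line shows that two repetitions then leave group 1 empty.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldExtractDemo {
    // Hypothetical stand-in for regexp_extract(col_value, regex, 1):
    // returns group 1 of the first match, or null when nothing matches.
    static String extract(String row, String regex) {
        Matcher m = Pattern.compile(regex).matcher(row);
        return m.find() ? m.group(1) : null;
    }

    // Builds the pattern for n repetitions. `sep` is the optional character
    // after each field: "," is an assumption, "\\." is what the query shows.
    static String nthField(String sep, int n) {
        return "^(?:([^,]*)" + sep + "?){" + n + "}";
    }

    public static void main(String[] args) {
        String row = "aardsda01,2006,1,CHN,NL,45,43,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,45";

        // Optional comma: each repetition moves on to the next field.
        System.out.println(extract(row, nthField(",", 1)));   // aardsda01
        System.out.println(extract(row, nthField(",", 2)));   // 2006
        System.out.println(extract(row, nthField(",", 9)));   // 0

        // Optional literal dot, as written in the query: the second pass
        // matches an empty string at the first comma, so group 1 is "".
        System.out.println("[" + extract(row, nthField("\\.", 2)) + "]"); // []
    }
}
```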

The pattern itself is actually pretty horrible, as it allows a zero-length match to be repeated. This means the regex engine can backtrack a lot when the input does not match. I imagine this isn't an issue for you, but it is generally bad practice.

It would be best to either make the [^,]* possessive by adding a +:

^(?:([^,]*+)\.?){n}

Or make the entire non-capturing group atomic:

^(?>([^,]*)\.?){n}
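As a quick sanity check, the sketch below compiles the original pattern and both suggested variants in Java with n = 1 and prints what group 1 holds. The class name, the helper firstGroup and the shortened input row are illustrative choices, not part of the tutorial. It only shows that the variants are valid Java syntax and capture the same text here; it does not measure backtracking.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PossessiveAtomicDemo {
    // Hypothetical helper: group 1 of the first match, or null if no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened prefix of the first sample row.
        String row = "aardsda01,2004,1,SFN,NL";

        String[] variants = {
            "^(?:([^,]*)\\.?){1}",   // original: greedy quantifier
            "^(?:([^,]*+)\\.?){1}",  // possessive quantifier *+
            "^(?>([^,]*)\\.?){1}"    // atomic group (?>...)
        };

        // All three print aardsda01 for this input.
        for (String regex : variants) {
            System.out.println(firstGroup(regex, row));
        }
    }
}
```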