Danny David Leybzon Danny David Leybzon - 4 months ago 19
SQL Question

Extract all characters before a period with HiveQL regex?

I have a table that looks like:


And I'd like to use HiveQL's
to return


Cam Cam

Given the docs data about regexp_extract:

regexp_extract(string subject, string pattern, int index)

Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.

So, if you have a table with a single column (let's call it description for our example) you should be able to use regexp_extract as follows to get the data before a period, if one exists, or the entire string in the absence of a period:


The components of the regex are as follows:

  • ^ start of string
  • ([^\.]+) any non-period character one or more times, in a capture group
  • \.? a period either once or no times

Because the part of the string we're interested in will be in the first (and only) capture group, we refer to it by passing the index parameter a value of 1.