Keren Keren - 3 years ago 350
Scala Question

Spark Regexp: Split column based on date

I have a column, called "data", in my dataframe that looks like this:


I would like to separate this into three different columns that look like:

col1: {"blah:"blah","blah":"blah"""
col2: 10/7/17
col3: service

I have tried this approach:

val separate = df.withColumn("col1", regexp_extract($"data", "(/(0[1-9]|1[012])[- \/.](0[1-9]|[12][0-9]|3[01])[- \/.](19|20)\d\d/)", 1)
.withColumn("col2",regexp_extract($"data", "(/(0[1-9]|1[012])[- \/.](0[1-9]|[12][0-9]|3[01])[- \/.](19|20)\d\d/)", 2))

But this regex doesn't really get me through the door. I feel like I'm missing something about how the regex operator works in Spark. Any ideas?

Thanks so much!! :)

edit-rules for columns:

  • col1: before the date value

  • col2: date value

  • col3: after the datevalue

Answer Source

Okay, as you confirmed the rules are:

  • col1: Match until it finds the last "
  • col2: Match the date
  • col3: The rest of the string

The regex you need is:


However, when you use it on the regexp_extract() function, you must escape the backslashes, so for each column, you'll use:

regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", N)

Based on the code you wrote, try using this:

val separate = df.withColumn("col1", regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", 1)).withColumn("col2",regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", 1)).withColumn("col3",regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", 3))
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download