I have a column, called "data", in my dataframe that looks like this:
{"blah:"blah","blah":"blah"""10/7/17service
col1: {"blah:"blah","blah":"blah"""
col2: 10/7/17
col3: service
val separate = df.withColumn("col1", regexp_extract($"data", "(/(0[1-9]|1[012])[- \/.](0[1-9]|[12][0-9]|3[01])[- \/.](19|20)\d\d/)", 1)
.withColumn("col2",regexp_extract($"data", "(/(0[1-9]|1[012])[- \/.](0[1-9]|[12][0-9]|3[01])[- \/.](19|20)\d\d/)", 2))
Okay, as you confirmed the rules are:
col1
: Match until it finds the last "
col2
: Match the datecol3
: The rest of the stringThe regex you need is:
/(.+")(\d{1,2}\/\d{1,2}\/\d{1,2})(.+)/
However, when you use it on the regexp_extract()
function, you must escape the backslashes, so for each column, you'll use:
regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", N)
Based on the code you wrote, try using this:
val separate = df.withColumn("col1", regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", 1)).withColumn("col2",regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", 1)).withColumn("col3",regexp_extract($"data", "(.+\")(\\d{1,2}\\/\\d{1,2}\\/\\d{1,2})(.+)", 3))