hans glick - 3 months ago
R Question

Regex: match a string that contains a line break with lookarounds in str_replace in R

I'm using the str_replace function from the stringr package in R. I want to replace the substring between PARTITIONED BY and STORED AS.

These commands work:

library(stringr)
my_string="esrhjg erguhg rziughrtPARTITIONED BY hzueirghf zreeuifh iuehg reuhg riutghSTORED ASiugh oer hfz"
p="(?<=PARTITIONED BY).*(?=STORED AS)"
str_replace(my_string,p,"TO REPLACE")


These commands do not (the only change is an added \n):

my_string="esrhjg erguhg rziughrtPARTITIONED BY hz\nueirghf zreeuifh iuehg reuhg riutghSTORED ASiugh oer hfz"
p="(?<=PARTITIONED BY).*(?=STORED AS)"
str_replace(my_string,p,"TO REPLACE")


How can I make str_replace work when the "between" string contains a line break (\n)?

Answer

In the ICU regex flavor, which all stringr functions use, a dot matches any character except a newline by default.

To let the dot match newlines as well, add the inline (?s) modifier to the pattern, "(?s)(?<=PARTITIONED BY).*(?=STORED AS)":

my_string="esrhjg erguhg rziughrtPARTITIONED BY hz\nueirghf zreeuifh iuehg reuhg riutghSTORED ASiugh oer hfz"
p="(?s)(?<=PARTITIONED BY).*(?=STORED AS)"
str_replace(my_string,p,"TO REPLACE")
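As a possible alternative to the inline flag, stringr's regex() helper accepts a dotall argument that should have the same effect. The sketch below reuses the string from the question; the variable names unchanged and replaced are only illustrative.

```r
library(stringr)

my_string <- "esrhjg erguhg rziughrtPARTITIONED BY hz\nueirghf zreeuifh iuehg reuhg riutghSTORED ASiugh oer hfz"
p <- "(?<=PARTITIONED BY).*(?=STORED AS)"

# Default ICU behaviour: "." stops at the newline, so there is no match
# and the input comes back unchanged.
unchanged <- str_replace(my_string, p, "TO REPLACE")

# regex(dotall = TRUE) lets "." match line terminators too,
# which mirrors the inline (?s) modifier.
replaced <- str_replace(my_string, regex(p, dotall = TRUE), "TO REPLACE")
```

Either spelling works; regex(dotall = TRUE) may be easier to read when the pattern is stored in a variable and reused.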

Note that you do not actually need lookarounds here. Base R's sub uses the TRE regex engine by default, where . matches a newline too:

my_string = "esrhjg erguhg rziughrtPARTITIONED BY hz\nueirghf zreeuifh iuehg reuhg riutghSTORED ASiugh oer hfz"
sub("PARTITIONED BY.*STORED AS", "PARTITIONED BY -TO_REPLACE- STORED AS", my_string)
## or with backreferences:
sub("(PARTITIONED BY).*(STORED AS)", "\\1 -TO_REPLACE- \\2", my_string)


If a string contains several such substrings to replace, use str_replace_all or gsub instead, and make the quantifier lazy by replacing .* with .*? so that each match stops at the nearest STORED AS.
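To illustrate the difference between the greedy and lazy quantifier, here is a minimal base R sketch. The input string x and the placeholder X are made up for the example and are not from the question.

```r
# Hypothetical input: two PARTITIONED BY ... STORED AS spans, each containing a newline.
x <- "PARTITIONED BY a\nb STORED AS foo PARTITIONED BY c\nd STORED AS bar"

# Greedy .* runs from the first PARTITIONED BY to the last STORED AS,
# so the two spans collapse into a single replacement.
greedy <- gsub("PARTITIONED BY.*STORED AS", "PARTITIONED BY X STORED AS", x)

# Lazy .*? stops at the nearest STORED AS, so each span is replaced on its own.
lazy <- gsub("PARTITIONED BY.*?STORED AS", "PARTITIONED BY X STORED AS", x)

print(greedy)
print(lazy)
```

With stringr, the corresponding call would be str_replace_all with the pattern "(?s)(?<=PARTITIONED BY).*?(?=STORED AS)", since ICU still needs (?s) for the dot to cross the line break.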
