Alexey Ferapontov - 23 days ago
R Question

R sub with perl - starts search backwards?

I have strings that look like `a` shown below. I need to extract the part of the string between the first `//` and the first subsequent `/`. I use `sub` with `perl = F`, but it's roughly 4 times slower than with `perl = T`. So I tried `perl = T` and found that the search seems to start from the END of the string?

a = "https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke"
print(gsub(".*//(.*?)/.*","\\1",a))

"moo.com"

print(gsub(".*//(.*?)/.*","\\1",a,perl=T))

"ghhg"


`moo.com` is what I need. I am very surprised to see this. Is it documented somewhere? How can I rewrite it with `perl`? I have 20M rows to work with, and speed is important. Thanks!

Edit: it is not given that every string will start with `http`.

Answer

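With perl = T the leading .* is greedy: it consumes as much of the string as it can and only backtracks as far as needed, so the // in the pattern ends up matching the last // in the string. You can try .*?//(.*?)/.* to make the first .* lazy too, so that // matches the first // instance: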

gsub(".*?//(.*?)/.*","\\1",a,perl=T)
# [1] "moo.com"
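A possible variant, offered as a sketch rather than a benchmarked recommendation: the lazy capture group can be replaced with a negated character class, which says directly "everything up to the next /". The helper name `extract_host` and the second test string `b` are made up for illustration. Note that `sub` returns a string unchanged when the pattern does not match, so inputs without a `//` followed by a `/` would need separate handling.

```r
# Hypothetical helper: return the text between the first "//" and the next "/".
# Strings that do not match the pattern are returned unchanged by sub().
extract_host <- function(x) {
  sub("^.*?//([^/]*)/.*$", "\\1", x, perl = TRUE)
}

a <- "https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke"
b <- "//foo.org/bar"  # illustrative input that does not start with "http"

extract_host(c(a, b))
# [1] "moo.com" "foo.org"
```

Because `sub` is vectorized, the helper can be applied to a whole column at once instead of row by row.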

And ?gsub says:

The standard regular-expression code has been reported to be very slow when applied to extremely long character strings (tens of thousands of characters or more): the code used when perl = TRUE seems much faster and more reliable for such usages.

The standard version of gsub does not substitute correctly repeated word-boundaries (e.g. pattern = "\b"). Use perl = TRUE for such matches.
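Since speed is the concern, it may be worth timing both engines on your own data before committing to one. The following is only a rough sketch: the vector `x` is synthetic, its size is an arbitrary assumption, and the timings will vary by machine and R version.

```r
a <- "https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke"
x <- rep(a, 1e5)  # synthetic data; the size is an arbitrary assumption

# Default engine, original pattern from the question
t_default <- system.time(r1 <- sub(".*//(.*?)/.*", "\\1", x))[["elapsed"]]

# perl = TRUE, with the leading .* made lazy
t_perl <- system.time(r2 <- sub(".*?//(.*?)/.*", "\\1", x, perl = TRUE))[["elapsed"]]

# Elapsed seconds for each variant
print(c(default = t_default, perl = t_perl))
```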
