Diego Diego - 1 month ago 7
Perl Question

regex capturing to start at \b or end of (www\.)

I am trying to capture the first occurrence of anything that looks like a domain name in a string. For example, my.domain.home.com from 'dfasdf https://www.my.domain.home.com fadsfas'. I am using a \b assertion or a non-capturing group (?:www\.) to mark the start of my capturing group. But instead I get www.my.domain.home.com, i.e. the www. is not stripped out.

This is my full regex:

\b(?:www\.)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b


This is the part that I am unsure of:

\b(?:www\.)


How can I make my capture start at the beginning of the word OR at the end of 'www.'?

I have checked it with https://www.regex101.com/r/NjR11m/1/tests as well, but my final destination is Teradata 15.10 regex, which is said to be compliant with the Perl dialect. So if you can help me in the Perl context, I guess I will be fine.

SELECT 'dfasdf https://www.my.domain.home.com fadsfas' AS string,
REGEXP_SUBSTR(string,
'\b(?:www\.)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b'
) AS url_to_match;


For 'dfasdf https://my.domain.home.com fadsfas' it should return my.domain.home.com as well.

Additional examples of strings that should also return my.domain.home.com:


'dfasdf my.domain.home.com fadsfas'


'dfasdf ,my.domain.home.com-- fadsfas'


'dfasdf www.my.domain.home.com#fadsfas'

cco cco
Answer

The www. ends up in your result because you're taking the 0th group, which is the full match rather than just the capturing group. I don't know how to change which group is returned, but it is possible to reformulate the regex so that group 0 and group 1 have the same value, like so:

\b(?!www\.)([-a-z0-9]{1,63}(?:\.[-a-z0-9]{1,63})+)

This just says the match can't start at www., rather than allowing the match to start there and then having to ignore it.
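
As a rough illustration, the sketch below runs the pattern above against the sample strings from the question. It uses Python's re module purely as a test harness (an assumption on my part; the question targets Teradata's REGEXP_SUBSTR) and relies only on \b, (?!...) and (?:...), which Perl supports with the same meaning. The names PATTERN and first_domain are made up for this example.

```python
import re

# Pattern from the answer: refuse to start a match at "www." instead of
# matching "www." and trying to discard it afterwards.
PATTERN = re.compile(r"\b(?!www\.)([-a-z0-9]{1,63}(?:\.[-a-z0-9]{1,63})+)")


def first_domain(text):
    """Return the first domain-like substring of text, or None."""
    m = PATTERN.search(text)
    # Group 0 (the whole match) and group 1 are identical for this pattern,
    # so a function that can only return the whole match still works.
    return m.group(0) if m else None


samples = [
    "dfasdf https://www.my.domain.home.com fadsfas",
    "dfasdf https://my.domain.home.com fadsfas",
    "dfasdf my.domain.home.com fadsfas",
    "dfasdf www.my.domain.home.com#fadsfas",
]
for s in samples:
    print(first_domain(s))  # each sample yields my.domain.home.com
```

One caveat with this simplified pattern: because the character class lets a label end in a hyphen, 'dfasdf ,my.domain.home.com-- fadsfas' yields my.domain.home.com-- rather than my.domain.home.com. Reusing the stricter label sub-pattern from the question after the lookahead would likely avoid that.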

I've made a modified version of your regex that shows how it works. Note that if you want to match names with mixed-case alphanumerics, you'll need to add A-Z to the a-z0-9 classes or turn on case-insensitivity. Matching non-ASCII domain names is more work, and is left for the interested reader to work out.
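
To make the two mixed-case options concrete, here is a small sketch, again using Python's re module only as a stand-in harness (an assumption; how the flag is passed in Teradata is not covered here). BASE, FLAGGED, WIDENED and the sample text are illustrative names and data, not part of the original question.

```python
import re

BASE = r"\b(?!www\.)([-a-z0-9]{1,63}(?:\.[-a-z0-9]{1,63})+)"

# Option 1: keep the pattern and turn on case-insensitive matching.
# The flag also applies to the (?!www\.) lookahead.
FLAGGED = re.compile(BASE, re.IGNORECASE)

# Option 2: add A-Z to the character classes only.
# The lookahead stays case-sensitive and still checks for lowercase "www.".
WIDENED = re.compile(r"\b(?!www\.)([-A-Za-z0-9]{1,63}(?:\.[-A-Za-z0-9]{1,63})+)")

text = "dfasdf https://WWW.My.Domain.Home.COM fadsfas"

flagged = FLAGGED.search(text)
widened = WIDENED.search(text)

print(flagged.group(0))  # My.Domain.Home.COM
print(widened.group(0))  # WWW.My.Domain.Home.COM
```

The difference in output comes from the lookahead: widening only the character classes leaves (?!www\.) case-sensitive, so an uppercase WWW. prefix is not rejected. If uppercase prefixes can occur in the data, the case-insensitive flag (or a lookahead that also covers uppercase) is probably the safer choice.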
