Diego Diego - 3 years ago 108
Perl Question

regex capturing to start at \b or end of (www\.)

I am trying to capture the first occurrence of anything that looks like a domain name in a string, for example in

'dfasdf https://www.my.domain.home.com fadsfas'

I am using an assertion or a non-capturing group to mark the start of my capturing group, but the leading www. is not stripped out of the result.

This is my full regex:


This is the part that I am unsure of:


How can I make my capturing group start at the beginning of the word OR at the end of 'www.'?

I have also checked it with https://www.regex101.com/r/NjR11m/1/tests, but my final destination is the Teradata 15.10 regex engine, which is said to be compliant with the Perl dialect. So if you can help me in the Perl context, I guess I will be fine.

SELECT 'dfasdf https://www.my.domain.home.com fadsfas' AS string,
) AS url_to_match;

For 'dfasdf https://my.domain.home.com fadsfas' it should return the same result as well.

Additional examples of strings that should also return the same result:

'dfasdf my.domain.home.com fadsfas'

'dfasdf ,my.domain.home.com-- fadsfas'

'dfasdf www.my.domain.home.com#fadsfas'

cco cco
Answer Source

The problem with www. being included in the match seems to be because you're using the 0th group (which is the full match, not just the capturing groups). While I don't know how to change that, it is possible to reformulate the regex so that group 0 and group 1 have the same value, like so:


This just says the match can't start at www., rather than allowing the match to start there and then having to ignore it.
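
As a minimal sketch of this idea in Perl: the pattern below and the helper name first_domain are illustrative assumptions, not the original regex from the question or the answer. It uses a negative lookahead, (?!www\.), right after \b so that a match may begin at any word boundary except one that is followed by "www.". Because the whole pattern sits inside one capturing group, group 1 and the full match are the same string. Whether Teradata's regex functions accept this lookahead exactly as Perl does has not been verified here.

```perl
use strict;
use warnings;

# Hypothetical pattern for illustration only.
# A label is lowercase alphanumerics, with hyphens allowed only between them,
# so trailing punctuation such as "--" is not swallowed.
my $label  = qr/[a-z0-9]+(?:-[a-z0-9]+)*/;

# \b(?!www\.) : start at a word boundary, but refuse to start at "www."
# Then require at least two dot-separated labels.
my $domain = qr/\b(?!www\.)$label(?:\.$label)+/;

# Return the first domain-like substring, or undef if there is none.
sub first_domain {
    my ($text) = @_;
    return $text =~ /($domain)/ ? $1 : undef;
}

my @samples = (
    'dfasdf https://www.my.domain.home.com fadsfas',
    'dfasdf https://my.domain.home.com fadsfas',
    'dfasdf my.domain.home.com fadsfas',
    'dfasdf ,my.domain.home.com-- fadsfas',
    'dfasdf www.my.domain.home.com#fadsfas',
);

for my $s (@samples) {
    my $found = first_domain($s) // '(no match)';
    print "$found\n";    # prints my.domain.home.com for each sample
}
```

In the first sample the engine rejects the boundary before "www" because of the lookahead, finds no usable start inside "www", and then succeeds at the boundary just after "www.", which is why the prefix never enters the match.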

I've made a modified version of your regex that shows how it works. Note that if you want to match names with mixed-case alphanumerics, you'll need to add A-Z to the a-z0-9 or turn on case-insensitivity; matching non-ASCII domain names is more work and is left for the interested reader to work out.
