Kir Chou - 3 years ago 186
Python Question

Regex matching on full matched substring with constraints in Python

Since this is a regex question, it may be a duplicate.

Consider the following strings:

test_str = [
"bla bla google.com bla bla", #0
"bla bla www.google.com bla bla", #1
"bla bla api.google.com bla bla", #2
"google.com", #3
"www.google.com", #4
"api.google.com", #5
"http://google.com", #6
"http://www.google.com", #7
"http://api.google.com", #8
"bla bla http://www.google.com bla bla", #9
"bla bla https://www.api.google.com bla bla" #10
]


I want to match google.* or www.google.*, but not api.google.*. In the list above, that means strings 2, 5, 8, and 10 should not produce any match.
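The requirement can be written down as a small check, which makes it easier to evaluate candidate patterns. This is only a sketch: the names expected, matched_indices, and candidate are hypothetical helpers introduced here, and the list repeats the strings from the question.

```python
import re

test_str = [
    "bla bla google.com bla bla",                 # 0
    "bla bla www.google.com bla bla",             # 1
    "bla bla api.google.com bla bla",             # 2
    "google.com",                                 # 3
    "www.google.com",                             # 4
    "api.google.com",                             # 5
    "http://google.com",                          # 6
    "http://www.google.com",                      # 7
    "http://api.google.com",                      # 8
    "bla bla http://www.google.com bla bla",      # 9
    "bla bla https://www.api.google.com bla bla", # 10
]

# Indices that should produce a match according to the requirement.
expected = {0, 1, 3, 4, 6, 7, 9}

def matched_indices(pattern):
    """Return the set of indices whose string contains a match."""
    regex = re.compile(pattern)
    return {i for i, s in enumerate(test_str) if regex.search(s)}

# A pattern that only looks for "google" matches every string,
# so it fails the check on exactly the indices that must be rejected.
candidate = r"(http[s]?://)?google[a-z.]*"
print(matched_indices(candidate) == expected)          # False
print(sorted(matched_indices(candidate) - expected))   # [2, 5, 8, 10]
```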




I have tried several regexes, but I cannot find a one-line regex that does this. Here is what I tried:

re.compile(r"((http[s]?://)?www\.google[a-z.]*)") # matches 1, 4, 7, 9
re.compile(r"((http[s]?://)?google[a-z.]*)") # matches all
re.compile(r"((http[s]?://)?.+\.google[a-z.]*)") # matches all except 0, 3, 6
re.compile(r"((http[s]?://)?!.+\.google[a-z.]*)") # matches nothing
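The last attempt matches nothing because a ?! placed after a group is not a lookahead: the ? makes the preceding group optional and the ! is then a literal exclamation mark. A negative lookahead is written (?!...). The sketch below shows the difference on shortened sample strings; the names literal_bang and not_api are hypothetical, and the anchored lookahead pattern is only an illustration of the syntax, not a full solution to the question.

```python
import re

# Here "?" makes the scheme group optional and "!" is a literal character,
# so a match requires an actual "!" in the text.
literal_bang = re.compile(r"((http[s]?://)?!.+\.google[a-z.]*)")

# A real negative lookahead is written (?!...). Anchoring with ^ matters:
# without it, the search could simply start one character later ("pi.google.com").
not_api = re.compile(r"^(?!api\.)[a-z]+\.google\.com")

print(literal_bang.search("www.google.com"))            # None
print(literal_bang.search("!www.google.com").group(0))  # !www.google.com
print(not_api.search("www.google.com").group(0))        # www.google.com
print(not_api.search("api.google.com"))                 # None
```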


What I am looking for is a way to ignore *.google.* except for www.google.* and google.*, but I got stuck on finding a way to match *.google.* in the first place.




PS: I have found an O(n**2) way to solve this with split().

r = re.compile("^((http[s]?://)?www.google[a-z.]*)|^((http[s]?://)?google[a-z.]*)")

for s in test_str:
    for seg in s.split():
        r.findall(seg)

Answer Source

You may use

(?<!\S)(?:https?://)?(?:www\.)?google\.\S*

See the regex demo.

Details

  • (?<!\S) - a position at the start of the string or right after a whitespace character (you may also use (?:^|\s) here to be more explicit, but that group consumes the whitespace, so it becomes part of the match)
  • (?:https?://)? - an optional non-capturing group matching http:// or https://
  • (?:www\.)? - an optional non-capturing group matching www.
  • google\. - the literal substring google.
  • \S* - zero or more non-whitespace characters
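The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and allows comments inside the pattern. The sketch below also compares the lookbehind with the (?:^|\s) alternative on one sample string; the names pattern, consuming, and text are hypothetical and chosen only for this illustration.

```python
import re

# The pattern from the answer, annotated part by part.
pattern = re.compile(
    r"""
    (?<!\S)         # not preceded by a non-whitespace character
    (?:https?://)?  # optional http:// or https://
    (?:www\.)?      # optional www.
    google\.        # the literal substring "google."
    \S*             # the rest of the token, up to the next whitespace
    """,
    re.VERBOSE,
)

# Same pattern, but with a consuming group instead of the lookbehind.
consuming = re.compile(r"(?:^|\s)(?:https?://)?(?:www\.)?google\.\S*")

text = "bla www.google.com bla"
print(repr(pattern.search(text).group(0)))    # 'www.google.com'
print(repr(consuming.search(text).group(0)))  # ' www.google.com'
print(pattern.search("http://api.google.com"))  # None
```

With the consuming variant, the leading space ends up in the match, so the result would need to be stripped or captured in a group.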

Python demo:

import re
test_str = [
    "bla bla google.com bla bla", #0
    "bla bla www.google.com bla bla", #1
    "bla bla api.google.com bla bla", #2
    "google.com", #3
    "www.google.com", #4
    "api.google.com", #5
    "http://google.com", #6
    "http://www.google.com", #7
    "http://api.google.com", #8
    "bla bla http://www.google.com bla bla", #9
    "bla bla https://www.api.google.com bla bla", #10
    "bla bla https://www.map.google.com bla bla" #11
]
r = re.compile(r"(?<!\S)(?:https?://)?(?:www\.)?google\.\S*")
for i,s in enumerate(test_str):
    m = r.search(s)
    if m:
        print("{}\t#{}".format(m.group(0), i))

Output:

google.com  #0
www.google.com  #1
google.com  #3
www.google.com  #4
http://google.com   #6
http://www.google.com   #7
http://www.google.com   #9
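The demo uses search(), which returns only the first match in each string. If one string may contain several URLs, findall() can be used instead; because the pattern has no capturing groups, it returns the full matches. The sample strings below are made up for illustration. Note that \S* also takes any punctuation attached to the URL.

```python
import re

r = re.compile(r"(?<!\S)(?:https?://)?(?:www\.)?google\.\S*")

# Several candidates in one string: only the allowed forms are returned.
text = "see google.com and http://www.google.com but not api.google.com"
print(r.findall(text))  # ['google.com', 'http://www.google.com']

# \S* stops only at whitespace, so a trailing comma is part of the match.
print(r.findall("Try google.com, it works"))  # ['google.com,']
```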