Observer Observer - 3 months ago 7
Python Question

Use regex to parse out part of a URL using Python

Suppose I have some URLs like the following:

URL
http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/
http://hostname.com/wqs/ck$st=fasd+/
http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav


I want to find the first + symbol in the URL, move backward until I hit a special character such as /, ?, or = (or any other special character), and then capture from that point forward until I reach a space, the end of the line, &, or /.

The regex I wrote, with help from the Stack Overflow forums, is as follows:

re.search(r"[^\w\+ ]([\w\+ ]+\+[\w\+ ]+)(?:[^\w\+ ]|$)", x).group(1)


This works for the first row, but it does not match anything in the second row. Also, the third row contains more than one such pattern and I want to capture all of them; the current regex finds only one.
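
A minimal sketch of why the second row fails, using the pattern exactly as posted (the helper name `first_match` is hypothetical, introduced only for illustration). The pattern requires at least one character from `[\w\+ ]` after a `+`, so a token that ends in `+` and is followed directly by `/` never matches, and `re.search` returns `None`:

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"[^\w\+ ]([\w\+ ]+\+[\w\+ ]+)(?:[^\w\+ ]|$)")

def first_match(url):
    """Return the first captured group, or None when nothing matches."""
    m = PATTERN.search(url)
    # Guard against None: calling .group(1) on None raises AttributeError.
    return m.group(1) if m else None

rows = [
    "http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/",
    "http://hostname.com/wqs/ck$st=fasd+/",
]
for row in rows:
    print(first_match(row))
```

With the two rows above this prints `fa+gw+hw+ek+ei` and then `None`.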

My output should be:

parsed
fa+gw+hw+ek+ei
fasd
fa+gq+hf+kg+is gl+jh+ke+oj+kp


Can anybody help me modify the existing regex to suit these needs?

Thanks

Answer

I used regexr to come up with this (regexr link):

([\w\+]*\+[\w\+]*)(?:[^\w\+]|$)

Matches:

fa+gw+hw+ek+ei fasd+ fa+gq+hf+kg+is gl+jh+ke+oj+kp

EDIT: Instead of re.search, which returns only the first match, try re.findall:

>>> import re
>>> s = "http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav"
>>> re.findall(r"([\w\+]+\+[\w\+]*)(?:[^\w\+]|$)", s)
['fa+gq+hf+kg+is', 'gl+jh+ke+oj+kp']
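
To get the exact output asked for in the question, something along these lines might work. It is only a sketch: the function name `parse_plus_tokens` is hypothetical, and stripping leading or trailing `+` signs (so that `fasd+` becomes `fasd`) is an assumption based on the expected output rather than something the regex does by itself.

```python
import re

# Token pattern from the answer, written as a raw string.
TOKEN = re.compile(r"([\w+]*\+[\w+]*)(?:[^\w+]|$)")

def parse_plus_tokens(url):
    """Return every '+'-containing token in url, joined by single spaces."""
    # Assumption: '+' signs at either edge of a token are unwanted.
    tokens = (t.strip("+") for t in TOKEN.findall(url))
    # Drop tokens that consisted only of '+' signs.
    return " ".join(t for t in tokens if t)

urls = [
    "http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/",
    "http://hostname.com/wqs/ck$st=fasd+/",
    "http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav",
]
for url in urls:
    print(parse_plus_tokens(url))
```

For the three sample URLs this prints `fa+gw+hw+ek+ei`, `fasd`, and `fa+gq+hf+kg+is gl+jh+ke+oj+kp`, one per line.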