Alexander Alexander - 6 months ago 18
Python Question

How to set regex for website url pattern

The url pattern is

http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500


This website has similar URLs. The unique identifier for this URL is -p-. The URL pattern always has -p- before the word at the end of the URL.

I used the following regex

(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z


It matched, but it also matches many other URL patterns on this website.

For example, the regex should match the URL above, but it should not match

http://www.hepsiburada.com/bilgisayarlar-c-2147483646

Answer

Since you are using re.match, the pattern has to match the string from the beginning. However, the main problem is that your -p- is inside a character class, so it is treated as a set of separate characters, any one of which can match. The same goes for \w+: inside the class it is read as \w and a literal +, separately.
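To see the character-class problem concretely, here is a minimal sketch using Python's re module and the two URLs from the question. The variable names are only illustrative.

```python
import re

# The pattern from the question: the last group is a character class,
# so it matches exactly ONE character ('-', 'p', '+', or a word character).
original = re.compile(r"(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z")

# The same pattern with -p-\w+ written as a sequence instead of a class.
fixed = re.compile(r"(.*)hepsiburada\.com/([\w.-]+)(-p-\w+)$")

product_url = (
    "http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-"
    "uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500"
)
category_url = "http://www.hepsiburada.com/bilgisayarlar-c-2147483646"

# The original pattern accepts the category URL: group 3 is just the last character.
print(original.match(category_url).group(3))  # 6

# The fixed pattern requires the literal -p- followed by word characters.
print(fixed.match(category_url))              # None
print(fixed.match(product_url).group(3))      # -p-EVPHI40PFK5500
```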

So, use a sequence:

 (.*)hepsiburada\.com/([\w.-]+)(-p-\w+)$


Or, to anchor the whole URL from the start:

^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$


Note that you most likely do not need the capture groups at all, so the capturing (...) parentheses can be removed from the pattern.
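As a rough sketch of that last point, the anchored pattern can be written without capture groups and wrapped in a small helper. The names PRODUCT_URL_RE and is_product_url are hypothetical and used here only for illustration.

```python
import re

# Anchored pattern with the capturing parentheses removed.
# (?:www\.)? stays because it is a non-capturing group that makes "www." optional.
PRODUCT_URL_RE = re.compile(r"^https?://(?:www\.)?hepsiburada\.com/[\w.-]+-p-\w+$")


def is_product_url(url):
    """Return True if the URL ends in -p- followed by word characters."""
    return PRODUCT_URL_RE.match(url) is not None


product_url = (
    "http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-"
    "uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500"
)
category_url = "http://www.hepsiburada.com/bilgisayarlar-c-2147483646"

print(is_product_url(product_url))   # True
print(is_product_url(category_url))  # False
```

Because the pattern starts with ^, re.match and re.search behave the same way here.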
