Jordan Davis - 2 months ago
C Question

Regex URL Capturing Group

I'm writing a regular expression and trying to get each part of a URL into its own capture group for extraction:


  • Protocol (http,https)

  • Sub Domain (sub)

  • Domain (domain)

  • Domain Extension (com,net)

  • Path (/path/to/file - this is to be the path to the directory the file is contained in)

  • URI (file name)

  • URI Extension (file extension - js,css,pdf)



Sample URLs:

http://domain.com/path1/to/file.js
http://domain.com/path-dash/to-dash/file.js
http://domain.com/path-dash/to-dash/file-name.js
https://sub.domain.com/path/to/file.js
http://sub.domain-dash.net/path/to/file.js
http://sub-dash.domain.com/path/to/file.js
http://sub-dash.domain-dash.com/path/to/file.js


What I have so far:

/(https?):\/\/(\w+[\-]?\w+)?.?(\w+[\-]?\w+)?/gm


Desired Output:


  • Group1: protocol

  • Group2: sub domain (if exist, or blank if not)

  • Group3: domain

  • Group4: domain extension

  • Group5: directory path

  • Group6: file name

  • Group7: file extension



Question: How can I get each URL part into its own capture group across all the examples I have listed above?

Answer

You can use https://regex101.com/ to check the group numbers. If having extra groups doesn't bother you, then with

/(https?):\/\/(([\w-]+)\.)?([\w-]+)\.(com|net)((\/[\w-]+)*\/([\w-]+)\.([a-z]+))/

you'll get

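Group 1: protocol

Group 3: subdomain

Group 4: domain

Group 5: top-level domain (or, as you say, domain extension)

Group 6: the full path, file included (/path/to/file.js)

Group 8: filename

Group 9: extension

Groups 2 and 7 are the extras: the subdomain with its trailing dot, and the last directory segment.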
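To see where those numbers come from, here is a minimal sketch in Python (my choice of language, the question doesn't name one) using a pattern equivalent to the one above, with the forward slashes left unescaped because Python doesn't need them escaped. Groups are numbered by the position of their opening parenthesis, which is why the wrappers take up numbers 2 and 7. The helper name numbered_groups is just for illustration.

```python
import re

# Equivalent of the pattern above in Python syntax; every pair of
# parentheses is a capturing group, numbered by its opening parenthesis.
PATTERN = re.compile(
    r"(https?)://(([\w-]+)\.)?([\w-]+)\.(com|net)((/[\w-]+)*/([\w-]+)\.([a-z]+))"
)


def numbered_groups(url):
    """Return {group_number: captured_text} for url, or None if no match."""
    m = PATTERN.match(url)
    if m is None:
        return None
    return {i: m.group(i) for i in range(1, PATTERN.groups + 1)}


if __name__ == "__main__":
    # One of the sample URLs from the question.
    for number, text in numbered_groups("https://sub.domain.com/path/to/file.js").items():
        print(number, repr(text))
```

When the URL has no subdomain, groups 2 and 3 did not take part in the match, so Python reports them as None rather than as an empty string.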


If you DO care about the numbers, you can always use non-capturing groups, written (?:...):

(https?):\/\/(?:([\w-]+)\.)?([\w-]+)\.(com|net)((?:\/[\w-]+)*\/([\w-]+)\.([a-z]+))

That way you'll indeed get

Group 1: protocol

Group 2: subdomain

Group 3: domain

Group 4: domain extension (TLD)

Group 5: the full path, file included (/path/to/file.js)

Group 6: filename

Group 7: extension
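Note that group 5 above holds the whole path, file name included, while the question asks for the directory only. One possible variant, sketched here in Python (my choice of language, and my own adjustment rather than part of the pattern above), closes group 5 before the final slash so that it stops at the directory. The names URL_RE and split_url are made up for this example.

```python
import re

# Hypothetical variant of the non-capturing pattern: group 5 is closed
# before the last "/" so it captures only the directory part.
URL_RE = re.compile(
    r"(https?)://(?:([\w-]+)\.)?([\w-]+)\.(com|net)"
    r"((?:/[\w-]+)*)/([\w-]+)\.([a-z]+)"
)


def split_url(url):
    """Return the 7 parts as a tuple of strings, or None if url doesn't match."""
    m = URL_RE.fullmatch(url)
    if m is None:
        return None
    # A missing subdomain comes back as None; turn it into "" (blank).
    return tuple(part or "" for part in m.groups())


SAMPLES = [
    "http://domain.com/path1/to/file.js",
    "http://domain.com/path-dash/to-dash/file.js",
    "http://domain.com/path-dash/to-dash/file-name.js",
    "https://sub.domain.com/path/to/file.js",
    "http://sub.domain-dash.net/path/to/file.js",
    "http://sub-dash.domain.com/path/to/file.js",
    "http://sub-dash.domain-dash.com/path/to/file.js",
]

if __name__ == "__main__":
    for url in SAMPLES:
        print(split_url(url))
```

Like the patterns above, this only handles what the sample URLs show: a .com or .net extension, at most one subdomain level, and a lowercase file extension. Anything else (query strings, ports, deeper subdomains) would need the pattern extended.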