Jasmin Shah - 9 months ago
Python Question

What regex keeps website domains as single tokens while still separating punctuation from words?

This is the normal output:
[image: tokenizer output, not preserved]

What I want is to keep domain names as single tokens. For example, "https://www.twitter.com" should remain a single token.

My code:

import nltk
from nltk.tokenize.regexp import RegexpTokenizer

line="My website: http://www.cartoon.com is not accessible."
pattern = r'^(((([A-Za-z0-9]+){1,63}\.)|(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$'

tokeniser = RegexpTokenizer(pattern)
print(tokeniser.tokenize(line))



What am I doing wrong? Is there a better regex for domain names?

Edit: Special characters must remain separate tokens. In the example above, tokenization must split "website:" into ('website', ':').


You may use

pattern = r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+'



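  • \b - leading word boundary (the protocol must be preceded by a non-word char or the start of the string)
  • (?:http|ftp)s?:// - a protocol: http/https or ftp/ftps, followed by ://
  • \S* - 0+ non-whitespace symbols
  • \w - a word char (letter, digit or _), so the URL token always ends in a word char and trailing punctuation is left out of it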
  • | - or
  • \w+ - 1 or more word chars
  • | - or
  • [^\w\s]+ - 1 or more chars that are neither word chars nor whitespace (punctuation and other symbols).
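
As a rough sketch of how these pieces work together, the snippet below assembles the pattern from the components listed above and runs it with plain `re.findall` on the sentence from the question. The `tokenize` helper and the variable names are made up for illustration. It also checks the pattern from the question: that one is anchored with `^` and `$` and expects every label to be followed by a dot, so it cannot match anything inside a longer sentence.

```python
import re

# Alternation order matters: try a full URL first, then a word, then punctuation.
TOKEN_PATTERN = re.compile(r'\b(?:http|ftp)s?://\S*\w|\w+|[^\w\s]+')

def tokenize(text):
    """Hypothetical helper: return URL, word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

line = "My website: http://www.cartoon.com is not accessible."
tokens = tokenize(line)
print(tokens)
# ['My', 'website', ':', 'http://www.cartoon.com', 'is', 'not', 'accessible', '.']

# The pattern from the question only matches when the WHOLE string is a
# dot-terminated domain, so searching the sentence finds nothing.
question_pattern = (
    r'^(((([A-Za-z0-9]+){1,63}\.)|'
    r'(([A-Za-z0-9]+(\-)+[A-Za-z0-9]+){1,63}\.))+){1,255}$'
)
print(re.search(question_pattern, line))  # None
```

With NLTK, passing the same pattern to `RegexpTokenizer` should give the same tokens, since with the default settings the tokenizer collects the pattern's matches rather than splitting on them. Only non-capturing groups are used here, which keeps the match results as plain strings.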