gangi gangi - 10 months ago 55
Python Question

Regex to find continuous characters in the word and remove the word

I want to find the words in a string in which a particular character repeats consecutively, or which contain only digits, and remove those words. For example,

All aaaaaab the best 8965
US issssss is 123 good
qqqq qwerty 1 poiks
lkjh ggggqwe 1234 aqwe iphone5224s

I want to check two conditions: whether a character repeats more than 3 times in a row within a word, and whether a word contains only digits. A word should be removed only if it consists solely of digits or contains a character that occurs more than 3 times consecutively.

The following should be the output:

All the best
US is good
qwerty poiks
lkjh aqwe iphone5224s

These are my attempts:

re.sub(r'\w[0-9]\w*', '', df[i])
for numbers, but this does not remove single-digit numbers. For the repeated characters, I tried
re.sub(r'\w[a-z A-Z]+[a-z A-Z]+[a-z A-Z]+[a-z A-Z]\w*', '', df[i])
but this removes every word instead of only the words with repeated letters.

Can anybody help me in solving these problems?


I would suggest

\s*\b(?=[a-zA-Z\d]*([a-zA-Z\d])\1{3}|\d+\b)[a-zA-Z\d]+


Only alphanumeric words are matched with this pattern:

  • \s* - zero or more whitespaces
  • \b - word boundary
  • (?=[a-zA-Z\d]*([a-zA-Z\d])\1{3}|\d+\b) - there must be at least 4 repeated consecutive letters or digits in the word OR the whole word must consist of only digits
  • [a-zA-Z\d]+ - a word with 1+ letters or digits.
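To see which words the lookahead targets, the word-level part of the pattern (without the leading \s*) can be tried against single tokens. This is only a small sketch; the helper name is_removable is made up for illustration and is not part of the answer's code.

```python
import re

# Word-level part of the pattern described above, without the leading \s*.
# re.IGNORECASE lets [a-z] cover upper-case letters as well.
WORD = re.compile(r'\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)

def is_removable(word):
    """Return True if the whole word is matched, i.e. would be removed."""
    return WORD.fullmatch(word) is not None

print(is_removable("aaaaaab"))      # True: 'a' occurs 4+ times in a row
print(is_removable("8965"))         # True: digits only
print(is_removable("1"))            # True: a single digit is still digits only
print(is_removable("iphone5224s"))  # False: mixed word, no run of 4
print(is_removable("qwerty"))       # False
```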

Python demo:

import re
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n")                   # Split to test lines individually
print([p.sub("", x).strip() for x in strs])
# => ['df', 'All the best', 'US is good', 'qwerty poiks', 'lkjh aqwe iphone5224s']

Note that strip() removes the whitespace left over at the start (and end) of the string, for example when the first word of a line is removed.
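If the leftover whitespace is a concern, a different technique is to split each line into tokens and filter them, which sidesteps the whitespace handling entirely. The sketch below is not the regex substitution above: it assumes words are separated by whitespace only, and it drops a whole token (including any attached punctuation) rather than just its alphanumeric part. The names clean, REPEATED and DIGITS_ONLY are illustrative.

```python
import re

# The same two conditions as the pattern above, checked one token at a time.
REPEATED = re.compile(r'([a-z\d])\1{3}', re.IGNORECASE)  # same char 4+ times in a row
DIGITS_ONLY = re.compile(r'\d+')                         # used with fullmatch below

def clean(line):
    """Drop digit-only tokens and tokens with a run of 4+ identical characters."""
    kept = [w for w in line.split()
            if not (DIGITS_ONLY.fullmatch(w) or REPEATED.search(w))]
    return " ".join(kept)  # join also normalises the spacing between kept words

print(clean("US issssss is 123 good "))  # US is good
print(clean("qqqq qwerty 1 poiks"))      # qwerty poiks
```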

A similar solution in R with a TRE regex:

x <- c("df", "All aaaaaab the best 8965", "US issssss is 123 good ", "qqqq qwerty 1 poiks", "lkjh ggggqwe 1234 aqwe iphone5224s")
p <- " *\\b(?:[[:alnum:]]*([[:alnum:]])\\1{3}[[:alnum:]]*|[0-9]+)\\b"
gsub(p, "", x)


Pattern details:

  • " *" (a space followed by *) - 0+ spaces
  • \b - a leading word boundary
  • (?:[[:alnum:]]*([[:alnum:]])\1{3}[[:alnum:]]*|[0-9]+) - either of the 2 alternatives:
    • [[:alnum:]]*([[:alnum:]])\1{3}[[:alnum:]]* - 0+ alphanumerics, followed by one alphanumeric char repeated 4 times in a row, followed by 0+ alphanumerics
    • | - or
    • [0-9]+ - 1 or more digits
  • \b - a trailing word boundary
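The same consuming (non-lookahead) alternation can also be written for Python's re module. The sketch below is an assumption-laden port rather than the answer's own code: Python's re has no POSIX classes, so [[:alnum:]] is replaced with the ASCII-only class [a-zA-Z0-9].

```python
import re

# Port of the R pattern above; [a-zA-Z0-9] stands in for [[:alnum:]] (ASCII only).
p = re.compile(r' *\b(?:[a-zA-Z0-9]*([a-zA-Z0-9])\1{3}[a-zA-Z0-9]*|[0-9]+)\b')

lines = ["All aaaaaab the best 8965",
         "US issssss is 123 good ",
         "qqqq qwerty 1 poiks",
         "lkjh ggggqwe 1234 aqwe iphone5224s"]

cleaned = [p.sub("", x).strip() for x in lines]
print(cleaned)
# => ['All the best', 'US is good', 'qwerty poiks', 'lkjh aqwe iphone5224s']
```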


To also remove 1-letter words, you may use

  1. R (add [[:alpha:]]| to the alternation group): \s*\b(?:[[:alpha:]]|[[:alnum:]]*([[:alnum:]])\1{3}[[:alnum:]]*|[0-9]+)\b
  2. Python lookaround-based regex (add [a-zA-Z]\b| to the lookahead group): \s*\b(?=[a-zA-Z]\b|\d+\b|[a-zA-Z\d]*([a-zA-Z\d])\1{3})[a-zA-Z\d]+
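A short sketch of the Python variant in use. The sample sentence is made up for illustration, and the pattern is written with a leading \s* and re.IGNORECASE, as in the Python demo earlier in this answer.

```python
import re

# Lookahead alternatives: a single letter, a digit-only word,
# or a word containing the same char 4+ times in a row.
p1 = re.compile(r'\s*\b(?=[a-z]\b|\d+\b|[a-z\d]*([a-z\d])\1{3})[a-z\d]+',
                re.IGNORECASE)

sample = "this is a test x 42 aaaab ok"   # hypothetical input
result = p1.sub("", sample).strip()
print(result)
# => this is test ok
```

Two-letter words such as "is" and "ok" are kept, because [a-z]\b only succeeds when a word boundary directly follows the first letter.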