Lewis Collins - 9 days ago
Python Question

Python 3 regex remove characters before certain point

I have multiple words stored in a list like this:

31547 4.7072% i
25109 3.7466% u
20275 3.0253% you
10992 1.6401% me
9490 1.4160% do
7681 1.1461% like
6293 0.9390% want
6225 0.9288% my
5459 0.8145% have
5141 0.7671% your


Now I need to cleanse this so that everything before the word (i in the first line) is removed. The word will not always be i, but the format of everything before it will be similar. I have seen other similar questions, but their solutions only worked when the word/string was the same every time.

Thanks in advance for all help and advice. I have tried reading up on regex and doing tutorials, but I find it quite hard to get my head around.

For a similar problem I had, I needed to remove everything inside angle brackets (along with the brackets themselves), for which I used:

Cleanse = re.sub('<.*?>', '', line)


But I'm unsure how to adapt this to remove everything before the word. I should stress that this is the first time I have really had to use regex.

Answer

To process a multiline string, you may use

s = re.sub(r'^\d+[ \t]+\d+\.\d+%[ \t]*', '', s, flags=re.M)
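As a quick check, the multiline substitution can be tried on a few rows of the sample data from the question. This is only a minimal sketch; the variable names `text` and `cleaned` are made up for illustration.

```python
import re

# A few rows of the sample data from the question, as one multiline string.
text = """31547 4.7072% i
25109 3.7466% u
20275 3.0253% you
10992 1.6401% me"""

# re.M makes ^ match at the start of every line,
# not only at the start of the whole string.
cleaned = re.sub(r'^\d+[ \t]+\d+\.\d+%[ \t]*', '', text, flags=re.M)

print(cleaned)
```

Each line of the result should contain only the word, with the line breaks left in place.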

If you process line by line, use

import re

r = re.compile(r'^\d+\s+\d+\.\d+%\s*')  # compile once, before the loop
...
s = r.sub('', s)  # here s holds a single line
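A line-by-line run might look like the sketch below. It assumes the lines still carry their trailing newline (as they would when read from a file), so that is stripped separately; the names `prefix`, `lines` and `words` are illustrative only.

```python
import re

# Compile once and reuse the pattern for every line.
prefix = re.compile(r'^\d+\s+\d+\.\d+%\s*')

# Sample lines from the question, with trailing newlines as read from a file.
lines = ["31547 4.7072% i\n", "7681 1.1461% like\n", "5141 0.7671% your\n"]

# Remove the numeric prefix, then drop the trailing newline.
words = [prefix.sub('', line).rstrip('\n') for line in lines]

print(words)  # ['i', 'like', 'your']
```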


Pattern explanation:

  • ^ - start of the string (or of each line if the re.M flag is passed)
  • \d+ - 1 or more digits
  • \s+ - 1 or more whitespace characters
  • \d+\.\d+ - 1+ digits, a literal dot, 1+ digits
  • % - a literal % symbol
  • \s* - 0+ whitespace characters
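The same breakdown can be written into the pattern itself with the re.VERBOSE flag, which ignores whitespace inside the pattern and allows # comments. This is just one possible way to keep the explanation next to the regex; the names `prefix` and `m` are illustrative.

```python
import re

# Same pattern as above, spread over several lines with comments.
prefix = re.compile(r"""
    ^           # start of the string
    \d+         # 1 or more digits
    \s+         # 1 or more whitespace characters
    \d+\.\d+    # 1+ digits, a literal dot, 1+ digits
    %           # a literal % symbol
    \s*         # 0 or more whitespace characters
""", re.VERBOSE)

m = prefix.match("10992 1.6401% me")
print(repr(m.group()))                     # '10992 1.6401% '
print(prefix.sub('', "10992 1.6401% me"))  # me
```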

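Note that in the "multiline" version, [ \t] is preferable to \s because \s also matches line breaks, so a match could run past the end of a line; [ \t] matches only horizontal ASCII whitespace (space and tab). The same can be done with the more sophisticated [^\S\r\n] pattern (any whitespace except CR and LF), which is Unicode-aware by default in Python 3.x.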
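The difference shows up when a line has nothing after the % sign. The input below is a hypothetical variation of the question's data (the second row has no word), and the non-breaking-space line is likewise invented to illustrate the Unicode point.

```python
import re

# Hypothetical input: the second line has no word after the percentage.
text = "31547 4.7072% i\n25109 3.7466%\n20275 3.0253% you"

with_s = re.sub(r'^\d+\s+\d+\.\d+%\s*', '', text, flags=re.M)
with_blank = re.sub(r'^\d+[ \t]+\d+\.\d+%[ \t]*', '', text, flags=re.M)
with_negated = re.sub(r'^\d+[^\S\r\n]+\d+\.\d+%[^\S\r\n]*', '', text, flags=re.M)

print(repr(with_s))        # 'i\nyou'    (\s* swallowed the line break)
print(repr(with_blank))    # 'i\n\nyou'  (the empty line is preserved)
print(repr(with_negated))  # 'i\n\nyou'

# Invented line that uses non-breaking spaces (U+00A0) as separators.
nbsp_line = "10992\u00a01.6401%\u00a0me"
nbsp_negated = re.sub(r'^\d+[^\S\r\n]+\d+\.\d+%[^\S\r\n]*', '', nbsp_line)
nbsp_blank = re.sub(r'^\d+[ \t]+\d+\.\d+%[ \t]*', '', nbsp_line)

print(nbsp_negated)             # me
print(nbsp_blank == nbsp_line)  # True: [ \t] does not match U+00A0
```

With \s the three input lines collapse into two, while both horizontal-only variants keep the line count intact.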
