user89239213892389 user89239213892389 - 6 months ago 19
Python Question

Finding the repeating sub string in a big string

Take for example the following strings

0.714285714285714285714285714285714285714285
0.111111111111111111111111111111111111111111
0.166666666666666666666666666666666666666666


I want to find the sub string that is repeating repetition for each.

714285
1
6


How can I do this in python. Using regex is okay, I tried the following:

import re

testString = "0.714285714285714285714285714285714285714285"
print(re.search(r"(.+)\1", testString).group(1))


This gives me the (wrong) output:

714285714285714285


It should be
7814285


How do I fix this? Is there way to improve my regex or is regex the wrong tool for this job? Maybe python has an awesome built in for this? Is there anyway to do use this with or without regex?

EDIT Before posting an answer check with the test case

0.0022271714922048997772828507795100222717149220489977728285077951002227171492204899777282850779510022


It should return
00222717149220489977728285077951

Answer

You can give a try to this pattern:

(?=(\d+)\1+(.*))(\d+?)\3+\2$

demo

or to obtain the substring as whole match (group 0):

(?=(\d+)\1+(.*))(\d+?)(?=\3+\2$)

What does exactly the pattern?

It returns, for a position in the string, the smallest repeated substring that spans the larger part of the string.

How does it work?

In a lookahead is described the largest repeated substring with a greedy quantifier (i.e. (\d+)), followed by its repetitions \1+, followed by the end of the string captured in group 2.

Then, once the lookahead closed, (\d+?)\3+ searches this time the smallest repeated substring with a non-greedy quantifier but with a condition: after the repetitions, the end of the string must be the same than the one captured in the lookahead.

This ensures that the substring in group 3 can't be sliced into a smaller repeated substring.

Results

The searched substring is in the group 3.

If you use the pattern as it (i.e. non-anchored), the first repeated substring on the left is returned.

Obviously, if you only want a result that starts after the dot you need to anchor the pattern with it:

\.(?=(\d+)\1+(.*))(\d+?)\3+\2$ # immediately after the dot

or

\..*?(?=(\d+)\1+(.*))(\d+?)\3+\2$ # the first after the dot

If you want to research repeated substrings for each positions in the string (for example to find the largest whatever the starting position), you need to enclose all the second part in a lookahead too and to use re.findall:

(?=(\d+)\1+(.*))(?=(\d+?)\3+\2$)

(Then feel free to sort the result list, if you want to obtain the largest string whatever the starting position)