tumbleweed tumbleweed - 1 month ago 7
Python Question

Which is the most efficent of matching and replacing with an identifier every three new lines?

I am working with some .txt files that doesn't have structure (they are messy), they represent a number of pages. In order to give them some structure I would like to identify the number of pages since the file itself doesn't have them. This can be done by replacing every three newlines with some annotation like:

\n
page: N
\n


Where
N
is the number. This is how my files look like, and I also tried with a simple
replace
. However, this function confuses and does not give me the expected format which would be something like this. Any idea of how to replace the spaces with some kind of identifier, just to try to parse them and getting the position of some information (page)?.

I also tried this:

import re

replaced = re.sub('\b(\s+\t+)\b', '\n\n\n', text)
print (replaced)

Answer

If the format is as regular as you state in your problem description:

Replace every occurrence of three newlines \n with page: N

You wouldn't have to use the re module. Something as simple as the following would do the trick:

>>> s='aaaaaaaaaaaaaaaaa\n\n\nbbbbbbbbbbbbbbbbbbbbbbb\n\n\nccccccccccccccccccccccc'
>>> pages = s.split('\n\n\n')
>>> ''.join(page + '\n\tpage: {}\n'.format(i + 1) for i, page in enumerate(pages))
'aaaaaaaaaaaaaaaaa\n\tpage: 1\nbbbbbbbbbbbbbbbbbbbbbbb\n\tpage: 2\nccccccccccccccccccccccc\n\tpage: 3\n'

I suspect, though, that your format is less regular than that, but you'll have to include more details before I can give a good answer for that.

If you want to split with messy whitespace (which I'll define as at least three newlines with any other whitespace mixed in), you can replace s.split('\n\n\n') with:

re.split(r'(?:\n\s*?){3,}', s)