Россарх - 1 month ago
Python Question

Python: regexp sentence segmentation

I have a simple tokenizer that works well on the test files I need to demonstrate it with:

import re, sys
for line in sys.stdin:
    # A raw string keeps the regex backslashes intact
    for token in re.findall(r"(\w+\.\w+\.[\w.]*|\w+[-.]\w+|[-]+|'s|[,;:.!?\"%']|\w+)", line.strip()):
        print(token)


Text like "This house is small. That house is big." is correctly turned into:

This
house
is
small
.
That
house
is
big
.


However, I also need to insert a blank line between sentences:

···
small
.

That
···


So I've written another loop:

for token in re.sub("([\"\.!?])\s([\"`]+|[A-Z]+\w*)", "\\1\n\n\\2", line):


with a regexp that catches almost all sentence breaks in the test texts I need to use, but I'm having trouble actually connecting it to the code. Putting it inside the first for loop, which feels most logical to me, breaks the output completely. I also tried some if clauses, but that doesn't work either.

Answer

Here's a simpler approach that works for the example you gave. If the more complex regex is needed, it can be added back in:

import re
mystr = "This house is small. That house is big."
for token in re.findall(r"([\w]+|[^\s])", mystr):
    print(token)
    if re.match(r"[.!?]", token):
        print()
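
As a rough sketch of what "adding the more complex regex back in" might look like, the same idea can be wrapped in a function that uses the token pattern from the question. The names TOKEN_RE, SENTENCE_END and tokenize_with_breaks are made up for this illustration, and the set of sentence-ending characters is an assumption:

```python
import re

# Token pattern from the question, unchanged apart from the raw-string prefix.
TOKEN_RE = re.compile(r"(\w+\.\w+\.[\w.]*|\w+[-.]\w+|[-]+|'s|[,;:.!?\"%']|\w+)")

# Assumption: only these three characters end a sentence.
SENTENCE_END = {".", "!", "?"}


def tokenize_with_breaks(text):
    """Return the tokens of text, with "" marking each sentence boundary."""
    out = []
    for token in TOKEN_RE.findall(text.strip()):
        out.append(token)
        # Only a token that is exactly one terminator triggers a break,
        # so a dotted abbreviation matched as one token does not.
        if token in SENTENCE_END:
            out.append("")
    return out


if __name__ == "__main__":
    for item in tokenize_with_breaks("This house is small. That house is big."):
        print(item)
```

Printing each item on its own line gives the blank line between sentences, plus one trailing blank line after the last sentence.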

I'm not quite clear how you expect to handle punctuation within sentences, and which punctuation terminates a sentence, so it would likely have to be modified a little.
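
If the sentence-break regex from the question already encodes which punctuation should end a sentence, one possible way to connect it to the tokenizer is sketched below. Note that re.sub returns a string, and looping over a string yields single characters, which is likely why the loop in the question breaks the output. The sketch instead splits the substituted text on the inserted blank line and tokenizes each piece. The function names are hypothetical, and both patterns are copied from the question with only a raw-string prefix added:

```python
import re

# Both patterns are taken from the question; only the r"" prefix is new.
TOKEN_RE = re.compile(r"(\w+\.\w+\.[\w.]*|\w+[-.]\w+|[-]+|'s|[,;:.!?\"%']|\w+)")
BREAK_RE = re.compile(r"([\".!?])\s([\"`]+|[A-Z]+\w*)")


def split_sentences(line):
    """Mark sentence breaks with a blank line, then split on that marker."""
    marked = BREAK_RE.sub(r"\1\n\n\2", line.strip())
    return marked.split("\n\n")


def tokens_by_sentence(line):
    """Return one token list per detected sentence."""
    return [TOKEN_RE.findall(sentence) for sentence in split_sentences(line)]


if __name__ == "__main__":
    text = "This house is small. That house is big."
    for i, sentence in enumerate(tokens_by_sentence(text)):
        if i:
            print()  # blank line between sentences, none after the last
        print("\n".join(sentence))
```

This keeps sentence detection and tokenization as two separate steps, so either regex can be adjusted without touching the other.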
