I have a simple tokenizer that works well enough for the test files I need to run it on:
    import re, sys

    for line in sys.stdin:
        # emit one token per line
        for token in re.findall(r"\w+\.\w+\.[\w.]*|\w+[-.]\w+|[-]+|'s|[,;:.!?\"%']|\w+", line.strip()):
            print(token)
        # insert a blank line at sentence boundaries, then print each piece
        for chunk in re.sub(r"([\".!?])\s([\"`]+|[A-Z]+\w*)", "\\1\n\n\\2", line).splitlines():
            print(chunk)
Here's a simpler approach that works for the example you gave. If the more complex regex is needed, it can be added back in:
    import re

    mystr = "This house is small. That house is big."

    for token in re.findall(r"[\w]+|[^\s]", mystr):
        print(token)
        # a blank line marks the end of a sentence
        if re.match(r"[.!?]", token):
            print()
I'm not quite clear on how you expect to handle punctuation within sentences, or which punctuation should terminate a sentence, so this would likely need some small modifications.
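For instance, if sentence-internal punctuation such as commas and semicolons should stay as ordinary tokens while only `.`, `!`, and `?` end a sentence, you can make the terminator set explicit. A minimal sketch, assuming that terminator set (adjust it to whatever your data actually requires):

```python
import re

mystr = "Yes, this house is small; that house, however, is big!"

# Assumed set of sentence-ending punctuation; commas/semicolons fall through.
SENTENCE_END = re.compile(r"[.!?]$")

for token in re.findall(r"\w+|[^\s]", mystr):
    print(token)
    if SENTENCE_END.match(token):
        print()  # blank line only after a true terminator
```

Here `,` and `;` are still emitted as tokens but never trigger a sentence break; only the final `!` does.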