mfuerli - 5 months ago
Bash Question

Regex to add line breaks at every dot in a text, except for defined abbreviations

I am trying to find a regex for a bash shell script on Mac OS X that replaces dots (.) with line breaks (\n) in a large text file.
However, dots in common abbreviations such as tel., etc., Mr., Ms., U.S. and some others should be excluded.

So far I am already using sed for simple replacements (but of course the part that ignores abbreviations is missing):

LC_ALL=C sed -i "" -e "s/.*SEARCH.*/REPLACEMENT/" ascii.txt


Example:

Mr. Brown searches his fox. My tel. nr. can be found online. U.S. is a typical abbreviation for the United States.


The result should be:

Mr. Brown searches his fox.\n
My tel. nr. can be found online.\n
U.S. is a typical abbreviation for the United States.\n

Answer

You could use GNU sed like this:

sed -r 's/\./\n/g; s/(Mr|tel|nr|U|S)\n/\1./g; s/\n */\n/g'

If your sed implementation does not support extended regular expressions, you need to write something like this instead:

sed 's/\./\n/g; s/\(Mr\|tel\|nr\|U\|S\)\n/\1./g; s/\n */\n/g'

If your sed implementation does not support that either, then you need to handle each abbreviation separately, e.g.:

s/Mr\n/Mr./g; s/tel\n/tel./g;

and so on. If your sed implementation cannot handle that either, then it's time to look for another operating system.
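
Note that the commands above replace each sentence-ending dot with the line break, whereas the expected output in the question keeps the dot. The following is a minimal sketch of a variant that keeps it and avoids `\n` in the replacement text, which the sed shipped with OS X does not interpret as a newline; a backslash followed by a literal line break is used instead. The function name `split_sentences` and the abbreviation list are only illustrative assumptions, and `-E` is assumed to be available (it is in both BSD sed and current GNU sed).

```shell
#!/bin/sh
# Sketch: put each sentence on its own line, keeping the dot.
# split_sentences is a hypothetical helper name; extend the
# abbreviation list (Mr|Ms|tel|nr|U|S) as needed.
split_sentences() {
    # 1. add a line break after every dot (backslash + real newline)
    # 2. undo the break after a known abbreviation
    # 3. drop the spaces that now start a line
    # 4. drop the break left after the last dot of the input line
    sed -E -e 's/\./.\
/g' \
        -e 's/(Mr|Ms|tel|nr|U|S)\.\n/\1./g' \
        -e 's/\n +/\
/g' \
        -e 's/\n$//'
}

printf '%s\n' 'Mr. Brown searches his fox. My tel. nr. can be found online. U.S. is a typical abbreviation for the United States.' | split_sentences
# prints:
# Mr. Brown searches his fox.
# My tel. nr. can be found online.
# U.S. is a typical abbreviation for the United States.
```

As with the commands above, the abbreviations are matched without a word boundary, so a sentence ending in a word like "hotel." would be treated as the abbreviation "tel." and not split. A sentence that ends in a listed abbreviation (for example "... in the U.S.") is not split either.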