basil basil - 2 months ago 9
Perl Question

Using the command line and regex to determine words that start sentences

I have the text:

This is a test. This is only a test! If there were an emergency, then Information would be provided for you.


I want to be able to determine which words start sentences. What I have now is:

$ cat <FILE> | perl -pe 's/[\s.?!]/\n/g;'


This just replaces every whitespace character and every ., ? or ! with a newline, giving me:

This
is
a
test

This
is
only
a
test

If
there
were
an
emergency,
then
Information
would
be
provided
for
you


From here I could somehow extract the words that have either nothing above them (start of file) or a blank line above them, but I am unsure of exactly how to do this.

Answer

You can use this GNU grep command to extract the first word after each period, ! or ?:

grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]*' file

This
This
If

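I must caution, though, that you may get false positives for abbreviations such as "Mr. Smith", where the period does not end a sentence and "Smith" would be reported as a sentence start.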

Regex breakdown:

  • (?:^|[.?!]) - match the start of the line or a ., ! or ?
  • \s* - match 0 or more whitespace characters
  • \K - reset the start of the match, so everything matched so far is left out of the output
  • [A-Z][a-z]* - match a word starting with an uppercase letter