pandaman - 6 months ago
Ruby Question

Get sentences containing a keyword from paragraph

I need to extract sentences that contain the word "island" or "Island" from a paragraph. Each sentence begins with a capital letter and ends with a period.

Paragraph as a string:

"The islands were settled from the second century AD by a series of local empires. In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826. During World War II, Singapore was occupied by Japan. It gained independence from Britain in 1963, by uniting with other former British territories to form Malaysia, but was expelled two years later over ideological differences. After early years of turbulence, and despite lacking natural resources and a hinterland, the nation developed rapidly as an Asian Tiger economy, based on external trade and its human capital."
(Source: https://en.wikipedia.org/wiki/Singapore)

Ideal result as elements in an array:


  • The islands were settled from the second century AD by a series of local empires.

  • In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of the East India Company; after the company collapsed, the islands were ceded to Britain and became part of its Straits Settlements in 1826.



I found examples of how to do this in other languages such as Java (Regex to find sentence containing specific word (java) from paragraph), but the same regex didn't work in Ruby.

Is this possible in Ruby?

Answer

I suggest using two regular expressions: one to break the string into sentences, and another to select the sentences containing the word "island" or "islands", with the first letter optionally capitalized. Here str holds the paragraph.

str.split(/(?<=\.)\s+/).select { |s| s =~ /\b[iI]slands?\b/ }
  #=> ["The islands were settled from the second century AD by a series of local empires.",
  #    "In 1819, Sir Stamford Raffles founded modern Singapore as a trading post of
  #     the East India Company; after the company collapsed, the islands were ceded to
  #     Britain and became part of its Straits Settlements in 1826."] *
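  • /(?<=\.)\s+/ matches one or more whitespace characters that are preceded by a period. The period is checked by a positive lookbehind, so it is not consumed by the split and stays at the end of its sentence.
  • /\b[iI]slands?\b/ matches the strings "island", "Island", "islands" and "Islands", surrounded by word boundaries (to avoid matching, for example, "islander").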

* I've added two line breaks here to make it more readable.
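
For anyone who wants to try this without pasting the whole paragraph, here is a minimal self-contained sketch of the same two-regex approach. The constant names, the island_sentences helper and the short sample string are made up for illustration and are not part of the answer above; the call to strip is an added assumption to drop any leading or trailing whitespace around the paragraph.

```ruby
# Whitespace that follows a period; the lookbehind keeps the period in the sentence.
SENTENCE_BREAK = /(?<=\.)\s+/

# "island", "Island", "islands" or "Islands" as a whole word.
ISLAND_WORD = /\b[iI]slands?\b/

# Hypothetical helper: returns the sentences of text that mention the keyword.
def island_sentences(text)
  text.strip.split(SENTENCE_BREAK).select { |sentence| sentence =~ ISLAND_WORD }
end

# Made-up sample text, chosen to exercise the word-boundary check.
sample = "The islands were settled early. An islander fished nearby. The Island is small."

p island_sentences(sample)
#=> ["The islands were settled early.", "The Island is small."]
```

The middle sentence is skipped because "islander" has no word boundary after "island". Note that splitting on a period would also break a sentence after an abbreviation such as "Dr." followed by a space, so this relies on the question's statement that every sentence ends with a period.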