randy newfield - 1 year ago
Ruby Question

Figuring out if an apostrophe is a quote or a contraction

I am looking for a way to go through a sentence to see if an apostrophe is a quote or a contraction so I can remove punctuation from the string, and then normalize all words.

My test sentence is:

don't frazzel the horses. 'she said wow'.

In my attempts I have split the sentence into parts, tokenizing on words and non-words like so:

contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]

# reject (not reject!) is used because reject! returns nil when nothing is removed
sentence = "don't frazzel the horses. 'she said wow'.".split(/(\w+)|(\W+)/).reject { |word| word.empty? }

This returns
["don", "'", "t", " ", "frazzel", " ", "the", " ", "horses", ". '", "she", " ", "said", " ", "wow", "'."]

Next, I want to iterate over sentence looking for apostrophes. When one is found, I want to check whether the next element is included in the contractionEndings array. If it is, I want to join the prefix, the apostrophe, and the suffix into one element; otherwise, remove the apostrophe.

In this example, don, ', and t would be joined into don't as a single element, but . ' would be removed.
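
The merge step described above could be sketched roughly as follows. This is only one possible reading of the plan, not a tested solution: the names tokenize and merge_contractions are hypothetical helpers introduced for illustration, and the sketch assumes that a contraction always appears as exactly three tokens (a word, a lone ', and a known ending).

```ruby
# Endings taken from the question's contractionEndings list.
CONTRACTION_ENDINGS = %w[d l ll m re s t ve].freeze

# Hypothetical helper: split into word and non-word tokens, as in the question.
def tokenize(sentence)
  sentence.split(/(\w+)|(\W+)/).reject(&:empty?)
end

# Hypothetical helper: join word + "'" + known ending into one token and
# drop every other non-word token (spaces, quotes, full stops).
def merge_contractions(tokens)
  result = []
  i = 0
  while i < tokens.length
    token = tokens[i]
    if token.match?(/\A\w+\z/)
      if tokens[i + 1] == "'" && CONTRACTION_ENDINGS.include?(tokens[i + 2])
        result << "#{token}'#{tokens[i + 2]}"
        i += 3 # consumed prefix, apostrophe, and suffix
      else
        result << token
        i += 1
      end
    else
      i += 1 # punctuation or whitespace: skip it
    end
  end
  result
end

words = merge_contractions(tokenize("don't frazzel the horses. 'she said wow'."))
puts words.join(" ")
```

Because the quote tokens in the test sentence are ". '" and "'." rather than a lone ', they never satisfy the contraction check and are simply dropped.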

Afterwards I can run a regex to remove other punctuation from the sentence so that I can pass it into my stemmer to normalize the input.

The final output I am after is
don't frazzel the horses she said wow
in which all punctuation will be removed besides apostrophes for contractions.

If anyone has any suggestions to make this work, or has a better idea of how to solve this problem, I would like to know.

Overall, I want to remove all punctuation from the sentence except for the apostrophes in contractions.



How about this?

irb:0> s = "don't frazzel the horses. 'she said wow'."
irb:0> contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]
irb:0> s.scan(/\w+(?:'(?:#{contractionEndings.join('|')})\b)?/)
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"]

The regex scans for a run of "word" characters, optionally followed (the ?) by an apostrophe plus a contraction ending. Ruby expressions can be interpolated into a regex literal with #{...}, just as in double-quoted strings, so the endings are joined with the regex alternation operator |. The \b after the alternation requires the ending to stop at a word boundary, so that we'll matches the ll ending rather than stopping at the shorter l. The last thing is to mark the groups (the sections in parentheses) as non-capturing with ?: so that scan returns the whole match on each iteration instead of arrays of captures, many of them nil.
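
To get from the scanned array to the single string the question asks for, the scan can be wrapped in a small method and the result joined with spaces. The following is a minimal sketch: strip_punctuation, ENDINGS_PATTERN, and WORD_PATTERN are hypothetical names, and Regexp.union is swapped in for the manual join('|') because it also escapes any special characters in the endings. The trailing \b guards against a short ending such as l matching before ll.

```ruby
# Endings taken from the question's contractionEndings list.
CONTRACTION_ENDINGS = %w[d l ll m re s t ve].freeze

# Regexp.union builds an alternation of the (escaped) strings. Interpolating
# a Regexp into a regex literal wraps it in its own non-capturing group.
ENDINGS_PATTERN = Regexp.union(CONTRACTION_ENDINGS)
WORD_PATTERN = /\w+(?:'#{ENDINGS_PATTERN}\b)?/

# Hypothetical helper: keep words and contractions, drop all other punctuation.
def strip_punctuation(sentence)
  sentence.scan(WORD_PATTERN).join(" ")
end

puts strip_punctuation("don't frazzel the horses. 'she said wow'.")
puts strip_punctuation("we'll see")
```

Note that \w also matches digits and underscores, so this sketch treats them as word characters; whether that suits the stemmer is an open assumption.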