marcamillion - 5 months ago
Ruby Question

If I have an array of strings (which include symbols), how can I strip out the elements whose symbols make up non-words?

I have an array that looks like this:

> uniq_words
=> ["Welcome",
"John (CPA)",

As you can see, there are some elements of this array that are bits of code and have

This is where it gets tricky, though: what I would like to do is strip out the elements that are obviously non-words, so things like
should be stripped out (or any other symbols like

But the key is the context.

John (CPA) should not be stripped; neither should Mr. Smith, Johnson & Johnson, etc.

So how do I cleanse
of those elements? I imagine I would likely use
and some regex, but how would all of the pieces look together?

Edit 1

Per Cary's comment, what I am essentially trying to do is search through all the text on a website for names. However, some names may include titles beside them (like John Brown (MBA)). So I don't want any string that is obviously not a word and almost certainly not a name. Spaces are a must for obvious reasons.

I don't need the regex to fully match names, because I know that's almost impossible; I just don't want it to allow obviously non-words (e.g.
, without excluding valid strings like John Brown (Esq.)

I hope that clarifies it.


After your clarification, the best I came up with would be:

#⇒ [
#  [0] "Welcome",
#  [1] "Occurred",
#  [2] "John (CPA)",
#  [3] "target",
#  [4] "else",
#  [5] "The",
#  [6] "web",
#  [7] "site"
# ]

Remove the trailing question mark to search for names with titles only:

#⇒ ["John (CPA)"]

The regular expressions use proper Unicode character classes, so they also match names like “Köhl” and “Liña”.
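For illustration, here is a minimal sketch of this kind of filter. The patterns, the constant names (NAME_PART, TITLE, WORD_OR_NAME, TITLED_NAME) and the SAMPLE array are assumptions made up for this example, not necessarily the exact expressions used above. The only difference between the two final patterns is the trailing question mark that makes the title optional.

```ruby
# Hypothetical patterns for illustration only.
# \p{L} matches a letter in any script, so accented names pass as well.
NAME_PART = /\p{L}+(?: \p{L}+)*/   # one or more words separated by single spaces
TITLE     = / \(\p{L}+\.?\)/       # a parenthesised title such as " (CPA)" or " (Esq.)"

WORD_OR_NAME = /\A#{NAME_PART}#{TITLE}?\z/  # title is optional (note the trailing "?")
TITLED_NAME  = /\A#{NAME_PART}#{TITLE}\z/   # title is required

# Made-up sample data; "\u00F6" is "ö" and "\u00F1" is "ñ".
SAMPLE = [
  "Welcome", "John (CPA)", "John Brown (Esq.)",
  "K\u00F6hl", "Li\u00F1a", '});', '$(function', "else"
].freeze

# Array#grep keeps the elements that match the pattern.
p SAMPLE.grep(WORD_OR_NAME)  # drops '});' and '$(function', keeps the rest
p SAMPLE.grep(TITLED_NAME)   #=> ["John (CPA)", "John Brown (Esq.)"]
```

As written, this sketch would still reject strings such as Mr. Smith or Johnson & Johnson, because a period after a word and an ampersand are not covered by NAME_PART; the character classes would have to be widened for those cases.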