marcamillion - 1 month ago
Ruby Question

If I have an array of strings (some of which include symbols), how can I strip out the elements whose symbols make them non-words?

I have an array that looks like this:

> uniq_words
=> ["Welcome",
"Occurred",
"John (CPA)",
"{",
"if(",
")",
"//",
"target",
"=",
"}",
"else",
"target.style.display",
"The",
"web",
"site"]


As you can see, some elements of this array are bits of code and contain characters such as "{" and "(".

This is where it gets tricky, though. I would like to strip out the elements that are obviously non-words, so things like "=", "}", "if(" and ")" should be stripped out (as should elements made of any other symbols, like *&^%$ etc.).

But the key is the context.

"John (CPA)" should not be stripped, and neither should "Mr. Smith" or "Johnson & Johnson", etc.

So how do I cleanse uniq_words of those elements? I imagine I would use .select and some regex, but how would all of the pieces fit together?

Edit 1

Per Cary's comment, what I am essentially trying to do is search through all the text on a website for names. However, some names may include titles beside them (like "John Brown (MBA)"). So I want to drop any string that is obviously not a word, and therefore almost certainly not a name. Spaces must be allowed, since names contain them.

I don't need the regex to fully match names, because I know that's almost impossible. I just don't want it to allow obvious non-words (e.g. "//" or "=" or "("), and I don't want it to exclude valid strings like "John Brown (Esq.)".

I hope that clarifies it.

Answer

After your clarification, the best I came up with would be:

input.grep(/\A[\p{Alnum}\s]+(\([\p{Alnum}\s]+\))?\z/)
#⇒ [
#  [0] "Welcome",
#  [1] "Occurred",
#  [2] "John (CPA)",
#  [3] "target",
#  [4] "else",
#  [5] "The",
#  [6] "web",
#  [7] "site"
# ]
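
For anyone who wants to try this directly, here is a minimal self-contained sketch. It assumes the data from the question; the variable names (word_like, kept_by_grep, kept_by_select, rejected) are only illustrative. It also shows the same filter written with select, as suggested in the question.

```ruby
# Sample data copied from the question.
uniq_words = [
  "Welcome", "Occurred", "John (CPA)", "{", "if(", ")", "//", "target",
  "=", "}", "else", "target.style.display", "The", "web", "site"
]

# Letters, digits and whitespace, optionally followed by one
# parenthesized group such as "(CPA)".
word_like = /\A[\p{Alnum}\s]+(\([\p{Alnum}\s]+\))?\z/

# Array#grep keeps the elements for which `pattern === element` is true.
kept_by_grep = uniq_words.grep(word_like)

# The same filter written with select (String#match? needs Ruby 2.4+).
kept_by_select = uniq_words.select { |word| word.match?(word_like) }

# Array#grep_v (Ruby 2.3+) returns the rejected elements instead.
rejected = uniq_words.grep_v(word_like)

p kept_by_grep
p rejected
```

Printing the rejected elements as well makes it easy to check that nothing name-like was dropped by accident.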

Remove the question mark after the parenthesized group to match only names with titles:

input.grep(/\A[\p{Alnum}\s]+(\([\p{Alnum}\s]+\))\z/)
#⇒ ["John (CPA)"]

Both regular expressions use proper Unicode character classes (\p{Alnum}) rather than [a-zA-Z], so they also match names like “Köhl” and “Liña”.
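
One gap worth noting: the pattern above rejects "Mr. Smith", "Johnson & Johnson" and "John Brown (Esq.)" from the question, because "." and "&" are not covered by \p{Alnum}. A possible extension, offered only as a sketch and not as part of the original answer, is to describe a string as a sequence of whitespace-separated tokens. All names below (word, title, token, name_like) are hypothetical, and the rules (a word may end in one period, "&" may stand alone between words, a title is one word in parentheses) are assumptions about what counts as name-like.

```ruby
# Hypothetical building blocks; none of these names come from the answer.
word  = /\p{Alnum}+\.?/        # "John", "Mr.", "K\u00F6hl"
title = /\(\p{Alnum}+\.?\)/    # "(CPA)", "(Esq.)"
token = /#{word}|&|#{title}/   # anything allowed after the first word

# A string is name-like if it starts with a word and every later
# whitespace-separated token is a word, a lone "&", or a title.
name_like = /\A#{word}(?:\s+(?:#{token}))*\z/

samples = [
  "Mr. Smith", "Johnson & Johnson", "John Brown (Esq.)", "John (CPA)",
  "K\u00F6hl", "Welcome",
  "{", "if(", ")", "//", "=", "}", "target.style.display"
]

kept = samples.grep(name_like)
p kept
```

Because a period is only accepted at the end of a word, "target.style.display" is still rejected while "Mr. Smith" passes. Hyphenated names or names with apostrophes are not handled by this sketch and would need the word pattern widened.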