Casey B. - 1 month ago
Java Question

Regex won't match whitespace character with [\r\n\t\f\s]

This is likely a very simple fix but I can't figure it out!

I'm trying to match (up to) 3 capitalized words in a row given the following text:

Russell Lake West

The match should include all 3 words.

This regex will match the first 2 words but not the third (demo here):

(([A-Z][a-z]+)\s{0,2}([A-Z][a-z]+)?\s{0,2}([A-Z][a-z]+)?)


This regex will match all 3 words, but I had to copy/paste the whitespace between Lake and West for it to work (demo here):

(([A-Z][a-z'-]+)\s{0,2}([A-Z][a-z'-]+)? \s{0,2}([A-Z][a-z'-]+)?)


(The whitespace was pasted just before the second \s{0,2}.)


So I assumed that maybe the whitespace isn't being treated as whitespace, but perhaps a newline character or similar, so I tried this (demo here):

[\r\n\t\f\s]West


But it doesn't recognize any of those characters before West, thus returning no results.

Why can't regex101 or Java recognize this apparent whitespace between Lake and West? What's a reliable way to handle this?

Answer

There are many kinds of space characters. The one in your demo is a non-breaking space (code point 160, U+00A0), which is not part of \s (the whitespace character class): by default, Java's \s matches only [ \t\n\x0B\f\r], and a non-breaking space does not mark a place where text can be split into separate parts such as lines.
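One way to check which character is actually sitting between two words is to print the code points of the string. The sketch below is only an illustration: the class name CodePointInspector and the method describe are made up for this example, and the sample text assumes the character between Lake and West is U+00A0, as described above.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointInspector {
    // Describes each code point of the input as "U+XXXX (decimal) UNICODE NAME".
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp ->
            out.add(String.format("U+%04X (%d) %s", cp, cp, Character.getName(cp))));
        return out;
    }

    public static void main(String[] args) {
        // Assumption: a non-breaking space (U+00A0) separates "Lake" and "West".
        String text = "Russell Lake\u00A0West";
        describe(text).forEach(System.out::println);
    }
}
```

In the printed list, the ordinary space shows up as U+0020 (32) SPACE, while the character before West shows up as U+00A0 (160) NO-BREAK SPACE.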

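To match it you can use the \p{Zs} class (the Unicode "space separator" category).
You can also combine \s and \p{Zs} in a single character class, [\p{Zs}\s], which is written "[\\p{Zs}\\s]" in a Java string literal.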
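As a rough sketch of how this could look in Java, the example below takes the first regex from the question and swaps each \s for the combined class [\p{Zs}\s]. The class name CapitalizedWords, the constants ORIGINAL and FIXED, and the helper firstMatch are invented for illustration, and the sample text assumes a non-breaking space (U+00A0) between Lake and West.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapitalizedWords {
    // The first regex from the question: \s does not match U+00A0 by default.
    static final Pattern ORIGINAL = Pattern.compile(
        "(([A-Z][a-z]+)\\s{0,2}([A-Z][a-z]+)?\\s{0,2}([A-Z][a-z]+)?)");

    // Same regex with \s replaced by [\p{Zs}\s], which also accepts
    // Unicode space separators such as the non-breaking space.
    static final Pattern FIXED = Pattern.compile(
        "(([A-Z][a-z]+)[\\p{Zs}\\s]{0,2}([A-Z][a-z]+)?[\\p{Zs}\\s]{0,2}([A-Z][a-z]+)?)");

    // Returns the first match of the pattern in the text, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumption: "Lake" and "West" are separated by U+00A0, not a plain space.
        String text = "Russell Lake\u00A0West";

        System.out.println("\u00A0".matches("\\s"));       // false
        System.out.println("\u00A0".matches("\\p{Zs}"));   // true

        System.out.println(firstMatch(ORIGINAL, text));    // stops after "Russell Lake"
        System.out.println(firstMatch(FIXED, text));       // matches all three words
    }
}
```

Note that \p{Zs} covers the ordinary space and the non-breaking space but not tabs or line breaks, which is why keeping \s in the class alongside it is useful.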