Andreas S. - 3 months ago
C++ Question

Using RegEx to split up a text into single words in Embarcadero's C++ Builder

I am working on a spellchecker application with Embarcadero's C++ Builder. I split a text into single words using a regular expression. The code below worked fine with RAD Studio XE but behaves differently with RAD Studio Seattle.

The problem appears when words contain non-ASCII characters such as German umlauts (Ä, Ö, Ü) or accented characters (é, ê, à).
"\w" is interpreted as [a-zA-Z_0-9], so these characters are not matched.

First, what is a word in my context?
Possible words consist of:

  • "\r\n"

  • "word-word-word-word ..."

  • "word." or "word-"

  • words with apostrophes: " 'word" "wor'd" "word' "

  • "word"

  • there are two different types of apostrophes: ' and ’

Here's the code:

String text (L"Österreich l'année");
const String sRegex (L"\r\n|(\\w+\\-)+\\w+|\\w+(\\.|\\-)|('|’)?\\w+('|’)?\\w*");
TRegEx regex(sRegex, TRegExOptions());
TMatchCollection regexMatches = regex.Matches(text);
for (int i=0; i<regexMatches.Count; ++i)
{
    TMatch regexMatch = regexMatches.Item[i];
    String word (regexMatch.Value);

    //do stuff with word
}

The desired values for the String word are "Österreich" and "l'année". However, what the RegEx matches is "sterreich", "l'ann" and "e".

My question is: how can I make the pattern match all non-ASCII letters as well?


\p{L} matches any Unicode letter. Try using that instead of \w.

See it here at regex101.

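If you want digits as well (as with \w), put \d in the same character class: [\p{L}\d]. Note that \w also matches the underscore, so add _ to the class if you need it.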
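
As a rough sketch of that substitution, the pattern from the question could be assembled from a single word-character class so that the replacement for \w lives in one place. This uses only standard C++ strings; the names kWordChar and buildWordPattern are made up for illustration, and whether TRegEx in a given RAD Studio release accepts \p{L} is an assumption you would need to verify.

```cpp
#include <string>

// Hypothetical stand-in for \w: any Unicode letter or decimal digit.
// (\w also matches '_'; add it inside the brackets if needed.)
const std::wstring kWordChar = L"[\\p{L}\\d]";

// Rebuilds the question's pattern with every \w replaced by kWordChar.
// The alternatives are kept in the original order.
std::wstring buildWordPattern()
{
    const std::wstring w = kWordChar;
    // Optional apostrophe: ASCII ' or the typographic one (U+2019).
    const std::wstring apos = L"('|\u2019)?";

    std::wstring p;
    p += L"\r\n";                                   // line break as a token
    p += L"|(" + w + L"+\\-)+" + w + L"+";          // word-word-word
    p += L"|" + w + L"+(\\.|\\-)";                  // "word." or "word-"
    p += L"|" + apos + w + L"+" + apos + w + L"*";  // 'word, wor'd, word'
    return p;
}
```

In C++ Builder the result could then presumably be handed to the existing code, for example `TRegEx regex(String(buildWordPattern().c_str()), TRegExOptions());`, assuming the String constructor that takes a `const wchar_t*`.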