Andreas S. - 3 months ago
C++ Question

Using RegEx to split up a text into single words in Embarcadero's C++ Builder

I am working on a spellchecker application with Embarcadero's C++ Builder. I split a text into single words using a regular expression. The code below worked fine with RAD Studio XE but behaves differently with RAD Studio Seattle.

The problem appears when words contain non-ASCII letters such as German umlauts (Ä, Ö, Ü) or accented characters (é, ê, à).
In Seattle, "\w" is interpreted as [a-zA-Z_0-9], so these letters are not treated as word characters.

First, what is a word in my context?
Possible words consist of:

  • "\r\n"
  • "word-word-word-word ..."
  • "word." or "word-"
  • words with apostrophes: " 'word" "wor'd" "word' "
  • "word"

Note that there are two different types of apostrophes: ' and ’

Here's the code:

#include <System.RegularExpressions.hpp>  // TRegEx, TMatchCollection, TMatch

// Sample input containing an umlaut, an apostrophe and an accented letter
String text (L"Österreich l'année");
const String sRegex (L"\r\n|(\\w+\\-)+\\w+|\\w+(\\.|\\-)|('|’)?\\w+('|’)?\\w*");
TRegEx regex(sRegex, TRegExOptions());
TMatchCollection regexMatches = regex.Matches(text);
for (int i=0; i<regexMatches.Count; ++i)
{
    TMatch regexMatch = regexMatches.Item[i];
    String word (regexMatch.Value);

    //do stuff with word
}


The desired values for the String word are "Österreich" and "l'année". However, what the RegEx matches is "sterreich", "l'ann" and "e".
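
The symptom can be reproduced outside C++ Builder with a small sketch in standard C++. It is not TRegEx and not the regular expression above: splitWords, isAsciiWordChar and isExtendedWordChar are hypothetical helpers, and the tokenizer is deliberately simplified (it ignores the "\r\n" and trailing "." cases). It only illustrates how an ASCII-only notion of a word character cuts words at Ö and é.

```cpp
#include <string>
#include <vector>

// ASCII-only word characters, i.e. [a-zA-Z_0-9] as described in the question.
bool isAsciiWordChar(wchar_t c) {
    return (c >= L'a' && c <= L'z') || (c >= L'A' && c <= L'Z') ||
           (c >= L'0' && c <= L'9') || c == L'_';
}

// Assumption for this sketch: also accept U+00C0..U+024F (Latin-1 Supplement
// letters and Latin Extended-A/B), excluding the multiplication sign U+00D7
// and the division sign U+00F7. This is much narrower than "any Unicode letter".
bool isExtendedWordChar(wchar_t c) {
    if (isAsciiWordChar(c)) return true;
    return c >= 0x00C0 && c <= 0x024F && c != 0x00D7 && c != 0x00F7;
}

// Characters that may sit inside a word: both apostrophe types and the hyphen.
bool isJoiner(wchar_t c) {
    return c == L'\'' || c == L'\u2019' || c == L'-';
}

// Split text into maximal runs of word characters and joiners.
// Runs that contain no word character at all are dropped.
std::vector<std::wstring> splitWords(const std::wstring& text,
                                     bool (*isWord)(wchar_t)) {
    std::vector<std::wstring> words;
    std::wstring current;
    bool hasWordChar = false;

    auto flush = [&]() {
        if (hasWordChar) words.push_back(current);
        current.clear();
        hasWordChar = false;
    };

    for (wchar_t c : text) {
        if (isWord(c)) {
            current += c;
            hasWordChar = true;
        } else if (isJoiner(c)) {
            current += c;
        } else {
            flush();  // any other character ends the current word
        }
    }
    flush();
    return words;
}
```

Calling splitWords(L"\u00D6sterreich l'ann\u00E9e", isAsciiWordChar) yields "sterreich", "l'ann" and "e", the same fragments as reported above, while passing isExtendedWordChar yields "Österreich" and "l'année".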

My question is: how can I make the pattern match these non-ASCII letters as well?

Answer

\p{L} matches any Unicode letter. Try using that instead of \w.

See it here at regex101.

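If you want digits as well (as with \w), put \d into a character class together with it: [\p{L}\d]. Since \w also matches the underscore, add _ to the class if underscores should count as word characters too.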
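
Applied to the pattern from the question, the substitution might look like the following sketch in standard C++. The helper names replaceAll and unicodeWordPattern are hypothetical, and the class [\p{L}\d_] is one possible stand-in for \w (letters, digits, underscore). The naive text replacement is only safe here because the pattern contains no escaped backslash followed by a w.

```cpp
#include <cstddef>
#include <string>

// Replace every occurrence of `from` in `s` with `to`.
std::wstring replaceAll(std::wstring s, const std::wstring& from,
                        const std::wstring& to) {
    if (from.empty()) return s;
    std::size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::wstring::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return s;
}

// The pattern from the question as a wide string literal.
// \u2019 is the typographic apostrophe (’).
const std::wstring kOriginalPattern =
    L"\r\n|(\\w+\\-)+\\w+|\\w+(\\.|\\-)|('|\u2019)?\\w+('|\u2019)?\\w*";

// Same pattern with every \w swapped for an explicit class of
// Unicode letters, digits and the underscore.
std::wstring unicodeWordPattern() {
    return replaceAll(kOriginalPattern, L"\\w", L"[\\p{L}\\d_]");
}
```

In C++ Builder the result could presumably be passed on as String(unicodeWordPattern().c_str()) when constructing the TRegEx; that conversion is an assumption and has not been checked against RAD Studio Seattle.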
