Chris Cirefice - 1 month ago
Javascript Question

Concrete Javascript Regex for Accented Characters (Diacritics)

I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question:

How can JavaScript match for accented characters (those with diacritical marks)?

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently that is a bit more difficult in JavaScript than in other languages/platforms.

This was my original version, until I wanted to add diacritic support:


Currently I'm debating one of three methods to add support, all of which I have tested and work (at least to some extent, I don't really know what the "extent" is of the second approach). Here they are:

Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):

var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
// Build the full regex
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
var regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/

  • This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.

My other approach was to use the . character class, to have a simpler expression:

var regex = /^.+,\s.+$/;

  • This would match just about anything, at least in the form something, something. That's alright I suppose...
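To get a feel for how permissive the dot version is, here is a small check against a few inputs. The sample strings are made up for illustration. Since . matches any character other than line terminators, digits, punctuation, and extra commas all get through.

```javascript
// The permissive pattern from the question: anything, comma, whitespace, anything.
var permissive = /^.+,\s.+$/;

// Hypothetical sample inputs, chosen only to illustrate the behavior.
var samples = [
  "Müller, Jürgen", // accented name: accepted
  "O'Brien, Seán",  // apostrophe: accepted
  "123, !!!",       // digits and punctuation: accepted
  "a, b, c, d",     // extra commas: accepted
  "NoCommaHere"     // no ", " separator: rejected
];

samples.forEach(function (s) {
  console.log(JSON.stringify(s) + " -> " + permissive.test(s));
});
```

Only the last sample is rejected, so this pattern checks the overall shape of the field rather than the characters inside it.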

The last approach, which I just found might be simpler...


  • It matches a range of Unicode characters - tested and working, though I didn't try anything crazy, just the normal stuff I see in our language department for faculty member names.

Here are my concerns:

  1. The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.

  2. The second solution is better and concise, but it probably matches far more than it actually should. I couldn't find any real documentation on exactly what . matches, just the generalization of "any character except the newline character" (from a table on MDN).

  3. The third solution seems to be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table, the range seems to be pretty solid, at least for my expected input.

    • Faculty won't be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.) so I don't have to worry about out-of-Latin-character-set characters
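On the gotchas question in the third concern, a few edge cases can be probed directly. The sketch below assumes a range of \u00C0-\u017F (Latin-1 Supplement letters through Latin Extended-A), which is an assumption for illustration; the name rangeRegex and the sample strings are made up as well.

```javascript
// Assumed range for illustration: U+00C0 (À) through U+017F (ſ).
var rangeRegex = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// Precomposed accented letters inside the range match as expected.
console.log(rangeRegex.test("Dvo\u0159\u00E1k, Anton\u00EDn")); // true

// Gotcha 1: the range also contains two non-letters, × (U+00D7) and ÷ (U+00F7).
console.log(rangeRegex.test("\u00D7, \u00F7")); // true

// Gotcha 2: a decomposed "é" is "e" plus the combining mark U+0301,
// and U+0301 lies outside the range.
var decomposed = "Jose\u0301, Ana";
console.log(rangeRegex.test(decomposed)); // false
// Normalizing to NFC first composes it into U+00E9, which is inside the range.
console.log(rangeRegex.test(decomposed.normalize("NFC"))); // true

// Gotcha 3: Latin letters beyond U+017F, such as Ș (U+0218), are rejected.
console.log(rangeRegex.test("Popescu, \u0218tefan")); // false
```

If decomposed input is a realistic possibility (for example text pasted from some other applications), calling normalize("NFC") on the field value before testing is probably the cheapest safeguard.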

So the real question(s): Which of these three approaches is most suited for the task? Or are there better solutions?


Which of these three approaches is most suited for the task?

Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.
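As a rough sketch of the \S idea, the pattern below accepts any run of non-whitespace characters on each side of the separator. The second pattern is an alternative that is not part of the answer: it assumes an engine with ES2018 Unicode property escapes, where \p{L} matches any letter and \p{M} matches combining marks. Both variable names are made up for illustration.

```javascript
// \S version: any run of non-whitespace characters on each side of ", ".
var nonWhitespace = /^\S+,\s\S+$/;

// Alternative, assuming ES2018+ support for Unicode property escapes (needs the u flag):
// letters (\p{L}) and combining marks (\p{M}) only.
var unicodeLetters = /^[\p{L}\p{M}]+,\s[\p{L}\p{M}]+$/u;

console.log(nonWhitespace.test("Müller, Jürgen"));     // true
console.log(nonWhitespace.test("123, 456"));           // true: digits are non-whitespace
console.log(nonWhitespace.test("van der Berg, Jan"));  // false: spaces inside the last name

console.log(unicodeLetters.test("Müller, Jürgen"));    // true
console.log(unicodeLetters.test("123, 456"));          // false: digits are not letters
console.log(unicodeLetters.test("Jose\u0301, Ana"));   // true: combining mark is allowed
```

Neither variant accepts multi-word names, because both reject whitespace inside a name.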

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)

The most basic problem I'm seeing here is not diacritics, but whitespace. There are a few names that consist of multiple words, e.g. because of titles. So you should go with the most generic approach, that is, allowing everything but the comma that separates the last name from the first name:


But your second solution with the . character class is just as fine; you might only need to take care of multiple commas then.
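One way to express "everything but the comma" is a negated character class. The exact pattern below is an assumption, a minimal sketch rather than necessarily what the answer had in mind, and it is compared against the . version on an input with more than one comma.

```javascript
// Sketch: one or more non-comma characters, a comma, one whitespace character,
// then one or more non-comma characters.
var noComma = /^[^,]+,\s[^,]+$/;
// The permissive pattern from the question, for comparison.
var dotVersion = /^.+,\s.+$/;

console.log(noComma.test("van der Berg, Jan Willem")); // true: multi-word names pass
console.log(noComma.test("Müller, Jürgen"));           // true: diacritics pass
console.log(noComma.test("Müller,Jürgen"));            // false: no whitespace after the comma

// Multiple commas: rejected by the negated class, accepted by the dot version.
console.log(noComma.test("a, b, c"));    // false
console.log(dotVersion.test("a, b, c")); // true

// Caveat: whitespace-only parts still pass, so trimming and checking for
// empty parts may be worth doing separately.
console.log(noComma.test(" ,  "));       // true
```

The negated class sidesteps the question of which letters to allow, at the cost of accepting digits and punctuation in names.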