antithesis - 11 days ago
Javascript Question

JavaScript - regex to remove special characters but also keep Greek characters

I am trying to remove special characters from a piece of text, but using the following regular expression

var desired = stringToReplace.replace(/[^\w\s]/gi, '')


(found here: "javascript regexp remove all special characters")

has the unwanted side effect of also deleting Greek characters, which I want to keep.

Can someone also explain to me how to use character ranges in regular expressions? Is there a character map that can help me define the range I want?

Answer: See my 2nd comment under Joeytje50's answer.

Answer

The way these ranges are defined is based on their character code. So, since A has char code 65, and z has char code 122, the following regex:

[A-z]

would match every letter, but also every character whose char code falls between those two, namely codes 91 through 96, which are the characters [ \ ] ^ _ and `.
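To see this effect concretely, here is a small sketch that walks the char codes from 65 to 122 and lists whatever [A-z] matches that is not a Latin letter. The variable names are only for illustration.

```javascript
// Collect every character that /[A-z]/ matches but that is not a Latin letter.
const wideRange = /[A-z]/;
const latinLetter = /[A-Za-z]/;
const extras = [];

for (let code = 65; code <= 122; code++) {
  const ch = String.fromCharCode(code);
  if (wideRange.test(ch) && !latinLetter.test(ch)) {
    extras.push(code + ":" + ch);
  }
}

// Prints the six non-letter characters that sit between Z (90) and a (97).
console.log(extras.join(" "));
```

This is why [A-Za-z] is normally written as two separate ranges rather than one.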

Now, for Greek letters, the character codes for the uppercase characters are 913-937 for alpha through omega, and the lowercase characters are 945-969 for alpha through omega (this includes both lowercase variants of sigma, namely ς (962) and σ (963)).
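If you want to check these codes yourself instead of looking them up in a character map, something like the following sketch should do; `codeOf` and `bounds` are just illustrative names.

```javascript
// Look up the numeric code of the first character of a string.
const codeOf = (ch) => ch.codePointAt(0);

const bounds = {
  upperAlpha: codeOf("Α"), // 913
  upperOmega: codeOf("Ω"), // 937
  lowerAlpha: codeOf("α"), // 945
  lowerOmega: codeOf("ω"), // 969
  finalSigma: codeOf("ς"), // 962
  sigma: codeOf("σ"),      // 963
};

console.log(bounds);
```

Running it the other way, `String.fromCharCode(945)` gives back α, which is handy when you need to find out what sits at the edge of a range.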

So, to match every character except Latin letters, Greek letters, and Arabic numerals, you need the following negated character class:

[^a-zA-Z0-9α-ωΑ-Ω]

So, for Greek characters, ranges work just like they do for Latin letters.
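As a rough sketch of how this plugs into the replace call from the question: the class has to be negated with a leading ^ so that replace removes everything outside it. The helper name `stripSpecial` is made up for this example. Note that this version has no \s in it, so spaces are removed too.

```javascript
// Hypothetical helper: remove everything that is not a Latin letter,
// an unaccented Greek letter, or a digit.
function stripSpecial(text) {
  return text.replace(/[^a-zA-Z0-9α-ωΑ-Ω]/g, "");
}

console.log(stripSpecial("αβγ, ABC! 123?")); // "αβγABC123"
```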


Edit: I've tested this on a Google-translated Lorem ipsum, and it looks like the regex above doesn't take accented letters into account. I've checked the character codes of these accented letters, and it turns out they are placed right before and right after the unaccented lowercase letters. So, the following regex also covers the accented lowercase letters:

[^a-zA-Z0-9ά-ωΑ-ώ]


This expanded range now also includes άέήίΰ (char codes 940 through 944) and ϊϋόύώ (codes 970 through 974).
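A quick way to see the difference is to run one accented word through both classes (in their negated form, so that replace strips what falls outside). This is only a sketch; the variable names are arbitrary.

```javascript
// Unaccented-only class versus the class extended with accented lowercase letters.
const unaccentedOnly = /[^a-zA-Z0-9α-ωΑ-Ω]/g;
const withAccented = /[^a-zA-Z0-9ά-ωΑ-ώ]/g;

const word = "καλημέρα";

const strippedWord = word.replace(unaccentedOnly, ""); // loses the έ (code 941)
const intactWord = word.replace(withAccented, "");     // keeps every letter

console.log(strippedWord, intactWord);
```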

To also keep whitespace (spaces, tabs, newlines), simply add a \s to the character class:

[^a-zA-Z0-9ά-ωΑ-ώ\s]



Edit: Apparently there are more Greek letters that need to be included, namely those in the range [Ά-Ϋ], which ends right before ά and adds the accented uppercase letters such as Ά and Έ. So the new regex looks like this:

[^a-zA-Z0-9Ά-ωΑ-ώ\s]
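Putting it together, a minimal sketch of a cleanup function based on this final class might look like the following. The function names are invented for the example, and the class is used in negated form (leading ^) so that replace removes whatever is outside it. The second function is not part of this answer: it swaps in ES2018 Unicode property escapes (\p{L}, \p{N} with the u flag), which keep letters and digits of every script and so are broader than Latin plus Greek. It assumes an engine that supports them.

```javascript
// Hypothetical helper built on the final character class above.
// Note: the Ά-ω range spans codes 902 through 969, so it also lets
// through code 903 (the Greek ano teleia, a punctuation mark).
function keepLatinGreekDigits(text) {
  return text.replace(/[^a-zA-Z0-9Ά-ωΑ-ώ\s]/g, "");
}

// Alternative technique (an assumption, not from the answer):
// Unicode property escapes keep letters and digits of any script.
function keepAnyLettersDigits(text) {
  return text.replace(/[^\p{L}\p{N}\s]/gu, "");
}

const sample = "Έλα, Hello! Ώρα 10:30 (τέλος).";
console.log(keepLatinGreekDigits(sample)); // "Έλα Hello Ώρα 1030 τέλος"
console.log(keepAnyLettersDigits(sample)); // "Έλα Hello Ώρα 1030 τέλος"
```

The explicit ranges are easier to reason about if you only ever expect Latin and Greek text; the property-escape version trades that precision for not having to look up char codes.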

