user1021605 - 1 month ago
ASP.NET (C#) Question

Translating RegEx from PHP to ASP.NET RegularExpressionValidator

I currently have this RegularExpressionValidator running:

<asp:RegularExpressionValidator ID="rev_Nachname" runat="server" ControlToValidate="edtNachname"
Display="None" ErrorMessage="$InvalidBeginOfStringNonTechnik$Nachname$2" ValidationExpression="^[a-zA-ZÆÄÜÖáâãäåæçèéêëìíîïñòóôõöøùúûüß0-9'-]{2}.*"></asp:RegularExpressionValidator>


I now have a requirement to accept all Unicode Latin characters, and I created the following regex for PHP:

^[\p{Latin}+\p{M}*+0-9'-]{2,}


After changing the regex and deploying the site, the application just runs into a timeout whenever I open the page where I changed the regex. If I undo my changes, everything is fine.

Since I do not receive any errors, I am somewhat in the dark, but I suspect that ASP.NET cannot handle my regex.
Is there an obvious reason why it isn't working?

Thanks in advance!

Answer

You seem to want to allow the Unicode blocks that have Latin in their names. Here are the main ones with their code point ranges (the block names are written the way .NET names them):

| Code point range | Block name                |
|------------------|---------------------------|
| 0000 - 007F      | IsBasicLatin              |
| 0080 - 00FF      | IsLatin-1Supplement       |
| 0100 - 017F      | IsLatinExtended-A         |
| 0180 - 024F      | IsLatinExtended-B         |
| 1E00 - 1EFF      | IsLatinExtendedAdditional |
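
As a rough illustration of how these blocks can be used on the server side, the sketch below builds one character class from the explicit code point ranges and one from .NET's named-block syntax (`\p{Is...}`), then runs both over a few sample strings. The class name and the sample strings are made up for this example. Keep in mind that RegularExpressionValidator also evaluates the expression in the browser with JavaScript regex syntax unless client-side validation is disabled, so the named-block form is only a safe assumption for server-side checks.

```csharp
using System;
using System.Text.RegularExpressions;

class LatinBlocksDemo
{
    static void Main()
    {
        // Explicit code point ranges, one per block in the table.
        var byRange = new Regex(
            @"^[\u0000-\u007F\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]+$");

        // The same blocks written with .NET named-block syntax.
        var byBlock = new Regex(
            @"^[\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{IsLatinExtended-A}" +
            @"\p{IsLatinExtended-B}\p{IsLatinExtendedAdditional}]+$");

        // Hypothetical samples: three Latin-script names (with characters from
        // Latin-1 Supplement and Latin Extended Additional) and one Cyrillic name.
        string[] samples =
        {
            "M\u00FCller",                          // u with diaeresis, U+00FC
            "\u00C6r\u00F8sk\u00F8bing",            // U+00C6 and U+00F8
            "Nguy\u1EC5n",                          // U+1EC5
            "\u0418\u0432\u0430\u043D\u043E\u0432"  // Cyrillic, outside all five blocks
        };

        foreach (var s in samples)
        {
            Console.WriteLine("{0}: range={1}, block={2}",
                s, byRange.IsMatch(s), byBlock.IsMatch(s));
        }
    }
}
```

Both expressions should accept the three Latin-script samples and reject the Cyrillic one, since they describe the same set of code points.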

So, you can create a custom character class from them and add '0-9- to it to get an extended version of your previous regex: [\u0000-\u007F\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF'0-9-].

However, your current regex only matches the start of the string (^), exactly 2 chars from your custom character class (the [...]{2} part), and then any 0+ chars other than line break characters (.*). The extended version will look like

^[\u0000-\u007F\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF'0-9-]{2}.*

If you need to allow two or more symbols from your custom character class, use

^[\u0000-\u007F\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF'0-9-]{2,}$
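
To make the difference between the two quantifier forms concrete, here is a small sketch that runs both patterns with `Regex.IsMatch`. It assumes the Latin Extended-B range is written as `\u0180-\u024F`; the class name, the `LatinClass` constant, and the sample inputs are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

class QuantifierDemo
{
    // Character class from the answer, with Latin Extended-B written as \u0180-\u024F.
    const string LatinClass =
        @"[\u0000-\u007F\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF'0-9-]";

    static void Main()
    {
        // Only the first two characters must come from the class.
        var prefixOnly = new Regex("^" + LatinClass + "{2}.*");

        // Every character must come from the class, and there must be at least two.
        var wholeInput = new Regex("^" + LatinClass + "{2,}$");

        // Hypothetical inputs: two Latin letters, two Latin letters followed by
        // a Cyrillic letter (U+0418), and a single Latin letter.
        string[] samples = { "Ab", "Ab\u0418", "A" };

        foreach (var s in samples)
        {
            Console.WriteLine("{0}: prefixOnly={1}, wholeInput={2}",
                s, prefixOnly.IsMatch(s), wholeInput.IsMatch(s));
        }
        // "Ab"       -> prefixOnly=True,  wholeInput=True
        // "Ab" + U+0418 -> prefixOnly=True,  wholeInput=False
        // "A"        -> prefixOnly=False, wholeInput=False
    }
}
```

The second input shows why the choice matters: the `{2}.*` form lets anything through after the first two characters, while `{2,}$` rejects the trailing non-Latin letter.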

UPDATE:

So, it turns out you need to support diacritics from outside the BMP, as well as specific Unicode code point ranges with some code points excluded:

^(?:(?:(?:(?![\u0009-\u002F\u003A-\u0040])[a-zA-Z\u006E-\u0302\u006D-\u0302\u004A-\u030C'0-9-])|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])){2,}

And here is a regex demo

The main part of the pattern is (?:(?![\u0009-\u002F\u003A-\u0040])[a-zA-Z\u006E-\u0302\u006D-\u0302\u004A-\u030C'0-9-]). The remaining alternatives match surrogate pairs and unpaired surrogates, which is how the diacritics from outside the BMP are matched.
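
The surrogate alternatives are needed because .NET strings are UTF-16, so a character outside the BMP is stored as two `char` values. The sketch below illustrates this with U+1D11E, a sample code point chosen only for this example (it is not taken from the question), and with just the surrogate-pair alternative from the pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

class SurrogatePairDemo
{
    static void Main()
    {
        // U+1D11E lies outside the BMP, so UTF-16 stores it as a
        // high surrogate followed by a low surrogate.
        string s = "\U0001D11E";

        Console.WriteLine(s.Length);                    // 2
        Console.WriteLine(char.IsHighSurrogate(s[0]));  // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));   // True

        // The surrogate-pair alternative from the pattern above.
        var pair = new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]");

        Match m = pair.Match("a" + s + "b");
        Console.WriteLine(m.Success);  // True
        Console.WriteLine(m.Index);    // 1
        Console.WriteLine(m.Length);   // 2
    }
}
```

Because the pair is matched as a single alternative, it counts as one repetition of the `{2,}` quantifier even though it occupies two `char` positions in the string.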
