Ghostkeeper Ghostkeeper - 1 month ago 8
Javascript Question

Regex character count, but some count for three

I'm trying to build a regular expression that places a limit on the input length, but not all characters count equal in this length. I'll put the rationale at the bottom of the question. As a simple example, let's limit the maximum length to 12 and allow only

a
and
b
, but
b
counts for 3 characters.

Allowed are:


  • aa
    (anything less than 12 is fine).

  • aaaaaaaaaaaa
    (exactly 12 is fine).

  • aaabaaab
    (6 + 2 * 3 = 12, which is fine).

  • abaaaaab
    (still 6 + 2 * 3 = 12).



Disallowed is:


  • aaaaaaaaaaaaa
    (13
    a
    's).

  • bbbba
    (1 + 4 * 3 = 13, which is too much).

  • baaaaaaab
    (7 + 2 * 3 = 13, which is too much).



I've made an attempt that gets fairly close:

^(a{0,3}|b){0,4}$


This matches on up to 4 clusters that may consist of 0-3
a
's or one
b
.

However, it fails to match on my last positive example:
abaaaaab
, because that forces the first cluster to be the single
a
at the beginning, consumes a second cluster for the
b
, then leaves only 2 more clusters for the rest,
aaaaab
, which is too long.

Constraints




  • Must run in JavaScript. This regex is supplied to Qt, which apparently uses JavaScript's syntax.

  • Doesn't really need to be fast. In the end it'll only be applied to strings of up to 40 characters. I hope it validates within 50ms or so, but slightly slower is acceptable.



Rationale



Why do I need to do this with a regular expression?

It's for a user interface in Qt via PyQt and QML. The user can type a name in a text field here for a profile. This profile name is url-encoded (special characters are replaced by %XX), and then saved on the user's file system. We encounter problems when the user types a lot of special characters, such as Chinese, which then encode to a very long file name. Turns out that at somewhere like 17 characters, this file name becomes too long for some file systems. The URL-encoding encodes as UTF-8, which has up to 4 bytes per character, resulting in up to 12 characters in the file name (as each of these gets percent-encoded).

16 characters is too short for profile names. Even some of our default names exceed that. We need a variable limit based on these special characters.

Qt normally allows you to specify a Validator to determine which values are acceptable in a text box. We tried implementing such a validator, but that resulted in a segfault upstream, due to a bug in PyQt. It can't seem to handle custom Validator implementations at the moment. However, PyQt also exposes three built-in validators. Two apply only to numbers. The third is a regex validator that allows you to put a regular expression that matches all valid strings. Hence the need for this regular expression.

Answer

There is no real straightforward way to do this, given the limitations of regexp. You're going to have to test for all combinations, such as thirteen b with up to one a, twelve b with up to four a, and so on. We will build a little program to generate these for us. The basic format for testing for up to four a will be

/^(?=([^a]*a){0,4}[^a]*$)/

We'll write a little routine to create these lookaheads for us, given some letter and a minimum and maximum number of occurrences:

function matchLetter(c, m, n) {
  return `(?=([^${c}]*${c}){${m},${n}}[^${c}]*$)`;
}

> matchLetter('a', 0, 4)
< "(?=([^a]*a){0,4}[^a]*$)"

We can combine these to test for three b with up to three a:

/^(?=([^b]*b){3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)/

We will write a function to create such combined lookaheads which matches exactly m occurrences of c1 and up to n occurrences of c2:

function matchTwoLetters(c1, m, c2, n) {
  return matchLetter(c1, m, m) + matchLetter(c2, 0, n);
}

We can use this to match exactly twelve b and up to four a, for a total of forty or less:

> matchTwoLetters('b', 12, 'a', 1, 4)
< "(?=([^b]*b){12,12}[^b]*$)(?=([^a]*a){1,4}[^a]*$)"

It remains to simply create versions of this for each count of b, and glom them together (for the case of a max count of 12):

function makeRegExp() {
  const res = [];
  for (let bs = 0; bs <= 4; bs++)
    res.push(matchTwoLetters('b', bs, 'a', 12 - bs*3));
  return new RegExp(`^(${res.join('|')})`);
}

> makeRegExp()
< "^((?=([^b]*b){0,0}[^b]*$)(?=([^a]*a){0,12}[^a]*$)|(?=([^b]*b){1,1}[^b]*$)(?=([^a]*a){0,9}[^a]*$)|(?=([^b]*b){2,2}[^b]*$)(?=([^a]*a){0,6}[^a]*$)|(?=([^b]*b){3,3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)|(?=([^b]*b){4,4}[^b]*$)(?=([^a]*a){0,0}[^a]*$))"

Now you can do the test with

makeRegExp().test("baabaaa");

For the case of length=40, the regxp is 679 characters long. A very rough benchmark shows that it executes in under a microsecond.