moerker moerker - 1 month ago 4x
CoffeeScript Question

Atom Language Definition using nested Regex patterns

I am currently trying to define a grammar in Atom (which is going surprisingly well) and, after 3 days of fiddling with regex, I feel like I am slowly going nuts.

The problem is that I am now leaving the field of "simple" definitions, so I need a far better knowledge of regular expressions than I currently have.

I want to match 4 specific patterns using

From the TextMate tutorials I learned that the behaviour should be somewhat like:

begin: \w
end: \d

Using this knowledge, I want to match these four expressions:

  1. foo( a(1) )
    : resolves to a scope which is nested "in itself" (the same way as described for
    -Strings in the TextMate Language Example).

  2. bar(1)('a')
    : resolves to a scope
    which is accessed by the field
    and therefore field
    has this scope only under the condition that at least a second parenthesis block is present.

  3. foo( bar(1)('a') )
    : A mixture of (1) and (2).
    is extracted (1),
    represents the same thing as described in (2).

  4. foo( bar(1)('a')('a') )('a')
    : The most complex one.
    represents an element which can be extracted by using the second parenthesis,
    represents something which can be extracted by the same mechanism and yield a value which may access
    without further problems at runtime.
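
To make the structure of these four cases concrete, here is a minimal sketch in Python (not part of the grammar itself) that splits such an expression into its leading name and its balanced parenthesis groups. The function name `split_accessor_chain` is hypothetical, and the sketch assumes that no parentheses occur inside string literals and that there is no whitespace between consecutive groups.

```python
def split_accessor_chain(text: str):
    """Split 'name(args)(args)...' into the name and its balanced argument groups."""
    text = text.strip()
    # Read the leading identifier (letters, digits, underscore).
    i = 0
    while i < len(text) and (text[i].isalnum() or text[i] == "_"):
        i += 1
    name, groups = text[:i], []
    # Each iteration consumes one balanced '(...)' block following the name.
    while i < len(text) and text[i] == "(":
        depth, start = 0, i
        # Walk forward until the parenthesis opened at `start` is closed again.
        while i < len(text):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:
                    break
            i += 1
        if depth != 0:
            raise ValueError("unbalanced parentheses")
        groups.append(text[start + 1:i].strip())
        i += 1  # skip the closing parenthesis
    return name, groups


print(split_accessor_chain("foo( bar(1)('a')('a') )('a')"))
```

For case 4 this yields the name `foo` with two groups: the nested expression `bar(1)('a')('a')` and the trailing accessor `'a'`. Calling the function again on the first group peels off the next level, which is exactly the "nested in itself" behaviour the grammar has to reproduce.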

To catch all of those statements I now have two regular expressions (CSON-syntax follows):

'comment': 'tries to catch foo(a)(a)(a) constructs'
'begin': '(?:' +
'(?:(?<=\\))\\s*)' + # closing parenthesis beforehand
'|(?:[\\w%\\$\\?!#]*)' + # character beforehand
')' +
'\\s*' +
'(\\()' + # opening bracket
'[^;]+?' +
'(\\))' +
'end': '(\\))+?'

'name': 'punctuation.parens.begin.someLang'
'name': 'punctuation.parens.someLang'
'name': 'punctuation.parens.begin.someLang'
'name': 'punctuation.parens.end.someLang'

So, to catch the surrounding parentheses, I use this:

'comment': 'describes a (nested) accessor using parenthesis'
'begin': '(?:[a-zA-Z_%\\$\\?!#][\\w%\\$\\?!#]*)' + # character beforehand
'end': '(?>(\\)))'

'name': 'punctuation.section.parens.begin.someLang'
'name': 'punctuation.section.parens.end.someLang'
'name': 'banana.invalid.illegal.someLang'

{ 'include': '#strange_accessors'}


I have been fiddling my way through greedy, reluctant and possessive behaviour, as well as atomic groups, because I think this will be the key to a good match.
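
As a small illustration of why quantifier behaviour alone does not solve the nesting, the following sketch compares a greedy and a reluctant (lazy) quantifier on a nested call. Python's `re` module is used here only as a convenient stand-in for experimenting, and the sample string is made up for the demonstration.

```python
import re

sample = "foo( a(1) )('x')"

# Greedy: '.*' runs to the LAST closing parenthesis in the string.
greedy = re.search(r"\(.*\)", sample).group()

# Reluctant: '.*?' stops at the FIRST closing parenthesis it can reach.
lazy = re.search(r"\(.*?\)", sample).group()

print(repr(greedy))  # "( a(1) )('x')"
print(repr(lazy))    # '( a(1)'
```

Neither result is the balanced block `( a(1) )`: the greedy version swallows the following accessor, and the reluctant one cuts off inside the nested call. Matching the right closing parenthesis needs either recursion or an explicit description of each nesting level.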

But I am thoroughly confused and don't really know how to solve this strange nesting problem. In case somebody is interested and wants to know why I need this:

It's a grammar for Scilab.


The regex below uses recursion:


It will match

  foo( a(1) )
  foo( bar(1)('a') )
  foo( bar(1)('a')('a') )('a')

You can test it here on regex101
(Scilab uses the PCRE regex engine from what I read on a forum)

Notice that it contains the positive lookbehind (?<=\s) to ensure there is whitespace before the leading word, because I doubt that you want to match something like \b( a(1) )
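
The effect of such a lookbehind can be tried out with a short Python sketch. The pattern here is a simplified, hypothetical one (a word that is preceded by whitespace and followed by an opening parenthesis), not the regex from this answer.

```python
import re

# A word that has whitespace directly before it and '(' directly after it.
name_before_paren = re.compile(r"(?<=\s)\w+(?=\()")

print(name_before_paren.findall("x = foo(1) +bar(2)"))  # ['foo']
print(name_before_paren.findall("foo(1)"))              # []
```

Only `foo` in the first string is found, because `bar` is preceded by `+` rather than whitespace. The second call shows a side effect worth keeping in mind: a name at the very start of the text has no character before it, so the lookbehind fails there as well.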

This regex will also match them, but without recursion, using only non-capturing groups:
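
The general idea of a recursion-free pattern can also be sketched independently in Python, whose `re` module has no recursion support. The sketch below is an assumption-laden illustration and not the regex given in this answer: the helper `balanced_parens` and the depth limit of 3 are hypothetical choices, and parentheses inside string literals are not handled.

```python
import re


def balanced_parens(depth: int) -> str:
    """Return a pattern for one '(...)' group nested at most `depth` levels deep."""
    pattern = r"\([^()]*\)"  # innermost level: no parentheses inside
    for _ in range(depth - 1):
        # Wrap the previous level: plain characters or a nested group, repeated.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern


GROUP = balanced_parens(3)
# A name followed by one or more balanced groups, e.g. foo(...)(...)
ACCESSOR = re.compile(r"\w+(?:\s*(?:" + GROUP + r"))+")

for text in ["foo( a(1) )", "bar(1)('a')", "foo( bar(1)('a')('a') )('a')"]:
    print(text, bool(ACCESSOR.fullmatch(text)))
```

The trade-off compared with recursion is the fixed nesting limit: an expression nested deeper than the chosen depth is simply not matched, so the depth has to be picked generously for the code the grammar is expected to see.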