moerker moerker - 2 months ago 18
CoffeeScript Question

Atom Language Definition using nested Regex patterns

I am currently trying to define a grammar in Atom (which is going surprisingly well), but after 3 days of fiddling with regex I feel like I am slowly going nuts.

The problem is that I am now leaving the field of "simple" definitions, so I need a far better knowledge of regular expressions than I currently have.

Question:
I want to match 4 specific patterns using begin and end.
Through TextMate tutorials I learned that the behaviour should be somewhat like:

begin: \w, end: \d becomes \w(.*)\d

Using this knowledge, I want to match these four expressions:


  1. foo( a(1) ): resolves to a scope which is nested "in itself" (the same way as described for qq-strings in the TextMate Language Example).

  2. bar(1)('a'): resolves to a scope bar which is accessed by the field (1) and then by the field ('a'). bar has this scope only under the condition that at least a second parenthesis block is present.

  3. foo( bar(1)('a') ): a mixture of (1) and (2). foo is extracted as in (1); bar represents the same thing as described in (2).

  4. foo( bar(1)('a')('a') )('a'): the most complex one. foo represents an element which can be extracted by using the second parenthesis; bar represents something which can be extracted by the same mechanism and yields a value which may access foo without further problems at runtime.



To catch all of those statements I now have two regular expressions (CSON-syntax follows):

'strange_accessors':
{
  'comment': 'tries to catch foo(a)(a)(a) constructs'
  'begin': '(?:' +
    '(?:(?<=\\))\\s*)' +     # closing parenthesis beforehand
    '|(?:[\\w%\\$\\?!#]*)' + # character beforehand
    ')' +
    '\\s*' +
    '(\\()' +                # opening bracket
    '[^;]+?' +
    '(\\))' +
    '\\s*(\\()'
  'end': '(\\))+?'

  'beginCaptures':
    '1':
      'name': 'punctuation.parens.begin.someLang'
    '2':
      'name': 'punctuation.parens.someLang'
    '3':
      'name': 'punctuation.parens.begin.someLang'
  'endCaptures':
    '0':
      'name': 'punctuation.parens.end.someLang'
}


So, to catch the surrounding parentheses, I use this:

'surronding_parenthesis':
{
  'comment': 'describes a (nested) accessor using parenthesis'
  'begin': '(?:[a-zA-Z_%\\$\\?!#][\\w%\\$\\?!#]*)' + # character beforehand
    '(\\()'
  'end': '(?>(\\)))'

  'beginCaptures':
    '1':
      'name': 'punctuation.section.parens.begin.someLang'
  'endCaptures':
    '1':
      'name': 'punctuation.section.parens.end.someLang'
    '2':
      'name': 'banana.invalid.illegal.someLang'

  'patterns': [
    { 'include': '#strange_accessors' }
  ]
}


I have been fiddling my way through greedy, reluctant and possessive quantifiers, as well as atomic groups, because I think they are the key to a good match.

But I am completely confused and don't really know how to solve this strange nesting problem. In case somebody is interested in why I need this:

It's a grammar for Scilab.

Answer

The regex below uses recursion:

(?<=\s)\w+(\(((?:[^()]+|(?1))*?)\))(\('.*?'\))?

It will match

  foo( a(1) )
  bar(1)('a')
  foo( bar(1)('a') )
  foo( bar(1)('a')('a') )('a')

You can test it here on regex101
(Scilab uses the PCRE regex engine from what I read on a forum)

Note that it contains the positive lookbehind (?<=\s) to ensure there is whitespace before the leading word, because I doubted that you want to match something like \b( a(1) )
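
One caveat for use inside an Atom grammar: Atom matches grammar patterns with Oniguruma rather than PCRE, and in Oniguruma's usual (Ruby-style) syntax a subexpression call is written \g<1> instead of (?1). Under that assumption, the pattern could be placed in a match rule roughly as follows. This is an untested sketch; the rule name and scope name are hypothetical, and the single quotes inside the pattern have to be escaped for CSON.

```cson
'repository':
  'accessor_call':                   # hypothetical rule name
    'name': 'meta.accessor.someLang' # hypothetical scope for the whole match
    # same pattern as above, with \g<1> as the recursive call to group 1
    'match': '(?<=\\s)\\w+(\\(((?:[^()]+|\\g<1>)*?)\\))(\\(\'.*?\'\\))?'
```

A match rule only sees one line at a time, so this form would not follow a parenthesis that is closed on a later line.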

The following regex will also match them, but without recursion, using only non-capturing groups:

(?<=\s)\w+\((?:.*?(?:\(.*?\))?)+\)(?:\('.*?'\))?
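
Since the question mentions the nested qq-strings from the TextMate example, the same idea can also be applied to parentheses without any recursion in the regex itself: a begin/end rule that includes itself. The following is only a minimal sketch; the rule name is hypothetical and the scope names are borrowed from the question.

```cson
'repository':
  'nested_parens':                   # hypothetical rule name
    'begin': '\\('
    'end': '\\)'
    'beginCaptures':
      '0':
        'name': 'punctuation.section.parens.begin.someLang'
    'endCaptures':
      '0':
        'name': 'punctuation.section.parens.end.someLang'
    'patterns': [
      { 'include': '#nested_parens' } # the rule includes itself for inner parentheses
    ]
```

Each inner ( opens a new instance of the rule, so the next ) closes the innermost open one and the nesting depth is tracked by the grammar engine instead of the regex.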