Zac Zac - 1 month ago 15
Ruby Question

How to have an undefined number of captures in a regex?

I'm making a simple stack-based language that uses commands to manipulate the stack. When I find a command in the source, I use a regex to separate the command name, such as sum, from the arguments to the command. Arguments are enclosed in angle brackets and separated by commas.

Here's the regex I'm currently using:

(?<command>[^<>\s]+)(\<(?<args>(\d+)+(?>,\s*\d+)*)\>)?


This works fine; here are some examples of what it matches:

+ => command: '+', args: nil
sum<5> => command: 'sum', args: '5'
print<1, 2, 3> => command: 'print', args: '1, 2, 3'


This works exactly as I want for every example but the last. Is there a way to capture each argument separately? I mean like this:

print<1, 2, 3> => command: 'print', args: ['1', '2', '3']


By the way, I'm using the latest Ruby regex engine.

Answer

A single regex with a repeated capturing group cannot produce this output in Ruby: the engine does not keep a capture stack, so a group that matches several times retains only the text of its last iteration.

Instead, capture the whole argument list in the args group and split it on commas as a post-processing step.
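To make the limitation concrete, here is a minimal sketch. The pattern REPEATED_ARG and the helper last_repeated_capture are hypothetical names introduced only for this illustration; they are not part of the question's regex. The named group sits inside a repeated non-capturing group, so it matches once per argument, yet only the final value is available afterwards.

```ruby
# Hypothetical pattern: the named group :arg is matched once per argument
# because it sits inside a repeated (?: ... )+ group.
REPEATED_ARG = /<(?:(?<arg>\d+),?\s*)+>/

# Hypothetical helper: returns whatever :arg holds after the match,
# or nil when the source has no <...> argument list.
def last_repeated_capture(source)
    m = REPEATED_ARG.match(source)
    m && m[:arg]
end

p last_repeated_capture("print<1, 2, 3>")  # => "3"
```

The earlier values "1" and "2" are overwritten as the group repeats, which is why the split has to happen after the match.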

See the Ruby demo:

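def cmd_split(s)
    # Named groups: the command name, then an optional <...> argument list.
    rx = /(?<command>[^<>\s]+)(<(?<args>\d+(?:,\s*\d+)*)>)?/
    res = []
    s.scan(rx) do
        m = $~  # MatchData for the current match
        if m[:args]
            # Split the single args capture into individual arguments.
            res << { "command" => m[:command], "args" => m[:args].split(/,\s*/) }
        else
            # No argument list: map the command name to an empty string.
            res << { m[:command] => "" }
        end
    end
    res
end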

puts cmd_split("print<1, 2, 3>") # => {"command"=>"print", "args"=>["1", "2", "3"]}
puts cmd_split("disp<1>")        # => {"command"=>"disp", "args"=>["1"]}
puts cmd_split("+")              # => {"+"=>""}
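If a uniform result shape is preferred, closer to the output requested in the question (args: nil for a bare command), a variant could look like the sketch below. The names COMMAND_RX and parse_commands are hypothetical, and using symbol keys is an assumption made for this example, not something the original demo does. It pulls the numbers out of the args capture with String#scan instead of splitting on commas.

```ruby
# Same idea as the demo: one capture for the whole argument list,
# then a second pass to separate the individual arguments.
COMMAND_RX = /(?<command>[^<>\s]+)(?:<(?<args>\d+(?:,\s*\d+)*)>)?/

# Hypothetical helper: returns one hash per command found in the source.
# :args is an array of digit strings, or nil when no <...> list is present.
def parse_commands(source)
    commands = []
    source.scan(COMMAND_RX) do
        m = Regexp.last_match
        args = m[:args] && m[:args].scan(/\d+/)
        commands << { command: m[:command], args: args }
    end
    commands
end

parse_commands("print<1, 2, 3> sum<5> +").each do |cmd|
    puts "#{cmd[:command]}: #{cmd[:args].inspect}"
end
```

Every entry then has both keys, so calling code can check cmd[:args].nil? rather than handling two different hash layouts.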