Gawdl3y - 3 months ago
Javascript Question

Parsing a string into an array of arguments with non-strict trailing argument

I'm trying to parse an argument string into an array of arguments. I have it mostly working, but it definitely seems like there'd be an easier way to go about doing this.

Rules:


  • Quoted strings ("some string") should be treated as a single argument, but the quotes should be removed from the resulting string

  • Any whitespace should separate arguments, except when we're already at the argCount (allowing the final argument to be unquoted, with all non-leading/trailing whitespace included)

  • Quotes should be ignored in the final argument, being left in the string as-is, unless the quotes in question are surrounding the entire final argument.



Examples:


  • this is an arg string
    with argCount 2 should result in
    ['this', 'is an arg string']

  • "this is" an arg string
    with argCount 2 should result in
    ['this is', 'an arg string']

  • "this is" "an arg" string too
    with argCount 3 should result in
    ['this is', 'an arg', 'string too']

  • this\nis an arg\n string!
    with argCount 3 should result in
    ['this', 'is', 'an arg\n string!']

  • this\nis an arg string!
    with argCount 2 should result in
    ['this', 'is an arg string!']

  • this\nis an arg string\nwith multiple lines in the final arg.\n inner whitespace still here
    with argCount 2 should result in
    ['this', 'is an arg string\nwith multiple lines in the final arg.\n inner whitespace still here']

  • this is an arg " string with "quotes in the final" argument.
    with argCount 2 should result in
    ['this', 'is an arg " string with "quotes in the final" argument.']

  • "this is" "an arg string with nested "quotes" in the final arg. neat."
    with argCount 2 should result in
    ['this is', 'an arg string with nested "quotes" in the final arg. neat.']



My current code:

function parseArgs(argString, argCount) {
    if(argCount) {
        if(argCount < 2) throw new RangeError('argCount must be at least 2.');
        const args = [];
        const newlinesReplaced = argString.trim().replace(/\n/g, '{!~NL~!}');
        const argv = stringArgv(newlinesReplaced);
        if(argv.length > 0) {
            for(let i = 0; i < argCount - 1; i++) args.push(argv.shift());
            if(argv.length > 0) args.push(argv.join(' ').replace(/{!~NL~!}/g, '\n').replace(/\n{3,}/g, '\n\n'));
        }
        return args;
    } else {
        return stringArgv(argString);
    }
}


I'm using the string-argv library, which provides the stringArgv function called above.
The last four examples do not work properly with my code: the dummy newline replacement tokens cause the arguments to be smashed together during the stringArgv call, and quotes take complete priority.

Update:

I clarified the quotes rule, and added a rule about quotes also being left untouched in the final argument. Added two additional examples to go along with the new rule.

Answer

You could use a regular expression for this:

function mySplit(s, argCount) {
    var re = /\s*(?:("|')([^]*?)\1|(\S+))\s*/g,
        result = [],
        match = []; // should be non-null
    argCount = argCount || s.length; // default: large enough to get all items
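    // get match and push the captured argument: group 2 if it was quoted
    // (match[1] holds the quote character), otherwise group 3;
    // testing match[1] keeps an empty quoted argument ("") as ''
    while (--argCount && (match = re.exec(s))) result.push(match[1] ? match[2] : match[3]);
    // if text remains, push it to the array as it is, except for
    // wrapping quotes (single or double), which are removed from it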
    if (match && re.lastIndex < s.length)
        result.push(s.substr(re.lastIndex).replace(/^("|')([^]*)\1$/g, '$2'));
    return result;
}
// Sample input
var s = '"this is" "an arg" string too';
// Split it
var parts = mySplit(s, 3);
// Show result
console.log(parts);

This gives the desired result for all example cases you provided.
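
One way to check that claim is to run the question's examples through the function. The sketch below is only an illustration: it includes a copy of the regex-based function so that it runs on its own, and the cases table and failures variable are names made up for this check.

```javascript
// Copy of the regex-based splitter, included so this snippet is self-contained.
function mySplit(s, argCount) {
    var re = /\s*(?:("|')([^]*?)\1|(\S+))\s*/g,
        result = [],
        match = []; // non-null so the "remaining text" branch also works for argCount 1
    argCount = argCount || s.length;
    // match[1] is only set when the argument was quoted
    while (--argCount && (match = re.exec(s))) result.push(match[1] ? match[2] : match[3]);
    if (match && re.lastIndex < s.length)
        result.push(s.substr(re.lastIndex).replace(/^("|')([^]*)\1$/g, '$2'));
    return result;
}

// [input, argCount, expected] triples taken from the question's examples
var cases = [
    ['this is an arg string', 2, ['this', 'is an arg string']],
    ['"this is" an arg string', 2, ['this is', 'an arg string']],
    ['"this is" "an arg" string too', 3, ['this is', 'an arg', 'string too']],
    ['this\nis an arg\n string!', 3, ['this', 'is', 'an arg\n string!']],
    ['this\nis an arg string!', 2, ['this', 'is an arg string!']],
    ['this\nis an arg string\nwith multiple lines in the final arg.\n inner whitespace still here', 2,
        ['this', 'is an arg string\nwith multiple lines in the final arg.\n inner whitespace still here']],
    ['this is an arg " string with "quotes in the final" argument.', 2,
        ['this', 'is an arg " string with "quotes in the final" argument.']],
    ['"this is" "an arg string with nested "quotes" in the final arg. neat."', 2,
        ['this is', 'an arg string with nested "quotes" in the final arg. neat.']]
];

// Collect every case whose actual result differs from the expected array
var failures = cases.filter(function (c) {
    return JSON.stringify(mySplit(c[0], c[1])) !== JSON.stringify(c[2]);
});

console.log(failures.length === 0 ? 'all examples pass' : failures);
```

Comparing via JSON.stringify is a shortcut that is good enough for flat arrays of strings; a real test suite would more likely use an assertion library.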

Backslash escaping

If you want to support backslash escaping, so that you can embed literal double quotes in the leading arguments without ending those arguments early, then you can use this regular expression in the above code:

var re = /\s*(?:("|')((?:\\[^]|[^\\])*?)\1|(\S+))\s*/g,

The magic is in (?:\\[^]|[^\\]): either a backslash followed by something, or not-a-backslash. This way, the double quote that follows a backslash will never get matched as an argument-closing one.
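
To see the effect, here is a rough sketch of the same function with the backslash-aware regex swapped in. The name mySplitEscaped and the final unescape step are assumptions added for illustration and are not part of the original answer. Note that the regex only stops an escaped quote from closing the argument; the backslashes themselves stay in the captured text unless you remove them afterwards.

```javascript
// Hypothetical variant of the splitter that uses the backslash-aware regex.
function mySplitEscaped(s, argCount) {
    var re = /\s*(?:("|')((?:\\[^]|[^\\])*?)\1|(\S+))\s*/g,
        result = [],
        match = [];
    argCount = argCount || s.length;
    // match[1] is only set when the argument was quoted
    while (--argCount && (match = re.exec(s))) result.push(match[1] ? match[2] : match[3]);
    if (match && re.lastIndex < s.length)
        result.push(s.substr(re.lastIndex).replace(/^("|')([^]*)\1$/g, '$2'));
    return result;
}

// The string contains literal backslashes: "say \"hi\" now" and the rest
var input = '"say \\"hi\\" now" and the rest';
var parts = mySplitEscaped(input, 2);
console.log(parts[0]); // say \"hi\" now
console.log(parts[1]); // and the rest

// Assumption: callers who want the backslashes gone can unescape afterwards
// by replacing each backslash-plus-character pair with just the character.
var unescaped = parts[0].replace(/\\([^])/g, '$1');
console.log(unescaped); // say "hi" now
```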

The (?: makes the group non capturing (i.e. it is not numbered for $1 style back-references).

The [^] may look weird, but it is a way in JavaScript regexes to say "any character", which is broader than the dot, which does not match newlines. Other regex flavors offer an s modifier that gives the dot this broader meaning; JavaScript only gained it as the s (dotAll) flag in ES2018, so [^] remains the safe choice for older engines.
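
The difference is easy to see on a string that contains a newline. The following is a small illustration with made-up variable names; it also shows [\s\S], another widely used way to write "any character" that works in other regex flavors too.

```javascript
var text = 'line one\nline two';

// The dot stops at the newline
var dotMatch = /^.*/.exec(text)[0];
console.log(JSON.stringify(dotMatch));   // "line one"

// [^] (an empty negated class) matches every character, newlines included
var anyMatch = /^[^]*/.exec(text)[0];
console.log(anyMatch === text);          // true

// [\s\S] (whitespace or non-whitespace) behaves the same way
var classMatch = /^[\s\S]*/.exec(text)[0];
console.log(classMatch === text);        // true
```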
