MichaelChirico - 1 year ago 64
R Question

# How exactly does R parse `->`, the right-assignment operator?

So this is kind of a trivial question, but it's bugging me that I can't answer it, and perhaps the answer will teach me some more details about how R works.

The title says it all: how does R parse `->`, the obscure right-side assignment function?

My usual tricks to dive into this failed:

``````
`->`
``````

Error: object `->` not found

``````
getAnywhere("->")
``````

no object named `->` was found

And we can't call it directly:

``````
`->`(3, x)
``````

Error: could not find function `"->"`

But of course, it works:

``````
3 -> x  # assigns the value 3 to the name x
``````

It appears R knows how to simply reverse the arguments, but I thought the above approaches would surely have cracked the case:

``````
pryr::ast(3 -> y)
# \- ()
#   \- `<- #R interpreter clearly flipped things around
#   \- `y  #  (by the time it gets to `ast`, at least...)
#   \-  3  #  (note: this is because `substitute(3 -> y)`
#          #   already returns the reversed version)
``````

Compare this to the regular assignment operator:

``````
`<-`
.Primitive("<-")

`<-`(x, 3) #assigns the value 3 to the name x, as expected
``````

`?"->"`, `?assignOps`, and the R Language Definition all simply mention it in passing as the right assignment operator.

But there's clearly something unique about how `->` is used. It's not a function/operator (as the calls to `getAnywhere` and directly to `` `->` `` seem to demonstrate), so what is it? Is it completely in a class of its own?

Is there anything to learn from this besides "`->` is completely unique within the R language in how it's interpreted and handled; memorize and move on"?

Let me preface this by saying I know absolutely nothing about how parsers work. Having said that, line 296 of gram.y defines the following tokens to represent assignment in the (YACC?) parser R uses:

``````
%token      LEFT_ASSIGN EQ_ASSIGN RIGHT_ASSIGN LBB
``````

Then, on lines 5140 through 5150 of gram.c, this looks like the corresponding C code:

``````
case '-':
    if (nextchar('>')) {
        if (nextchar('>')) {
            yylval = install_and_save2("<<-", "->>");
            return RIGHT_ASSIGN;
        }
        else {
            yylval = install_and_save2("<-", "->");
            return RIGHT_ASSIGN;
        }
    }
``````

Finally, starting on line 5044 of gram.c, the definition of `install_and_save2`:

``````
/* Get an R symbol, and set different yytext.  Used for translation of -> to <-. ->> to <<- */
static SEXP install_and_save2(char * text, char * savetext)
{
    strcpy(yytext, savetext);
    return install(text);
}
``````
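To make the two C fragments above a bit more concrete, here is a toy re-creation of that lexer step in R. It is only a sketch: `lex_minus()` is a name I made up, the real work happens in C, and the fallback branch for a lone `-` is my assumption about what the rest of the `case` does.

```r
# Toy model of the '-' branch of the lexer (not R's actual implementation).
# lex_minus() is a hypothetical helper; it returns the token type, the symbol
# handed on to the grammar, and the text as it was typed.
lex_minus <- function(input) {
  if (startsWith(input, "->>")) {
    list(token = "RIGHT_ASSIGN", symbol = as.name("<<-"), text = "->>")
  } else if (startsWith(input, "->")) {
    list(token = "RIGHT_ASSIGN", symbol = as.name("<-"), text = "->")
  } else {
    # Assumed fallback: a plain minus sign is passed through unchanged.
    list(token = "-", symbol = as.name("-"), text = "-")
  }
}

tok <- lex_minus("-> x")
tok$symbol  # the symbol `<-`, even though the text typed was "->"
tok$text    # the original spelling is kept separately
```

The point of returning both `symbol` and `text` is the same as in `install_and_save2`: the grammar only ever sees the left-assignment symbol, while the original spelling is remembered on the side.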

So again, having zero experience working with parsers, it seems that `->` and `->>` are translated directly into `<-` and `<<-`, respectively, at a very low level in the interpretation process.
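One way to check this from inside R, without reading any C, is to parse the text yourself and inspect the result. This is a minimal sketch using only base R and `utils::getParseData()`; I'm assuming the token names reported there match the ones declared in gram.y.

```r
# Parse the source text without evaluating it.
rhs <- parse(text = "3 -> x", keep.source = TRUE)
call_obj <- rhs[[1]]

call_obj[[1]]                          # the function position holds `<-`
identical(call_obj, quote(x <- 3))     # same call as the left-assignment form
identical(quote(3 ->> x), quote(x <<- 3))

# The parse data still remembers what was typed: the token type is
# RIGHT_ASSIGN and its text is the original "->" spelling.
pd <- utils::getParseData(rhs)
pd[pd$token == "RIGHT_ASSIGN", c("token", "text")]
```

So the call object itself carries no trace of `->`; only the source-level parse data does.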

You brought up a very good point in asking how the parser "knows" to reverse the arguments to `->` - considering that `->` appears to be installed into the R symbol table as `<-` - and thus be able to correctly interpret `x -> y` as `y <- x` and not `x <- y`. The best I can do is provide further speculation as I continue to come across "evidence" to support my claims. Hopefully some merciful YACC expert will stumble on this question and provide a little insight; I'm not going to hold my breath on that, though.

Back to lines 383 and 384 of gram.y, this looks like some more parsing logic related to the aforementioned `LEFT_ASSIGN` and `RIGHT_ASSIGN` symbols:

``````
|   expr LEFT_ASSIGN expr       { $$ = xxbinary($2,$1,$3);  setId( $$, @$); }
|   expr RIGHT_ASSIGN expr      { $$ = xxbinary($2,$3,$1);  setId( $$, @$); }
``````

Although I can't really make heads or tails of this crazy syntax, I did notice that the second and third arguments to `xxbinary` are swapped between `LEFT_ASSIGN` (`xxbinary($2,$1,$3)`) and `RIGHT_ASSIGN` (`xxbinary($2,$3,$1)`).

Here's what I'm picturing in my head:

`LEFT_ASSIGN` Scenario: `y <- x`

• `$2` is the second "argument" to the parser in the above expression, i.e. `<-`
• `$1` is the first; namely `y`
• `$3` is the third; `x`

Therefore, the resulting (C?) call would be `xxbinary(<-, y, x)`.

Applying this logic to `RIGHT_ASSIGN`, i.e. `x -> y`, combined with my earlier conjecture about `<-` and `->` getting swapped,

• `$2` gets translated from `->` to `<-`
• `$1` is `x`
• `$3` is `y`

But since the result is `xxbinary($2,$3,$1)` instead of `xxbinary($2,$1,$3)`, the result is still `xxbinary(<-, y, x)`.
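The operand swap described above can be observed from R by pulling a quoted call apart. A small sketch in base R (nothing here touches the parser internals; it only inspects what the parser produced):

```r
left  <- quote(y <- x)
right <- quote(x -> y)

# A call is list-like: element 1 is the function, the rest are the arguments.
as.list(left)
as.list(right)

# Both spellings yield the same three pieces in the same order: `<-`, y, x.
identical(as.list(right), list(as.name("<-"), as.name("y"), as.name("x")))
identical(left, right)
```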

Building off of this a little further, we have the definition of `xxbinary` on line 3310 of gram.c:

``````
static SEXP xxbinary(SEXP n1, SEXP n2, SEXP n3)
{
    SEXP ans;
    if (GenerateCode)
        PROTECT(ans = lang3(n1, n2, n3));
    else
        PROTECT(ans = R_NilValue);
    UNPROTECT_PTR(n2);
    UNPROTECT_PTR(n3);
    return ans;
}
``````

Unfortunately I could not find a proper definition of `lang3` (or its variants `lang1`, `lang2`, etc.) in the R source code, but judging by how `xxbinary` uses it, I'm assuming that it builds a three-element call object (the function symbol followed by its two arguments) for the evaluator to run later, rather than evaluating anything itself.
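For what it's worth, the R-level analogue of what `xxbinary` hands to `lang3` would be assembling a call from its three pieces by hand. This is a sketch of that idea in plain R, not a description of the C internals; `env` is just a scratch environment for the example.

```r
# Assemble the call `y <- 3` from a function symbol and two arguments.
built <- as.call(list(as.name("<-"), as.name("y"), 3))

identical(built, quote(y <- 3))   # same object the parser produces
identical(built, quote(3 -> y))   # ... for either spelling

# Building the call does not run it; evaluation is a separate step.
env <- new.env()
exists("y", envir = env, inherits = FALSE)  # FALSE: nothing assigned yet
eval(built, envir = env)
get("y", envir = env)                       # 3
```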

1) Is this really the only object in R that behaves like this? (I've got in mind the John Chambers quote via Hadley's book: "Everything that exists is an object. Everything that happens is a function call.") This clearly lies outside that domain -- is there anything else like this?

First, I agree that this lies outside of that domain. I believe Chambers' quote concerns the R Environment, i.e. processes that are all taking place after this low level parsing phase. I'll touch on this a little bit more below, however. Anyways, the only other example of this sort of behavior I could find is the `**` operator, which is a synonym for the more common exponentiation operator `^`. As with right assignment, `**` doesn't seem to be "recognized" as a function call, etc... by the interpreter:

``````
R> `->`
Error: object '->' not found
R> `**`
Error: object '**' not found
``````

I found this because it's the only other case where `install_and_save2` is used by the C parser:

``````
case '*':
    /* Replace ** by ^.  This has been here since 1998, but is
       undocumented (at least in the obvious places).  It is in
       the index of the Blue Book with a reference to p. 431, the
       help for 'Deprecated'.  S-PLUS 6.2 still allowed this, so
       presumably it was for compatibility with S. */
    if (nextchar('*')) {
        yylval = install_and_save2("^", "**");
        return '^';
    } else
        yylval = install_and_save("*");
    return c;
``````
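The same kind of check works for `**` from the R prompt. A short base R sketch, assuming a session where no package has defined functions named `**` or `->`:

```r
# `**` is rewritten to `^` before a call object is ever built.
identical(quote(2 ** 3), quote(2 ^ 3))
2 ** 3

# Only the canonical spellings exist as functions.
exists("^")
exists("**")
exists("<-")
exists("->")
```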

2) When exactly does this happen? I've got in mind that substitute(3 -> y) has already flipped the expression; I couldn't figure out from the source what substitute does that would have pinged the YACC...

Of course I'm still speculating here, but yes, I think we can safely assume that when you call `substitute(3 -> y)`, from the perspective of the `substitute` function, the expression always was `y <- 3`; i.e. the function is completely unaware that you typed `3 -> y`. `do_substitute`, like 99% of the C functions used by R, only handles `SEXP` arguments: a `LANGSXP` (a call object) in the case of `3 -> y` (== `y <- 3`), I believe. This is what I was alluding to above when I made a distinction between the R Environment and the parsing process.

I don't think there is anything that specifically triggers the parser to spring into action; rather, everything you input into the interpreter gets parsed. I did a little more reading about the YACC / Bison parser generator last night, and as I understand it (a.k.a. don't bet the farm on this), Bison uses the grammar you define (in the `.y` file(s)) to generate a parser in C, i.e. a C function which does the actual parsing of input. In turn, everything you input in an R session is first processed by this C parsing function, which then delegates the appropriate action to be taken in the R Environment (I'm using this term very loosely by the way). During this phase, `lhs -> rhs` will get translated to `rhs <- lhs`, `**` to `^`, etc.

For example, this is an excerpt from one of the tables of primitive functions in names.c:

``````
/* Language Related Constructs */

/* Primitives */
{"if",      do_if,      0,  200,    -1, {PP_IF,      PREC_FN,     1}},
{"while",   do_while,   0,  100,    2,  {PP_WHILE,   PREC_FN,     0}},
{"for",     do_for,     0,  100,    3,  {PP_FOR,     PREC_FN,     0}},
{"repeat",  do_repeat,  0,  100,    1,  {PP_REPEAT,  PREC_FN,     0}},
{"break",   do_break, CTXT_BREAK,   0,  0,  {PP_BREAK,   PREC_FN,     0}},
{"next",    do_break, CTXT_NEXT,    0,  0,  {PP_NEXT,    PREC_FN,     0}},
{"return",  do_return,  0,  0,  -1, {PP_RETURN,  PREC_FN,     0}},
{"function",    do_function,    0,  0,  -1, {PP_FUNCTION,PREC_FN,     0}},
{"<-",      do_set,     1,  100,    -1, {PP_ASSIGN,  PREC_LEFT,   1}},
{"=",       do_set,     3,  100,    -1, {PP_ASSIGN,  PREC_EQ,     1}},
{"<<-",     do_set,     2,  100,    -1, {PP_ASSIGN2, PREC_LEFT,   1}},
{"{",       do_begin,   0,  200,    -1, {PP_CURLY,   PREC_FN,     0}},
{"(",       do_paren,   0,  1,  1,  {PP_PAREN,   PREC_FN,     0}},
``````

You will notice that `->`, `->>`, and `**` are not defined here. As far as I know, R primitive expressions such as `<-` and `[`, etc. are the closest interaction the R Environment ever has with any underlying C code. What I am suggesting is that by this stage in the process (from you typing a set of characters into the interpreter and hitting 'Enter', up through the actual evaluation of a valid R expression), the parser has already worked its magic, which is why you can't get a function definition for `->` or `**` by surrounding them with backticks, as you typically can.
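Two quick experiments seem consistent with this picture. The sketch below uses only base R; `capture()` and the replacement `->` function are names made up for the illustration.

```r
# 1. substitute() never sees the right-assignment spelling.
capture <- function(expr) substitute(expr)
captured <- capture(3 -> y)
identical(captured, quote(y <- 3))
deparse(captured)

# 2. Defining a function named `->` does not change what the operator does,
#    because the parser has already turned `3 -> z` into `z <- 3`.
`->` <- function(lhs, rhs) "user-defined function called"
3 -> z
z                  # still a plain assignment
res <- `->`(1, 2)  # only an explicit backtick call reaches the new function
res

rm(list = "->")    # clean up the illustrative definition
```

If `->` were looked up as a function at evaluation time, the second experiment would return the string instead of assigning to `z`.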
