Iterator - 9 months ago 39

R Question

For a vector of logical values, why does R allocate 4 bytes, when a bit vector would consume 1 bit per entry? (See this question for examples.)
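To make the 4-byte figure concrete, here is a minimal sketch that estimates the per-element cost of a few base R vector types. The helper `bytes_per_element` is a hypothetical name introduced for this illustration; it subtracts the sizes of two large vectors so that the fixed header drops out. The figures in the comments are what I would expect on typical R builds, not a guarantee.

```r
# Hypothetical helper: estimate storage per element by differencing the
# sizes of two large vectors, which cancels the fixed vector header.
bytes_per_element <- function(make, n = 1e6) {
  big   <- as.numeric(object.size(make(2 * n)))
  small <- as.numeric(object.size(make(n)))
  (big - small) / n
}

bytes_per_element(logical)  # 4: same as an integer
bytes_per_element(integer)  # 4
bytes_per_element(raw)      # 1
bytes_per_element(numeric)  # 8
```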

Now, I realize that R also facilitates storage of `NA` values.

For what it's worth, Matlab uses 1 byte for logicals, though it doesn't facilitate NA values. I'm not sure why MathWorks isn't satisfied with one bit functionality, much less a two bit data structure, but they have fancy pants marketers... [I'm gonna milk "two bit" for all it's worth in this question. ;-)]

Update 1. I think that the architecture reasons offered make some sense, but that feels a little ex post facto.

In addition, if the word size argument is so powerful, it seems strange to me to see Matlab (I think it's a 32 bit Matlab) consume only 1 byte - I wonder if MathWorks chose to be more memory efficient with a tradeoff for programming complexity and some other overhead for finding sub-word objects.

Also, there are certainly other options in R: as Brian Diggs notes, the

`bit`

`logical`

`bit()`

`NA`

To the extent that my RAM and computational speed are affected by booleans, I intend to switch over to using `bit`.

I think that the answer to this question will come from someone with a deeper understanding of R's design or internals. The best example is that Matlab uses a different size for their logical, and memory word sizes wouldn't be the answer in that case. Python may be similar to R, for what it's worth.

A related way to phrase this might be: why would

`LGLSXP`

`CHARSXP`

`CHARSXP`

`CHARSXP`

`sum`

Update 2. There have been some very good and enlightening answers here, especially relative to how one

- In order to support addition on a logical vector (note that "logical" is defined by programming language / environment, and is not the same as a boolean), one is best served by reusing code for adding integers. In the case of R, integers consume 4 bytes. In the case of Matlab, the smallest integer is 1 byte (i.e. `int8`). This would explain why something different would be a nuisance to write for logicals. [To those not familiar with R, it supports many numerical operations on logicals, such as `sum(myVector)`, `mean(myVector)`, etc.]

- Legacy support makes it exceedingly difficult to do something other than what has been done in R and S-Plus for a long time now. Moreover, I suspect that in the early days of S, S-Plus, and R, if someone was doing a lot of boolean operations, they did them in C, rather than trying to do so much work with logicals in R.
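The first point can be seen from the R prompt. This is only a small sketch with a made-up vector `x`; it shows logicals being treated as the integers 0 and 1 (with `NA` carried along) by ordinary arithmetic, not how the internals are actually written.

```r
# Made-up example data: a logical vector with one missing value.
x <- c(TRUE, FALSE, TRUE, NA)

typeof(x)                     # "logical"
as.integer(x)                 # 1 0 1 NA

# Numeric summaries work directly on logicals.
n_true <- sum(x, na.rm = TRUE)
n_true                        # 2
typeof(n_true)                # "integer"
mean(x, na.rm = TRUE)         # proportion of TRUE among non-missing values

# Arithmetic coerces logicals to integers as well.
TRUE + TRUE                   # 2
```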

The other answers are fantastic for the purposes of how one might implement better boolean handling - don't naively assume that one can get at any individual bit: it's most efficient to load a word, then mask the bits that are not of interest, as Dervall has described. This is very, very useful advice should one write specialized code for boolean manipulation for R (e.g. my question on cross tabulations): don't iterate over bits, but instead work at the word level.
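As a rough illustration of working at the word level, base R's `packBits()` and `bitwAnd()` can stand in for the load-a-word-then-mask idea. The vectors `a` and `b` are made up for this sketch, and their lengths are multiples of 32 so they pack cleanly into integers. I deliberately keep the top bit of each word clear, since a word with only bit 31 set has the same bit pattern as R's `NA_integer_`.

```r
# Two made-up logical vectors of length 64 (two 32-bit words each).
a <- rep(c(TRUE, FALSE), 32)
b <- rep(c(TRUE, TRUE, FALSE, FALSE), 16)

# Pack 32 logicals into each integer word (least significant bit first).
wa <- packBits(a, type = "integer")
wb <- packBits(b, type = "integer")

# One AND per word instead of 32 ANDs per word's worth of elements.
w_and <- bitwAnd(wa, wb)

# Unpack to check the result against the element-wise version.
unpacked <- as.logical(intToBits(w_and))
identical(unpacked, a & b)    # TRUE
```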

Thanks to all for a very thorough set of answers and insights.

Answer Source

Knowing a little something about R and S-Plus, I'd say that R most likely did it to be compatible with S-Plus, and S-Plus most likely did it because it was the easiest thing to do...

Basically, a logical vector is identical to an integer vector, so `sum` and other algorithms for integers work pretty much unchanged on logical vectors.

In 64-bit S-Plus, the integers are 64-bit and thus also the logical vectors! That's *8 bytes* per logical value...

@Iterator is of course correct that a logical vector *should* be represented in a more compact form. Since there is already a `raw` vector type which is 1-byte, it would seem like a very simple change to use that one for logicals too. And 2 bits per value would of course be even better - I'd probably keep them as two separate bit vectors (TRUE/FALSE and NA/Valid), and the NA bit vector could be NULL if there are no NAs...
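A rough sketch of that two-bit-vector layout, written in plain R on top of `packBits()` and `rawToBits()`. The names `encode_lgl` and `decode_lgl` are hypothetical, invented for this illustration; nothing like this exists in R's internals as far as I know, and a real implementation would live in C.

```r
# Hypothetical encoder: one packed bit vector for TRUE/FALSE and an
# optional one marking NA positions (NULL when there are no NAs).
encode_lgl <- function(x) {
  stopifnot(is.logical(x))
  n   <- length(x)
  pad <- (8L - n %% 8L) %% 8L          # pad up to a whole number of bytes
  na  <- is.na(x)
  value <- c(x & !na, logical(pad))    # NA & FALSE is FALSE, so no NAs remain
  list(
    n     = n,
    value = packBits(value, type = "raw"),
    na    = if (any(na)) packBits(c(na, logical(pad)), type = "raw") else NULL
  )
}

# Hypothetical decoder: unpack the value bits, then restore NAs if present.
decode_lgl <- function(enc) {
  out <- as.logical(rawToBits(enc$value))[seq_len(enc$n)]
  if (!is.null(enc$na)) {
    out[as.logical(rawToBits(enc$na))[seq_len(enc$n)]] <- NA
  }
  out
}

x   <- c(TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, TRUE, FALSE, NA)
enc <- encode_lgl(x)
length(enc$value)                # 2 bytes for 10 values
identical(decode_lgl(enc), x)    # TRUE
```

With no missing values the `na` component stays `NULL`, so the common case costs one bit per value plus the list overhead.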

Anyway, that's mostly a dream since there are so many RAPI packages (packages that use the R C/FORTRAN APIs) out there that would break...