Iterator - 1 year ago 83
R Question

# Why do logicals (booleans) in R require 4 bytes?

For a vector of logical values, why does R allocate 4 bytes, when a bit vector would consume 1 bit per entry? (See this question for examples.)

Now, I realize that R also facilitates storage of

`NA`
values, but couldn't that be done with an additional bit vector? In other words, why isn't it enough to just use a cheap two bit data structure?

For what it's worth, Matlab uses 1 byte for logicals, though it doesn't facilitate NA values. I'm not sure why MathWorks isn't satisfied with one bit functionality, much less a two bit data structure, but they have fancy pants marketers... [I'm gonna milk "two bit" for all it's worth in this question. ;-)]

Update 1. I think that the architecture reasons offered make some sense, but that feels a little ex post facto. I haven't checked 32 bit or 16 bit R to see how large their logicals are - that could lend some support to the idea. It seems, from the R Internals manual that logical vectors (LGLSXP) and integers (INTSXP) are 32 bits on every platform. I can understand a universal size for integers, independent of word size. Similarly, storage of logicals also seems to be independent of word size. But it's so BIG. :)

In addition, if the word size argument is so powerful, it seems strange to me to see Matlab (I think it's a 32 bit Matlab) consume only 1 byte - I wonder if MathWorks chose to be more memory efficient with a tradeoff for programming complexity and some other overhead for finding sub-word objects.

Also, there are certainly other options in are: as Brian Diggs notes, the
`bit`
package facilitates bit vectors, which was very useful for the problem in the question above (an 8X-10X speedup for the task was obtained by converting from 4 byte
`logical`
values to bit vectors). Although speed of accessing memory is important, moving 30-31 extra uninformative bits (from an information theory perspective) is wasteful. For instance, one could use something like the memory tricks used for integers described here - grab a bunch of extra memory (V cells) and then process things at the bit level (a la
`bit()`
). Why not do that and save 30 bits (1 for the value, 1 for
`NA`
) for a long vector?

To the extent that my RAM and computational speed are affected by booleans, I intend to switch over to using
`bit`
, but that's because a 97% savings in space matters in some cases. :)

I think that the answer to this question will come from someone with a deeper understanding of R's design or internals. The best example is that Matlab uses a different size for their logical, and memory word sizes wouldn't be the answer in that case. Python may be similar to R, for what it's worth.

A related way to phrase this might be: why would
`LGLSXP`
be 4 bytes on all platforms? (Is
`CHARSXP`
typically smaller, and wouldn't that work as well? Why not go even smaller, and just over-allocate?) (Updated The idea of using
`CHARSXP`
is likely bogus, because operations on
`CHARSXP`
aren't as fully useful as those for integers, such as
`sum`
. Using the same data structure as characters might save space, but would constrain which existing methods could operate on it. A more appropriate consideration is the use of smaller integers, as discussed below.)

Update 2. There have been some very good and enlightening answers here, especially relative to how one should implement retrieval and processing of booleans for the goals of speed and programming efficiency. I think that Tommy's answer is particularly plausible regarding the why it appears this way in R, which seems to arise from 2 premises:

1. In order to support addition on a logical vector (note that "logical" is defined by programming language / environment, and is not the same as a boolean), one is best served by reusing code for adding integers. In the case of R, integers consume 4 bytes. In the case of Matlab, the smallest integer is 1 byte (i.e.
`int8`
). This would explain why something different would be a nuisance to write for logicals. [To those not familiar with R, it supports many numerical operations on logicals, such as
`sum(myVector)`
,
`mean(myVector)`
, etc.]

2. Legacy support makes it exceedingly difficult to do something other than what has been done in R and S-Plus for a long time now. Moreover, I suspect that in the early days of S, S-Plus, and R, if someone was doing a lot of boolean operations, they did them in C, rather than trying to do so much work with logicals in R.

The other answers are fantastic for the purposes of how one might implement better boolean handling - don't naively assume that one can get at any individual bit: it's most efficient to load a word, then mask the bits that are not of interest, as Dervall has described. This is very, very useful advice should one write specialized code for boolean manipulation for R (e.g. my question on cross tabulations): don't iterate over bits, but instead work at the word level.

Thanks to all for a very thorough set of answers and insights.

Basically, a logical vector is identical to an integer vector, so `sum` and other algorithms for integers work pretty much unchanged on logical vectors.
@Iterator is of course correct that a logical vector should be represented in a more compact form. Since there is already a `raw` vector type which is 1-byte, it would seem like a very simple change to use that one for logicals too. And 2 bits per value would of course be even better - I'd probably keep them as two separate bit vectors (TRUE/FALSE and NA/Valid), and the NA bit vector could be NULL if there are no NAs...