lolibility - 1 year ago 61

R Question

I have a data frame that stores how many fruits of each kind different people have, like this:

|     | apple | banana | orange |
| --- | --- | --- | --- |
| Tim | 3 | 0 | 2 |
| Tom | 0 | 1 | 1 |
| Bob | 1 | 2 | 2 |

Again, the numbers are counts of fruits. How can I change this into an existence matrix, meaning that if a person has a given fruit, no matter how many, I record 1, and if not, I record 0? Like this:

|     | apple | banana | orange |
| --- | --- | --- | --- |
| Tim | 1 | 0 | 1 |
| Tom | 0 | 1 | 1 |
| Bob | 1 | 1 | 1 |

Answer Source

Here's your `data.frame`:

```
x <- structure(list(apple = c(3L, 0L, 1L), banana = 0:2, orange = c(2L,
1L, 2L)), .Names = c("apple", "banana", "orange"), class = "data.frame", row.names = c("Tim",
"Tom", "Bob"))
```

And your matrix:

```
as.matrix((x > 0) + 0)
#     apple banana orange
# Tim     1      0      1
# Tom     0      1      1
# Bob     1      1      1
```

I had no idea that a quick pre-bedtime posting would generate any discussion, but the discussions themselves are quite interesting, so I wanted to summarize here:

My instinct was to rely on the fact that, underneath, `TRUE` and `FALSE` in R are the numbers `1` and `0`. If you check for equivalence (in a not-so-good way) with `1 == TRUE` or `0 == FALSE`, you'll get `TRUE`. My shortcut (which turns out to take **more time** than the *correct*, or at least *more conceptually correct*, way) was to just add `0` to my `TRUE`s and `FALSE`s, since I knew that R would coerce the logical values to numeric.
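As a small illustration of that coercion (the names `lg` and `num` are made up for this sketch and are not part of the original answer):

```r
# A logical vector, and the same vector after adding 0
lg  <- c(TRUE, FALSE, TRUE)
num <- lg + 0          # arithmetic coerces logical to numeric

class(lg)              # "logical"
class(num)             # "numeric"
num                    # 1 0 1

# `==` coerces before comparing; identical() does not
1 == TRUE              # TRUE
identical(1, TRUE)     # FALSE
```

The last two lines are why `==` is a "not so good" equivalence check here: it hides the type difference that `identical()` reports.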

The correct, or at least more appropriate, way would be to convert the output using `as.numeric` (I think that's what @JoshO'Brien intended to write). Unfortunately, that removes the dimension attributes of the input, so you need to convert the resulting vector back to a matrix, which, as it turns out, is **still** faster than adding `0` as I did in my answer.

Having read the comments and criticisms, I thought I would add one more option: using `apply` to loop through the columns with the `as.numeric` approach. That is *slower* than manually re-creating the matrix, but *slightly faster* than adding `0` to the logical comparison.
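To make the dimension issue concrete, here is a minimal sketch on the fruit data from the question. The names `counts`, `flat`, and `restored` are invented for illustration, and the data are entered as a matrix rather than a `data.frame` to keep the example short:

```r
# Fruit counts from the question (matrix() fills column by column)
counts <- matrix(c(3, 0, 1,   # apple
                   0, 1, 2,   # banana
                   2, 1, 2),  # orange
                 nrow = 3,
                 dimnames = list(c("Tim", "Tom", "Bob"),
                                 c("apple", "banana", "orange")))

# as.numeric() keeps the values but drops dim and dimnames
flat <- as.numeric(counts > 0)
dim(flat)              # NULL

# Rebuild the matrix, reusing the original shape and names
restored <- matrix(flat, nrow = nrow(counts), dimnames = dimnames(counts))
restored
```

Because both `as.numeric` and `matrix` work in column-major order, the values land back in their original cells.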

```
# Test data: 1001 rows (one per value in 0:1e3) by 10,000 columns
x <- data.frame(replicate(1e4, sample(0:1e3)))

library(rbenchmark)
benchmark(
  # X1: add 0 to the logical comparison (the original answer)
  X1 = {
    x1 <- as.matrix((x > 0) + 0)
  },
  # X2: as.numeric() applied column by column
  X2 = {
    x2 <- apply(x, 2, function(y) as.numeric(y > 0))
  },
  # X3: as.numeric() on the whole matrix, then rebuild the dimensions
  X3 = {
    x3 <- as.numeric(as.matrix(x) > 0)
    x3 <- matrix(x3, nrow = 1001)
  },
  # X4: ifelse() on the logical comparison
  X4 = {
    x4 <- ifelse(x > 0, 1, 0)
  },
  columns = c("test", "replications", "elapsed",
              "relative", "user.self"))
#   test replications elapsed relative user.self
# 1   X1          100 116.618    1.985   110.711
# 2   X2          100 105.026    1.788    94.070
# 3   X3          100  58.750    1.000    46.007
# 4   X4          100 382.410    6.509   311.567

# All four approaches give the same values
all.equal(x1, x2, check.attributes=FALSE)
# [1] TRUE
all.equal(x1, x3, check.attributes=FALSE)
# [1] TRUE
all.equal(x1, x4, check.attributes=FALSE)
# [1] TRUE
```
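If you don't have `rbenchmark` installed, a rough version of the X1 versus X3 comparison can be sketched with base R's `system.time`. The smaller data size, the 10 repetitions, and the names `small`, `t_add`, and `t_rebuild` are assumptions made for this sketch; timings will differ by machine, so none are shown.

```r
set.seed(1)
# Smaller test data than above: 1001 rows by 200 columns
small <- data.frame(replicate(200, sample(0:1000)))

# Adding 0 to the logical comparison
t_add <- system.time(for (i in 1:10) {
  m1 <- as.matrix((small > 0) + 0)
})

# as.numeric() followed by rebuilding the matrix
t_rebuild <- system.time(for (i in 1:10) {
  m3 <- matrix(as.numeric(as.matrix(small) > 0), nrow = nrow(small))
})

# Elapsed seconds for each approach
c(add_zero = t_add[["elapsed"]], rebuild = t_rebuild[["elapsed"]])

# Both approaches should produce the same values
all.equal(m1, m3, check.attributes = FALSE)
```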

Thanks for the discussion y'all!