slhck slhck - 3 months ago 8
R Question

Extract one column as rows with R, preserving other columns

What I have:

I have a data frame that looks like this:

sequence foo model output real
1 3 a 12 12
1 3 b 29 12
1 3 c 10 12
1 3 d 38 12
1 3 e 10 12
2 3 a 38 15
2 3 b 10 15
2 3 c 29 15
2 3 d 56 15
2 3 e 10 15


Created by:

d.test = data.frame(
sequence = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
foo = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
model = c("a", "b", "c", "d", "e", "a", "b", "c", "d", "e"),
output = c(12, 29, 10, 38, 10, 38, 10, 29, 56, 10),
real = c(12, 12, 12, 12, 12, 15, 15, 15, 15, 15)
)


The model predicts an
output
for every given
sequence
, but the
real
output is also recorded along every sequence.

What I need:

I would like to transform the data such that
real
becomes a "model" itself, that is:

sequence foo model output
1 3 a 12
1 3 b 29
1 3 c 10
1 3 d 38
1 3 e 10
1 3 real 12
2 3 a 38
2 3 b 10
2 3 c 29
2 3 d 56
2 3 e 10
2 3 real 15


How can I achieve that using
dplyr
,
tidyr
and their cousins?

Note that for a “nice” solution, one should not have to:


  • Manually enter column indices

  • Manually specify all the columns like
    foo
    which are not of interest



What I've tried:

I tried the following, but it feels clumsy:

unique(
melt(d.test,
id.vars = c("sequence", "foo"),
measure.vars = c("real"),
variable.name = "model",
value.name = "output"
)
)


Now I have to remove the
real
column from the original data frame and append the rows of what I just did. It's not a nice solution because apart from the
foo
column there may be many more columns that I'd like to preserve, and then I'd have to specify them as
id.vars
.

Answer

I'd use data.table:

library(data.table)
setDT(d.test)

d.test[, 
  rbind(.SD, .SD[1L][, `:=`(model = "real", output = real[1L])])
, by=sequence][, real := NULL][]

If I had to use the 'verse:

d.real = d.test %>% distinct(sequence) %>%
  mutate(model = "real", output = real) %>% select(-real)

d = d.test %>% select(-real)

And then stack them:

bind_rows(d, d.real)

If the ordering is important, add %>% arrange(sequence).


Comment. The problem in the OP originates with untidy data. Reading Hadley's paper on the subject would probably be helpful if you don't know what I mean.