skan skan - 2 months ago 4
R Question

R converting to long format, pattern

I would like to convert a data.table like this one from wide format to long.

set.seed(1)
DT <- data.table(
ID = c(1:5, NA),
Name = c("Bob","Ana","Smith","Sam","Big","Lulu"),
Kind_2001 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
Kind_2002 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
Kind_2003 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
Conc_2001 = sample(99,6),
Conc_2002 = sample(79,6),
Conc_2003 = sample(49,6)
)


ID Name Kind_2001 Kind_2002 Kind_2003 Conc_2001 Conc_2002 Conc_2003
1 Bob b NA c 38 22 24
2 Ana b c b 77 31 29
3 Smith c c NA 91 2 49
4 Sam NA a b 21 30 9
5 Big a a c 62 66 38
NA Lulu NA a NA 12 26 30


And I would like to get something like this:

ID Name Year Kind Conc
1 Bob 2001 b 38
1 Bob 2002 NA 22
1 Bob 2003 c 24
2 Ana 2001 b 77
2 Ana 2002 c 31
2 Ana 2003 b 29
...


The real table has many more variables, I'm looking for a solution without explicitly saying every column name or number, detecting automatically the pattern.

I have two kind of columns, some ending with an underscore and a four digit year, such as _2001, and the other without that ending.

Some can have an underscore in the middle of the name (this will be kept untransformed).

I would like to transform the columns ending with a year to long format.

I've tried with

melt(DT, id=1:2, variable.name = "year")
or with
melt(DT, id=1:2, measure=patterns("_2[0-9][0-9][0-9]$"))


but I'm not getting what I want.

Maybe I first need to filter the names with gsub.

PD: I've found this solution.

posi <- grep("_[0-9][0-9][0-9][0-9]$",colnames(DT))
work <- unique(gsub("_[0-9][0-9][0-9][0-9]$","",colnames(DT)[posi]))
melt(DT, measure=patterns(paste0("^",work)), variable="year", value.name=work)


It almost works but the year column is not populated properly. I'm missing something or it's a bug.
And I'm sure it could be written simpler.

ID Name year Kind Conc
1 Bob 1 b 38
2 Ana 1 b 77
3 Smith 1 c 91
4 Sam 1 NA 21
5 Big 1 a 62
NA Lulu 1 NA 12
1 Bob 2 NA 22
2 Ana 2 c 31
3 Smith 2 c 2
4 Sam 2 a 30
5 Big 2 a 66
NA Lulu 2 a 26
1 Bob 3 c 24
2 Ana 3 b 29
3 Smith 3 NA 49
4 Sam 3 b 9
5 Big 3 c 38
NA Lulu 3 NA 30


Regards

Answer

Here's an option that's more robust with respect to the order of your columns, as well as missing/extra years:

dcast(melt(DT, id.vars = c("ID", "Name"))
        [, .(ID, Name, sub('_.*', '', variable), sub('.*_', '', variable), value)],
      ID + Name + V4 ~ V3)
#    ID  Name   V4 Conc Kind
# 1:  1   Bob 2001   38    b
# 2:  1   Bob 2002   22   NA
# 3:  1   Bob 2003   24    c
# 4:  2   Ana 2001   77    b
# 5:  2   Ana 2002   31    c
# 6:  2   Ana 2003   29    b
# 7:  3 Smith 2001   91    c
# 8:  3 Smith 2002    2    c
# 9:  3 Smith 2003   49   NA
#10:  4   Sam 2001   21   NA
#11:  4   Sam 2002   30    a
#12:  4   Sam 2003    9    b
#13:  5   Big 2001   62    a
#14:  5   Big 2002   66    a
#15:  5   Big 2003   38    c
#16: NA  Lulu 2001   12   NA
#17: NA  Lulu 2002   26    a
#18: NA  Lulu 2003   30   NA

Edit for many id columns:

idvars = grep("_", names(DT), invert = TRUE)
dcast(melt(DT, id.vars = idvars)
        [, `:=`(var      = sub('_.*', '', variable),
                year     = sub('.*_', '', variable),
                variable = NULL)],
      ... ~ var, value.var='value')
Comments