Arun - 1 year ago 68

R Question

I just discovered this warning in my script that was a bit strange.

```
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion
```

```
require(data.table)
DT.1 <- data.table(x = letters[1:5], y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)

# works fine
rbindlist(list(DT.1, DT.2))
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: A 11
#  7: B 12
#  8: C 13
#  9: D 14
# 10: E 15
```

However, now if I convert column `x` to a `factor` and do the same:

```
DT.1[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
#      x  y
#  1:  a  6
#  2:  b  7
#  3:  c  8
#  4:  d  9
#  5:  e 10
#  6: NA 11
#  7: NA 12
#  8: NA 13
#  9: NA 14
# 10: NA 15
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion
```

But `rbind` gives the expected result:

```
rbind(DT.1, DT.2) # where DT.1 has column x as factor
# do.call(rbind, list(DT.1, DT.2)) # also works fine
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: A 11
#  7: B 12
#  8: C 13
#  9: D 14
# 10: E 15
```

The same behaviour can be reproduced if column `x` is an `ordered factor`. Yet `?rbindlist` says: `Same as do.call("rbind",l), but much faster.`

Here's my session info:

```
# R version 3.0.0 (2013-04-03)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] data.table_1.8.8
#
# loaded via a namespace (and not attached):
# [1] tools_3.0.0
```

```
# column x in DT.1 is still a factor
rbindlist(list(DT.2, DT.1))
#     x  y
#  1: A 11
#  2: B 12
#  3: C 13
#  4: D 14
#  5: E 15
#  6: 1  6
#  7: 2  7
#  8: 3  8
#  9: 4  9
# 10: 5 10
```

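Here, the column from `DT.1` is silently coerced to `numeric`. I would expect the same result as `rbind(DT2, DT1)`, that is, the behaviour of `rbind`: `rbindlist` should match `rbind`.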
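The values `1` to `5` in rows 6 to 10 are consistent with a factor being stored as integer codes plus a `levels` attribute. The following is a minimal base R sketch of that assumed mechanism, not data.table's actual internals; the names `by_codes` and `by_labels` are made up for illustration.

```r
# A factor stores integer codes; the labels live in the levels attribute.
f <- factor(letters[1:5])
codes <- as.integer(f)        # the underlying codes 1..5
labels <- as.character(f)     # the labels "a".."e"

# Appending the codes to a character vector reproduces the odd column above.
by_codes <- c(LETTERS[1:5], codes)

# Appending the labels gives what one would expect from rbind.
by_labels <- c(LETTERS[1:5], labels)

print(by_codes)
print(by_labels)
```

Converting with `as.character()` before combining is what keeps the labels rather than the codes.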

```
# DT.1 column x is already a factor
DT.2[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: a 11
#  7: b 12
#  8: c 13
#  9: d 14
# 10: e 15
```

Here, the column `x` from `DT.2` is silently replaced by the levels of `DT.1`: the values from `DT.1` appear where those of `DT.2` should be.

In general, there seems to be a problem with handling `factor` columns in `rbindlist`.

Answer Source

I believe that `rbindlist`, when applied to factors, is combining the numerical values of the factors and using only the levels associated with the first list element.

As in this bug report: http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975
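To make that hypothesis concrete, here is a rough base R simulation. It is only a sketch of the suspected behaviour, not the actual `rbindlist` source; `f1` and `f2` stand in for `DT.1$x` and `DT.2$x`, and `by_codes` and `by_labels` are illustrative names.

```r
f1 <- factor(letters[1:5])   # stands in for DT.1$x
f2 <- factor(LETTERS[1:5])   # stands in for DT.2$x

# Suspected behaviour: concatenate the integer codes and keep only f1's levels.
codes <- c(as.integer(f1), as.integer(f2))
by_codes <- factor(levels(f1)[codes], levels = levels(f1))

# Label-based combination, which matches the rbind output in the question.
by_labels <- factor(c(as.character(f1), as.character(f2)))

print(by_codes)    # the second half repeats a..e instead of A..E
print(by_labels)
```

Both factors use codes 1 to 5, so relabelling the second set of codes with the first factor's levels turns `A`..`E` into `a`..`e`, as seen in the question.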

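```
# Temporary workaround: give both columns the same set of levels.
# unique() guards against duplicated levels when values repeat.
levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))
DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]
rbindlist(list(DT.1, DT.2))
```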
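The same idea can be sketched in base R for any number of factors. `unify_levels` is a hypothetical helper written for this illustration and is not part of data.table. Once every factor shares one level set, even a code-based concatenation gives the right labels.

```r
# Hypothetical helper: give every factor in a list the same level set.
unify_levels <- function(factors) {
  all_levels <- unique(unlist(lapply(factors, levels)))
  lapply(factors, function(f) factor(as.character(f), levels = all_levels))
}

f1 <- factor(letters[1:5])
f2 <- factor(LETTERS[1:5])
unified <- unify_levels(list(f1, f2))

# With shared levels, concatenating the integer codes is safe.
codes <- unlist(lapply(unified, as.integer))
shared <- levels(unified[[1]])
combined <- factor(shared[codes], levels = shared)

print(combined)
```

The level order here follows first appearance across the inputs; sort `all_levels` inside the helper if a different order is wanted.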

As another view of what's going on:

```
DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)
DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]
DT3
DT4
# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd
do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd
```

As for observation 1, what's happening is similar to:

```
x <- factor(LETTERS[1:5])
x[6:10] <- letters[1:5] # warning: lowercase letters are not levels of x, so NAs are generated
x
# Notice, however, that assigning a value that is already a level works:
x[11] <- "S" # warning, since `S` is not one of the levels of x
x[12] <- "D" # all good, since `D` *is* one of the levels of x
```
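For comparison, a short base R sketch of how the warning can be avoided when assigning into a factor: extend the levels first, then assign. The variable names `y` and `z` are illustrative only.

```r
x <- factor(LETTERS[1:5])

# Assigning a value that is not a level produces NA (warning suppressed here).
y <- x
suppressWarnings(y[6] <- "a")

# Adding the level first lets the same assignment succeed.
z <- x
levels(z) <- c(levels(z), "a")
z[6] <- "a"

print(y)
print(z)
```

This is the same principle as the workaround above: the target factor has to know about a level before it can hold a value with that level.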