Amin Amin - 2 months ago 6
R Question

in R data.table using as.numeric as 0 or 1 value producer coerces NA where should not

In R data.table I have this table:

Company <- c ("A", "A" , "A", "A", "B" , "B", "B", "B")
TopManger <- c(1, 1, 1, 0, 1, 1, 0, 0)
Salary <- c(300, 300, 300, NA, 250,250, NA, 100)
tbl = data.table(company, TopManger, Salary)


that looks like:

Company TopManger Salary
1: A 1 300
2: A 1 300
3: A 1 300
4: A 0 NA
5: B 1 250
6: B 1 250
7: B 0 NA
8: B 0 100


I perform this query:

tbl[, HighPayedComp := as.numeric(Salary[TopManger == 1] > 200), by=Company]


And I expect to get this:

Company TopManger Salary HighPayedComp
1: A 1 300 1
2: A 1 300 1
3: A 1 300 1
4: A 0 NA 1
5: B 1 250 1
6: B 1 250 1
7: B 0 NA 1
8: B 0 100 1


In this minimal example it works but in my big data.table in some arbitrary but fixed situations coerces
NA
where logically it shouldn't:
For example in line 5 I have
NA
for
HighPayedComp
.

Answer

It could be due to the presence of NA in the 'TopManager' column. The == returns NA whenever there is a NA. It can be avoided by either using %in% or creating another condition with !is.na

tbl[, HighPayedComp:= as.integer(salary[TopManager == 1 & 
                         !is.na(TopManager)] > 200), by = Company]

Or using %in%

tbl[, HighPayedComp:= as.integer(salary[TopManager %in% 1] > 200), by = Company]

This is because of the behavior of ==

v1 <- c(NA, 1, 3)
v1 ==1
#[1]    NA  TRUE FALSE

whereas

!is.na(v1) & v1==1
#[1] FALSE  TRUE FALSE

takes care of the NA elements

If we use %in%, the code is basically doing a match and converting it to logical (> 0L)

`%in%`
#function (x, table) 
#match(x, table, nomatch = 0L) > 0L

In the vector above, the first element is NA and doing the logical comparison with == returns NA.

It is also described in the ?"=="

Missing values (NA) and NaN values are regarded as non-comparable even to themselves, so comparisons involving them will always result in NA. Missing values can also result when character strings are compared and one is not valid in the current collation locale.

data

tbl <- data.table(Company = rep(LETTERS[1:3], each =2), TopManager = c(1, 0, NA,
  1, 1, NA), salary = c(250, 300, 220, 180, 240, 200))
Comments