daRknight daRknight - 1 month ago 5
R Question

Create new variable with reference from another data.table

I know this may be accomplished with a for-loop, but am certain there is a more elegant solution within the construct of

data.table
.

I have two data tables, and will use 'iris' to illustrate my issue:

library("data.table")
A <- as.data.table(iris) #primary data table
B <- A[Sepal.Width > 3, .N, by = Species] #count from A meeting condition

head(A, 3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1: 5.1 3.5 1.4 0.2 setosa
#2: 4.9 3.0 1.4 0.2 setosa
#3: 4.7 3.2 1.3 0.2 setosa

B
# Species N
#1: setosa 42
#2: versicolor 8
#3: virginica 17


I would like to add a new variable to B which is simply the proportion of the data set that B represents, i.e. for the first row the output would be something like:

B[, Proportion := N/nrow(A[Species == "setosa"])]


The RHS of that index would obviously need to be dynamic, referencing the value of the first column in B by row..

It is this iteration that eludes me (though I feel it has to do with the data table key(s) perhaps?); greatly appreciate any help!

Answer

I would approach this as follows:

A <- as.data.table(iris)
B <- A[Sepal.Width > 3, .N, by = .("spec" = Species)]

B[, Proportion := N/nrow(A[Species == spec]), by = spec]

which results in:

> B
         spec  N Proportion
1:     setosa 42       0.84
2: versicolor  8       0.16
3:  virginica 17       0.34

Explanation:

  • By renaming the Species column to spec, you prevent R & data.table from not knowing which column to take for the calculation of the Proportion.
  • Using by = spec takes care of that the correct spec is used in A[Species == spec].