sqlck sqlck - 4 years ago 164
R Question

Adding multiple columns to data.table in R

I have a data.table with sequences and number of reads, like so:

sequence num_reads
1: AACCTGCCG 1
2: CGCGCTCAA 12
3: AGTGTGAGC 3
4: TGGGTACAC 11
5: GGCCGCGTG 15
6: CCTTAAGAG 2
7: GCGGAACTG 9
8: GCGTTGTAG 17
9: GTTGTAGCG 20
10: ACACGTGAC 16


I'd like to use data.table to add two new columns to this table, based on the results of applying dpois() with two weights and two lambdas. The correct output should be this (based on using data.frame):

sequence num_reads clus1 clus2
1 AACCTGCCG 1 2.553269503552647000377e-03 1.610220613932057849571e-03
2 CGCGCTCAA 12 1.053993989051599418361e-02 2.887608256917401083896e-02
3 AGTGTGAGC 3 2.085170094567994833468e-02 1.717568654860860896672e-02
4 TGGGTACAC 11 1.806846838374168498498e-02 4.331412385376097462508e-02
5 GGCCGCGTG 15 1.324248858039188620275e-03 5.415587646672919558410e-03
6 CCTTAAGAG 2 8.936443262434262332916e-03 6.440882455728230530922e-03
7 GCGGAACTG 9 4.056186780023639942838e-02 7.444615037365168164207e-02
8 GCGTTGTAG 17 2.385595369261770803265e-04 1.274255916864215588610e-03
9 GTTGTAGCG 20 1.196285397159046524451e-05 9.538289904012846548518e-05
10 ACACGTGAC 16 5.793588753921446012421e-04 2.707793823336458478163e-03


But when I try to use data.table I can't seem to get the right result. Here is what I tried (based on similar questions asked around this topic):

pois = function(n, p, l){return(dpois(as.numeric(as.character(n)), l)*p) }
x = x[, c(paste("clus", seq(1,2), sep = '')) := pois(num_reads, c(0.4,0.6), c(7,8)), by = seq_len(nrow(x))]


And here is the result:

sequence num_reads clus1 clus2
1: AACCTGCCG 1 2.553269503552647000377e-03 2.553269503552647000377e-03
2: CGCGCTCAA 12 1.053993989051599418361e-02 1.053993989051599418361e-02
3: AGTGTGAGC 3 2.085170094567994833468e-02 2.085170094567994833468e-02
4: TGGGTACAC 11 1.806846838374168498498e-02 1.806846838374168498498e-02
5: GGCCGCGTG 15 1.324248858039188620275e-03 1.324248858039188620275e-03
6: CCTTAAGAG 2 8.936443262434262332916e-03 8.936443262434262332916e-03
7: GCGGAACTG 9 4.056186780023639942838e-02 4.056186780023639942838e-02
8: GCGTTGTAG 17 2.385595369261770803265e-04 2.385595369261770803265e-04
9: GCGTTGTAG 20 1.196285397159046524451e-05 1.196285397159046524451e-05
10: ACACGTGAC 16 5.793588753921446012421e-04 5.793588753921446012421e-04


The reason I'm using data.table and not data.frame is that my real data has 100,000s of rows. I studied the answers to this and this but I haven't been able to come up with a solution.

Any tips you have would be much appreciated. Thanks!

Answer Source

We can try

x[,  paste("clus", seq(1,2), sep = ''):= 
      as.list(pois(num_reads, c(0.4,0.6), c(7,8))), by = seq_len(nrow(x))]
x
#    sequence num_reads         clus1        clus2
#1: AACCTGCCG         1 0.00255326950 0.0016102206
#2: CGCGCTCAA        12 0.01053993989 0.0288760826
#3: AGTGTGAGC         3 0.02085170095 0.0171756865
#4: TGGGTACAC        11 0.01806846838 0.0433141239
#5: GGCCGCGTG        15 0.00132424886 0.0054155876
#6: CCTTAAGAG         2 0.00893644326 0.0064408825
#7: GCGGAACTG         9 0.04056186780 0.0744461504
#8: GCGTTGTAG        17 0.00023855954 0.0012742559
#9: GTTGTAGCG        20 0.00001196285 0.0000953829
#10:ACACGTGAC        16 0.00057935888 0.0027077938
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download