User2321 - 1 year ago 107
R Question

# Count rows in second data table based on column values of first without merging

I admit the title of the questions is a little cryptic but bear with me. I have two data tables, one small and one very big which contains useless information.

The first dataset (dt1) is in the following format:

``````      CG     MG1     MG2
1:   49693 914569 4417756
2:   50422  22514   31343
3:   90543  90544 4531361
4:  142864 143471  143806
5:  386093   2149 4149104
6: 2674708  23921   24327
``````

So it contains just numbers, some of which can be found in the second data table. The second data table (dt2) contains many fields but the one I need is ID.

``````      ID
1:   49693
2:   49693
3:   49693
4:   49693
5:   49693
6: 2674708
7: 2679818
8: 2680618
9:   49693
10: 2695042
``````

What I want to do is to add 3 more columns in dt1 each one specifying whether there are more than 5 rows in dt2 with ID equal to CG1, MG1, MG2 respectively. My end result would be something like:

``````     CG     MG1     MG2    CG_OK MG1_OK MG2_OK
1:   49693 914569 4417756     1      0      0
2:   50422  22514   31343     0      0      0
3:   90543  90544 4531361     0      0      0
4:  142864 143471  143806     0      0      0
5:  386093   2149 4149104     0      0      0
6: 2674708  23921   24327     0      0      0
``````

There is a 1 in CG_OK because for the CG 49693 we can find 6 rows in dt2 with the same ID.

Ways I can achieve my result: subset dt2 to contain only the values that can be found in CG, MG1 and MG2, then count rows for each one and then merge somehow back.

My question is, is there a better (or well shorter in terms of code) way?? Something like (not working I know):

``````dt1[, CG_OK := ifelse(nrow(dt2[ID == CG]) > 5, 1, 0)]
``````

With:

``````dt1[, paste0(names(dt1),'_OK') := lapply(.SD, function(x) as.integer(x %in% unique(dt2\$ID))), .SDcols = 1:3][]
``````

you get:

``````        CG    MG1     MG2 CG_OK MG1_OK MG2_OK
1:   49693 914569 4417756     1      0      0
2:   50422  22514   31343     0      0      0
3:   90543  90544 4531361     0      0      0
4:  142864 143471  143806     0      0      0
5:  386093   2149 4149104     0      0      0
6: 2674708  23921   24327     1      0      0
``````

And if the conversion into integers isn't necessary:

``````dt1[, paste0(names(dt1),'_OK') := lapply(.SD, `%in%`, unique(dt2\$ID)), .SDcols = 1:3]
``````

``````cols <- names(dt1)