User2321 User2321 - 13 days ago 9
R Question

Count rows in second data table based on column values of first without merging

I admit the title of the questions is a little cryptic but bear with me. I have two data tables, one small and one very big which contains useless information.

The first dataset (dt1) is in the following format:

CG MG1 MG2
1: 49693 914569 4417756
2: 50422 22514 31343
3: 90543 90544 4531361
4: 142864 143471 143806
5: 386093 2149 4149104
6: 2674708 23921 24327


So it contains just numbers, some of which can be found in the second data table. The second data table (dt2) contains many fields but the one I need is ID.

ID
1: 49693
2: 49693
3: 49693
4: 49693
5: 49693
6: 2674708
7: 2679818
8: 2680618
9: 49693
10: 2695042


What I want to do is to add 3 more columns in dt1 each one specifying whether there are more than 5 rows in dt2 with ID equal to CG1, MG1, MG2 respectively. My end result would be something like:

CG MG1 MG2 CG_OK MG1_OK MG2_OK
1: 49693 914569 4417756 1 0 0
2: 50422 22514 31343 0 0 0
3: 90543 90544 4531361 0 0 0
4: 142864 143471 143806 0 0 0
5: 386093 2149 4149104 0 0 0
6: 2674708 23921 24327 0 0 0


There is a 1 in CG_OK because for the CG 49693 we can find 6 rows in dt2 with the same ID.

Ways I can achieve my result: subset dt2 to contain only the values that can be found in CG, MG1 and MG2, then count rows for each one and then merge somehow back.

My question is, is there a better (or well shorter in terms of code) way?? Something like (not working I know):

dt1[, CG_OK := ifelse(nrow(dt2[ID == CG]) > 5, 1, 0)]

Answer

With:

dt1[, paste0(names(dt1),'_OK') := lapply(.SD, function(x) as.integer(x %in% unique(dt2$ID))), .SDcols = 1:3][]

you get:

        CG    MG1     MG2 CG_OK MG1_OK MG2_OK
1:   49693 914569 4417756     1      0      0
2:   50422  22514   31343     0      0      0
3:   90543  90544 4531361     0      0      0
4:  142864 143471  143806     0      0      0
5:  386093   2149 4149104     0      0      0
6: 2674708  23921   24327     1      0      0

And if the conversion into integers isn't necessary:

dt1[, paste0(names(dt1),'_OK') := lapply(.SD, `%in%`, unique(dt2$ID)), .SDcols = 1:3]

Or bit more readable:

cols <- names(dt1)
dt1[, paste0(cols,'_OK') := lapply(.SD, `%in%`, unique(dt2$ID)), .SDcols = cols]