user1017373 user1017373 - 3 months ago 9
R Question

The matching columns from a data frame based on value in a column from other data Frame

I have two data frames
The first one is df1 has 485513 columns and 100 rows,

head(df1)

sample cg1 cg2 cg3 cg4 cg5 cg6 cg7 cg8 cg9 cg10 cg11
AAD_1 33435 33436 33437 33438 33439 33440 33441 33442 33443 33444 33445
AAD_2 0.33 1.33 2.33 3.33 4.33 5.33 6.33 7.33 8.33 9.33 10.33
AAD_3 0.56 1.56 2.56 3.56 4.56 5.56 6.56 7.56 8.56 9.56 10.56
AAD_4 45.9 46.9 47.9 48.9 49.9 50.9 51.9 52.9 53.9 54.9 55.9
AAD_5 46.9 47.9 48.9 49.9 50.9 51.9 52.9 53.9 54.9 55.9 56.9
AAD_6 47.9 48.9 49.9 50.9 51.9 52.9 53.9 54.9 55.9 56.9 57.9
AAD_7 48.9 49.9 50.9 51.9 52.9 53.9 54.9 55.9 56.9 57.9 58.9
AAD_8 49.9 50.9 51.9 52.9 53.9 54.9 55.9 56.9 57.9 58.9 59.9
AAD_9 50.9 51.9 52.9 53.9 54.9 55.9 56.9 57.9 58.9 59.9 60.9
AAD_10 51.9 52.9 53.9 54.9 55.9 56.9 57.9 58.9 59.9 60.9 61.9


and the second one has df2 84 rows and single column. I am aiming to get a subset of df1 using the values in the column from the df2 data frame.

head(df2)
ID
cg1
cg2
cg3
cg4
cg5


The values of df2 are the columns names of my interest from df1 and so I have tried the following one-liner in R.

> UP=(df1 %>% as.data.frame)[,df2$ID]


The Up data frame returns me with unmatched columns from my query df2

And it resulted in a data frame UP with 84 columns and 100 rows but none of the columns the above command line returned is matching with the input query data frame df2.

It would be great if someone suggests me an alternative solution

Answer

In R, we can just do

df[as.character(df2$ID)]

assuming that 'ID' column is factor. In case it is character class, it is more easier

df[df2$ID]

But if there are elements in 'ID' that are not in the column names of 'df', it may be better to use intersect

df[intersect(colnames(df), df2$ID)]

If the 'df' is a data.table, the usual way to subset columns will be to include the with =FALSE. It is mentioned in ?data.table

with

By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables.

When with=FALSE j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is often useful in data.table to select columns dynamically. Note that x[, cols, with=FALSE] is equivalent to x[, .SD, .SDcols=cols].

Therefore, the above commands would be

 df[, as.character(df2$ID), with = FALSE]

or

 df[, df2$ID, with = FALSE] #if 'ID' is already character class.

Or

 df[, intersect(colnames(df), df2$ID), with = FALSE]