T. BruceLee T. BruceLee - 1 month ago 18
R Question

remove/subset or select columns based on a part of the header name

I have a dataframe of 15000 obs. of 45 variables, with a similar structure/mode/class as the example input dataframe given below.

What I would like to do is select the "Date" column + columns containing "_MIN" as part of the header and put them into a new df. & select the "Date" column + columns containing "_MIN" as part of the header and put them into a new df.

So I am basically looking for a way to select or subset whole columns based on a part of the name of the column header. An example: multiple columns contain a "_" underscore and I would like to delete or select these columns.

INPUT dataframe:
Example <- data.frame(Date=seq(as.Date("1979/1/1"),as.Date("1979/1/5"), by="day"), ca_MIN=1:5, ca_MAX=2:6, cb_MIN=3:7, cb_MAX=4:8)

>Example

Date ca_MIN ca_MAX cb_MIN cb_MAX
1 1979-01-01 1 2 3 4
2 1979-01-02 2 3 4 5
3 1979-01-03 3 4 5 6
4 1979-01-04 4 5 6 7
5 1979-01-05 5 6 7 8

OUTPUT dataframes, selected based on the column names "_MAX", "_MIN":
Example_MIN <- data.frame(Date=seq(as.Date("1979/1/1"),as.Date("1979/1/5"), by="day"), ca_MIN=1:5, cb_MIN=3:7)
Example_MAX <- data.frame(Date=seq(as.Date("1979/1/1"),as.Date("1979/1/5"), by="day"), ca_MAX=2:6, cb_MAX=4:8)

>Example_MIN

Date ca_MIN cb_MIN
1 1979-01-01 1 3
2 1979-01-02 2 4
3 1979-01-03 3 5
4 1979-01-04 4 6
5 1979-01-05 5 7


At the moment it works with:

Example_MIN <-Example
Example_MAX <-Example
subset(ExampleMIN,select=-c(2,4))
subset(ExampleMAX,select=-c(3,5))


This however gives me an error for the large dataset (although the outcome works for me):

Error in `[.data.table`(x, r, vars, with = FALSE) : j out of bounds


Is there a less cumbersome way to do this?

I would prefer it also to work with dataset with a large number of columns & with the column "_MIN" & "_MAX" not sorted as odd and even columnnumber?

I have seen post that are related, but I haven't managed this far to create a code that worked...

Answer

Couple of ways to do this.

grepl("_",names(Example) will give you a list of trues and falses for those columns, so:

Example[,grepl("_",names(Example)] will give you the properly subset data.frame.

A more friendly way to do this is using dplyr:

Example %>% select(contains("_"))

an exercise for you is then going to be how to select those columns AND the date column. Hint for the base-R solution: | is OR and & is logical AND. Good luck!