I often need to select a set of variables from a data.frame in R.
My research is in the social and behavioural sciences, and it is quite common to have a data.frame with several hundred variables (e.g., item-level information for a range of survey questions, demographic items, performance measures, etc.).
As part of analyses, I'll often want to select a subset of variables.
For example, I might want to get:
- descriptive statistics for a set of variables
- correlation matrix on a set of variables
- factor analysis on a set of variables
- predictors in a linear model
Now, I know that there are many ways to write the code to select a subset of variables.
Quick-R has a nice overview of common ways of extracting variable subsets from a data.frame; for example:
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
However, I'm interested in the efficiency of this process, particularly where you might need to extract 20 or so variables from a data.frame. The naming convention of variables is often not intuitive, especially where you've inherited a dataset from someone else, so you might be left wondering, was the variable
Multiply this by 20 variables that need to be extracted, and the task of memorising variable names becomes more complicated than it needs to be.
To make the following discussion concrete, I'll use the bfi data.frame in the psych package:
df <- bfi
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
61617 2 4 3 4 4 2 3 3 4 4 3 3 3 4 4 3 4 2 2 3 3 6 3 4
O5 gender education age
61617 3 1 NA 16
- How can I efficiently select an arbitrary set of variables? For concreteness, I'll choose:
A1, A2, A3, A5, C2, C3, C5, E2, E3, gender, education, age
My current strategy
I currently have a range of strategies that I use.
Of course sometimes I can exploit things like the numeric position of the variables or the naming convention and use either
to select or
to construct. But sometimes I need a more general solution. I've used the following over time:
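As a rough illustration of exploiting position or naming conventions, here is a minimal sketch. The toy data.frame only mimics the bfi column names (an assumption, so the code runs without the psych package), and grep and paste0 are just one plausible pair of tools for the "select" and "construct" cases.

```r
# Toy stand-in for bfi: same column names, random values (assumption).
set.seed(1)
items <- paste0(rep(c("A", "C", "E", "N", "O"), each = 5), 1:5)
all_names <- c(items, "gender", "education", "age")
df <- as.data.frame(matrix(sample(1:6, 3 * length(all_names), replace = TRUE),
                           nrow = 3, dimnames = list(NULL, all_names)))

# By numeric position: the first five columns are the A items.
by_position <- names(df)[1:5]

# By naming convention: match names with a regular expression.
by_pattern <- grep("^[AC][0-9]$", names(df), value = TRUE)

# By construction: build the names instead of typing each one.
by_paste <- c(paste0("A", c(1:3, 5)), paste0("C", c(2, 3, 5)))

# Any of these character vectors can index the data.frame.
subset_df <- df[by_paste]
```

These shortcuts only help when the wanted variables follow a pattern; an arbitrary set mixing items and demographics still has to be listed by name.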
In the early days, I used to call names(df), copy the quoted variable names and then edit until I had what I wanted.
2. Use a database
Sometimes I'll have a separate data.frame that stores each variable as a row, with columns for variable names and variable labels, plus a column that indicates whether the variable should be retained for a particular analysis. I can then filter on that variable and extract a vector of variable names. I find this particularly useful when I'm developing a psychological test and want to include or exclude certain items across various iterations.
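A minimal sketch of this metadata approach might look like the following. The table name `meta`, its column names, the labels, and the `keep_v1` flag are all hypothetical, chosen only for illustration.

```r
# Hypothetical metadata table: one row per variable in the dataset.
meta <- data.frame(
  name    = c("A1", "A2", "A3", "A4", "A5", "gender", "education", "age"),
  label   = c(paste("Agreeableness item", 1:5), "Gender", "Education", "Age"),
  keep_v1 = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Filter on the flag column to get the variable names for this analysis.
vars_v1 <- meta$name[meta$keep_v1]

# Toy data with matching columns (assumption), then subset by name.
df <- as.data.frame(matrix(0, nrow = 2, ncol = nrow(meta),
                           dimnames = list(NULL, meta$name)))
analysis_df <- df[vars_v1]
```

Adding one flag column per iteration of the test keeps each item selection documented alongside the variable labels.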
As Hadley Wickham once pointed out to me, dput is a good option; e.g., dput(names(df)) is better than names(df) in that it outputs a list that is already in the c("var1", "var2", ...) format:
c("A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5",
"E1", "E2", "E3", "E4", "E5", "N1", "N2", "N3", "N4", "N5", "O1",
"O2", "O3", "O4", "O5", "gender", "education", "age")
This can then be copied into the script and edited.
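The copy-and-edit workflow could be sketched as follows, using the example selection from earlier in the question. The toy data.frame is an assumption that only reproduces the bfi column names; the `%in%` check is an optional extra to catch typos introduced while editing.

```r
# Toy stand-in for bfi: same column names, placeholder values (assumption).
all_names <- c(paste0(rep(c("A", "C", "E", "N", "O"), each = 5), 1:5),
               "gender", "education", "age")
df <- as.data.frame(matrix(0, nrow = 2, ncol = length(all_names),
                           dimnames = list(NULL, all_names)))

# Prints the names as a c("A1", "A2", ...) call, ready to paste into a script.
dput(names(df))

# The pasted output, edited down to the variables of interest.
myvars <- c("A1", "A2", "A3", "A5", "C2", "C3", "C5", "E2", "E3",
            "gender", "education", "age")

# Guard against typos made while editing the list.
stopifnot(all(myvars %in% names(df)))

newdata <- df[myvars]
```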
But can it be more efficient?
dput(names(df)) is a pretty good variable selection strategy. The efficiency of the process largely depends on how proficient you are at copying the text into your script and then editing the list of names down to those desired.
However, I still remember the efficiency of GUI based systems of variable selection.
For example, in SPSS when you interact with a dialogue box you can point and click with the mouse to pick the variables you want from the dataset. You can shift-click to select a range of variables, or hold shift and press the down key to select one or more variables, and so on. You can then press Paste, and the command with the extracted variable names is pasted into your script editor.
So, finally the core question
- Is there a simple no frills GUI device that permits the selection of variables from a data.frame (e.g., something like
opens a GUI window for variable selection), and returns a vector of the selected variable names in the form c("var1", "var2", ...)?
- Is dput(names(df)) the best general option for selecting a set of variable names in R? Or is there a better way?