Alex M Alex M - 3 months ago 9
R Question

R - interactive subsetting of rows by vector of column headers

I have 2 different data frames of the following format:

DF1 -

v1 v2 v3 v4 v5
a 1 2 +
b 5 2 + +
c 5 2 + +
d 4 3 + +
e 1 5 + +
f 3 5
g 4 2
h 3 1
i 5 5 + +


DF2 -

v1 v2 v3 v4
a 1 2 +
b 5 2 + +
c 5 2 +
d 4 3 +
e 1 5 +
f 3 5
g 4 2
h 3 1
i 5 5 +


My script gives a scatter plot of v1 & v2, but first I remove rows that have at least one "+" in v3-v4 or v3-v5.

My data frames can be bigger with more v1-v2 pairs, but always have either the v3-v4 or v3-v5 columns with "+". I adjust the code manually to specify columns to plot and which rows to remove depending on the DF format I am working on.

It works well but I wanted to make the script more interactive as follows:

# Select v3-v4 or v3-v5 via interactive gui to give vector of column headers.
remove.vars.vector <- select.list(names(DF), # Select columns as vector of column header names via interactive gui.
multiple = TRUE, # Can choose multiple columns.
title = "Choose variables to remove from data set", # Title on gui.
graphics = TRUE) # Allow launch of gui.

# Return columns from DF with this vector of column headers.
remove.vars.subset <- DF[remove.vars.vector]
# Return rows that have at least one "+" in v3-v4 or v3-v5.
remove.vars.subset.+ <- subset(DF, remove.vars.subset == "+")
# Removes all rows that contain >=1 NA.
complete.data.+ <- remove.vars.subset.+[complete.cases(remove.vars.subset.+), ]
# Combine by rows "complete.data.+" with DF.
combo.list <- rbind(DF,complete.data.+)
# Remove duplicate rows from combined data frame.
complete.data <- combo.list[!duplicated(combo.list, fromLast = FALSE) & !duplicated(combo.list, fromLast = TRUE),]


Problem: The above code doesn't completely strip the data frame of rows that contain at least one "+" in v3-4 or v3-5. The problem appears to be these lines:

# Return rows that have at least one "+" in v3-v4 or v3-v5.
remove.vars.subset.+ <- subset(DF, remove.vars.subset == "+")


I also get a number of rows at the end with only NA in every cell hence complete.cases in the next line of code.

The final data frame therefore still contains some rows with "+" in v3-4 or v3-5.

Question:

Is there a better way to subset rows in a data frame using a vector of column headers that may contain "+" in their rows?

Thank you in advance.

EDIT - 09/08/2016 - 18:54
I just noticed something that I didn't clarify about my data frames. Some of the rows don't have "+" in v3-v4 or v3-v5. These are the rows that I eventually want to keep so I can plot the scatter. I've edited data frames accordingly. I'm just looking at answers to try and understand them. I'm quite new to R still.

Answer

Suppose your data DF is

> DF
  v1 v2 v3 v4 v5
1  1  2  +      
2  5  2  +  +   
3  5  2  +     +
4  4  3     +  +
5  1  5  +     +

and I pick v3 and v4. Then remove.vars.subset, following your code, is

> remove.vars.subset
  v3 v4
1  +   
2  +  +
3  +   
4     +
5  +   

and notice that remove.vars.subset == "+" evaluates as

> remove.vars.subset == "+"
        v3    v4
[1,]  TRUE FALSE
[2,]  TRUE  TRUE
[3,]  TRUE FALSE
[4,] FALSE  TRUE
[5,]  TRUE FALSE

What subset is then doing is to ask R to return rows from the dataframe where the condition evaluates to TRUE, i.e.:

DF[c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),]

concatenating the first and second columns. But there are only 5 rows in the dataframe, whereas there are 10 elements in the logical vector. Since DF only has 5 rows, NAs are appended (so it's as if DF has 10 rows) and the expression is evaluated accordingly. So you see:

> subset(DF, remove.vars.subset == "+")
     v1 v2   v3   v4   v5
1     1  2    +          
2     5  2    +    +     
3     5  2    +         +
5     1  5    +         +
NA   NA NA <NA> <NA> <NA>
NA.1 NA NA <NA> <NA> <NA>

What you might want to try is

DF[!apply(remove.vars.subset, MAR=1, function(x) any(x=="+")), ]
> DF[!apply(remove.vars.subset, MAR=1, function(x) any(x=="+")), ]
[1] v1 v2 v3 v4 v5
<0 rows> (or 0-length row.names)

which returns no rows because all the rows (given the choice of v3 and v4) have at least one "+" in them. But suppose we chose v4 and v5:

> DF[!apply(remove.vars.subset, MAR=1, function(x) any(x=="+")), ]
  v1 v2 v3 v4 v5
1  1  2  +   
Comments