Tom Dale Tom Dale - 2 months ago 7
R Question

Subset R data frame based on string matches in two columns

I have a data frame with three columns and thousands of rows. The first two columns (x and y) contain character strings, and the third (z) contains numeric data. I need to subset the data frame based on matching values in both of the first two columns.

x <- c("a", "b", "c", "d", "f", "g", "h", "i", "j", "k")
y <- c("h", "b", "k", "a", "g", "d", "i", "c", "f", "j")
z <- c(1:10)
df <- data.frame(x, y, z)

x y z
1 a h 1
2 b b 2
3 c k 3
4 d a 4
5 f g 5
6 g d 6
7 h i 7
8 i c 8
9 j f 9
10 k j 10


Say this is my table, and the values I am interested in are "a", "c", "f", "h" and "k". I only want to return the rows in which both x and y contain one of the five, so in this case rows 1 and 3.

I've tried:

df2 <- filter(df,
x == ("a" | "c" | "f" | "h" | "k") &
y == ("a" | "c" | "f" | "h" | "k"))


but this doesn't work for factors or character strings. Is there an equivalent or another way around this?

Thanks in advance.

lmo lmo
Answer

I think this returns what you are looking for:

# build vector of necessary elements
mustHaves <- c("a", "c", "f", "h", "k")
# perform subsetting
df[with(df, x %in% mustHaves & y %in% mustHaves),]
  x y z
1 a h 1
3 c k 3

data

df <- data.frame(x, y, z, stringsAsFactors = FALSE)