I'm trying to create a column in a data frame that holds the number of characters that match between two strings, counting from the left side of both strings.
Each row has a comparison string, which we want to test against a user-given string. Given a data frame:
df <- data.frame(x=c("yhf", "rnmqjk", "wok"), y=c("yh", "rnmj", "ok"))
x y
1 yhf yh
2 rnmqjk rnmj
3 wok ok
I'd like to add a column z holding that count:
x y z
1 yhf yh 2
2 rnmqjk rnmj 3
3 wok ok 0
This code works for your example:
df$z <- mapply(function(x, y) which.max(x != y),
               strsplit(as.character(df$x), split=""),
               strsplit(as.character(df$y), split="")) - 1
df
x y z
1 yhf yh 2
2 rnmqjk rnmj 3
3 wok ok 0
As an outline:
strsplit splits a character vector into a list of character vectors. Here, with the split="" argument, each element of those vectors is a single character.
which.max returns the first position at which its argument attains its maximum. Since the vectors returned by x != y are logical, which.max returns the first position where a difference is observed.
mapply takes a function and lists, and applies the function to corresponding elements of the lists.
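To make the outline concrete, here is a small sketch that walks through the intermediate values for the second row of the example. The variable names (x_chars, y_chars, diffs, first_diff) are only for illustration, and suppressWarnings is used here just to keep the output quiet.

```r
# Split each string into single characters
x_chars <- strsplit("rnmqjk", split = "")[[1]]  # "r" "n" "m" "q" "j" "k"
y_chars <- strsplit("rnmj", split = "")[[1]]    # "r" "n" "m" "j"

# Element-wise comparison; the shorter vector is recycled (with a warning)
diffs <- suppressWarnings(x_chars != y_chars)
# FALSE FALSE FALSE TRUE TRUE TRUE

# First position where the strings differ
first_diff <- which.max(diffs)  # 4

# Number of matching characters from the left
first_diff - 1  # 3
```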
Note that this produces warnings when the lengths of the strings don't match. This could be addressed in a couple of ways; the easiest is wrapping the call in suppressWarnings if the messages bug you.
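For instance, the suppressWarnings approach might look like the sketch below, which wraps the whole mapply call from the example; the data frame is recreated so the snippet runs on its own.

```r
df <- data.frame(x = c("yhf", "rnmqjk", "wok"),
                 y = c("yh", "rnmj", "ok"))

# Same computation as before, with the recycling warnings silenced
df$z <- suppressWarnings(
  mapply(function(x, y) which.max(x != y),
         strsplit(as.character(df$x), split = ""),
         strsplit(as.character(df$y), split = ""))
) - 1

df
```

Keep in mind that this hides every warning raised inside the call, not only the length-mismatch ones.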
As the OP notes in the comments, if there are instances where the entire word matches, then which.max returns 1, because there is no TRUE value and the first position is returned. To return the length of the string instead, I'd add a second line of code that combines logical subsetting with the nchar function:
df$z[as.character(df$x) == as.character(df$y)] <- nchar(as.character(df$x[as.character(df$x) == as.character(df$y)]))
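As an alternative to the two-step fix, the full-match case can be folded into a single helper. This is only a sketch of a different technique, not the code above: common_prefix_len is a hypothetical name, and it compares just the overlapping characters, so nothing is recycled and no warning is raised. The fourth row of the data frame is added purely to illustrate a full match.

```r
# Hypothetical helper: count matching characters from the left of two strings
common_prefix_len <- function(x, y) {
  x_chars <- strsplit(x, split = "")[[1]]
  y_chars <- strsplit(y, split = "")[[1]]
  n <- min(length(x_chars), length(y_chars))
  # Compare only the first n characters, so the lengths always agree
  mismatch <- x_chars[seq_len(n)] != y_chars[seq_len(n)]
  # If nothing differs in the overlap, the whole overlap matches
  if (any(mismatch)) which.max(mismatch) - 1L else n
}

df <- data.frame(x = c("yhf", "rnmqjk", "wok", "abc"),
                 y = c("yh", "rnmj", "ok", "abc"))

df$z <- mapply(common_prefix_len,
               as.character(df$x), as.character(df$y),
               USE.NAMES = FALSE)

df
```

Restricting the comparison to the overlap also covers the case where one string is an exact repetition of the other (for example "abab" and "ab"), where recycling would otherwise report no difference at all.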