Connor - 1 year ago 78
R Question

Partial String Matching by Row

I'm trying to create a new column in a data frame that holds the number of characters that match between two strings, counting from the left side of both strings.

Each row has a comparison string, which we want to test against a user-supplied string. Given a data frame:

``````
df <- data.frame(x=c("yhf", "rnmqjk", "wok"), y=c("yh", "rnmj", "ok"))

       x    y
1    yhf   yh
2 rnmqjk rnmj
3    wok   ok
``````

Where x is our comparison string and y is our given string, I'm looking to have the values 2, 3, 0 output in column z, like so:

``````
       x    y    z
1    yhf   yh    2
2 rnmqjk rnmj    3
3    wok   ok    0
``````

Essentially, I want each given string (y) checked from left to right against its comparison string (x). As soon as the characters stop lining up, the rest of the string should be ignored and the number of matching characters recorded.

This code works for your example:

``````
df$z <- mapply(function(x, y) which.max(x != y),
               strsplit(as.character(df$x), split=""),
               strsplit(as.character(df$y), split="")) - 1

df
       x    y z
1    yhf   yh 2
2 rnmqjk rnmj 3
3    wok   ok 0
``````

As an outline, `strsplit` splits a character vector into a list of character vectors. Here, each element of those vectors is a single character (because of the `split=""` argument). The `which.max` function returns the first position at which its argument reaches its maximum. Because the vectors returned by `x != y` are logical, `which.max` returns the first position where a difference is observed. `mapply` takes a function and several lists and applies the function to corresponding elements of the lists.
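To make those three building blocks concrete, here is a small sketch that runs each one in isolation. The toy inputs are made up for illustration and are not part of the original data.

```r
# strsplit returns a list with one character vector per input string
chars <- strsplit(c("yhf", "wok"), split = "")
chars[[1]]  # "y" "h" "f"

# which.max on a logical vector gives the index of the first TRUE
which.max(c(FALSE, FALSE, TRUE, TRUE))  # 3

# With no TRUE at all, every value ties for the maximum, so the result is 1
which.max(c(FALSE, FALSE))  # 1

# mapply walks two lists in parallel, passing matching elements to the function
mapply(function(a, b) length(a) + length(b),
       list(1:2, 1:3), list(1:4, 1:5))  # 6 8
```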

Note that this produces warnings that the lengths of the strings don't match. This could be addressed in a couple of ways; the easiest is to wrap the function in `suppressWarnings` if the messages bug you.
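One way the `suppressWarnings` wrapping might look is sketched below. Placing it inside the anonymous function, around the comparison only, is my own choice, so that unrelated warnings elsewhere would still surface.

```r
df <- data.frame(x = c("yhf", "rnmqjk", "wok"), y = c("yh", "rnmj", "ok"))

# Silence only the recycling warning raised by comparing vectors of unequal length
df$z <- mapply(function(x, y) suppressWarnings(which.max(x != y)),
               strsplit(as.character(df$x), split = ""),
               strsplit(as.character(df$y), split = "")) - 1

df$z  # 2 3 0
```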

As the OP notes in the comments, if there are instances where the entire word matches, then `which.max` returns 1 and `z` comes out as 0. To return the length of the string instead, I'd add a second line of code that combines logical subsetting with the `nchar` function:

``````
df$z[as.character(df$x) == as.character(df$y)] <-
  nchar(as.character(df$x[as.character(df$x) == as.character(df$y)]))
``````
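
For anyone who would rather avoid both the recycling warnings and the follow-up correction, here is a rough sketch of a single helper that compares only the overlapping characters. `common_prefix_length` is a name made up for this illustration, not a base R function, and the sketch assumes plain strings with no `NA` values.

```r
# Hypothetical helper (not from base R): number of leading characters shared by x and y
common_prefix_length <- function(x, y) {
  n <- min(nchar(x), nchar(y))          # only the overlapping part can match
  if (n == 0) return(0L)                # an empty string shares nothing
  idx <- seq_len(n)
  mismatch <- substring(x, idx, idx) != substring(y, idx, idx)
  # First mismatch position minus one, or the full overlap if nothing differs
  if (any(mismatch)) which.max(mismatch) - 1L else n
}

df <- data.frame(x = c("yhf", "rnmqjk", "wok"), y = c("yh", "rnmj", "ok"))
df$z <- mapply(common_prefix_length,
               as.character(df$x), as.character(df$y),
               USE.NAMES = FALSE)
df$z                                   # 2 3 0
common_prefix_length("abc", "abc")     # 3: a full match returns the string length
```

Because nothing is recycled, a pair such as "abab" and "ab" gives 2 here, whereas the recycling approach compares "abab" against "abab" and finds no difference at all.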