user6678274 - 3 months ago
R Question

The best way to compare strings

In R I have data like this:

ID
Peter
peter
peterr
john
johN
JOhn
...


I simply want to group all entries that refer to the same person. For example, all names similar to Peter should be collected together, so my new data set would look like this:

ID
Peter, peter, peterr
john, johN, JOhn
...


So I want to write code that takes peter, Peter, peterr and groups them together, and I want to do this for all the names.

What is the best way to do this?

Answer

The function adist() calculates the Levenshtein distance between strings.

df1 <- data.frame(ID=c("Peter", "peter", "peterr",   "john",   "johN",   "JOhn"))
adist(df1$ID)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    1    2    5    5    5
[2,]    1    0    1    5    5    5
[3,]    2    1    0    6    6    6
[4,]    5    5    6    0    1    2
[5,]    5    5    6    1    0    3
[6,]    5    5    6    2    3    0

Smaller distance values indicate greater similarity. The index (row) number of the six words "Peter", "peter" etc. within the vector df1$ID corresponds to the column / row number in the matrix.
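Many of the differences in this example are purely differences in case. As a minimal sketch (the vector name ids and the matrix names d_ic and d_lower are just illustrative), case can be ignored either through the ignore.case argument of adist() or by lower-casing the strings first, so that only real spelling differences such as the extra "r" in "peterr" contribute to the distance:

```r
ids <- c("Peter", "peter", "peterr", "john", "johN", "JOhn")

# Option 1: ask adist() to ignore case when computing distances.
d_ic <- adist(ids, ignore.case = TRUE)

# Option 2: normalise the case ourselves, then compare.
d_lower <- adist(tolower(ids))

# Label rows and columns with the strings so the matrix is easier to read.
dimnames(d_ic) <- list(ids, ids)
d_ic
```

Whether ignoring case is appropriate depends on the data; for personal names it is usually a reasonable assumption.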

The programming task then consists in identifying the pairs which have a small distance. Here is one possibility:

dm <- adist(df1$ID)
dm <- dm*upper.tri(dm)
which(dm > 0 & dm < 2, arr.ind=TRUE)
#     row col
#[1,]   1   2
#[2,]   2   3
#[3,]   4   5

These three pairs (1,2), (2,3) and (4,5) give the index numbers of the strings that can be considered very similar. Those are: "Peter" and "peter", "peter" and "peterr", as well as "john" and "johN".
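To read the result without looking up indices by hand, the index pairs can be mapped back to the strings. This is only a sketch; the names ids, pairs and similar are illustrative. Note that the condition dm > 0 also drops exact duplicates (distance 0), which does not matter here because all six strings differ.

```r
ids <- c("Peter", "peter", "peterr", "john", "johN", "JOhn")

dm <- adist(ids)
dm <- dm * upper.tri(dm)   # keep each pair once (upper triangle only)

# Index pairs with a distance of exactly 1.
pairs <- which(dm > 0 & dm < 2, arr.ind = TRUE)

# Translate row/column indices back into the strings they refer to.
similar <- data.frame(
  first    = ids[pairs[, "row"]],
  second   = ids[pairs[, "col"]],
  distance = dm[pairs]
)
similar
```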

The similarity criterion can be relaxed by raising the distance cutoff, e.g., which(dm > 0 & dm < 3, arr.ind=TRUE). This yields a larger number of similar pairs.
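The pairs still have to be merged into groups to get the layout asked for in the question. One possible way, which is a swapped-in technique and not part of the approach above, is single-linkage hierarchical clustering on the distance matrix with hclust() and cutree(). The sketch below assumes that case should be ignored and that a cutoff of 1 edit is suitable; both are assumptions to tune for real data, and the names ids, grp and result are illustrative.

```r
ids <- c("Peter", "peter", "peterr", "john", "johN", "JOhn")

# Pairwise Levenshtein distances, ignoring case (assumption).
d <- adist(tolower(ids))

# Single linkage: a string joins a group if it is close to at least
# one member of that group.
hc <- hclust(as.dist(d), method = "single")

# Cut the tree at distance 1 (assumed cutoff).
grp <- cutree(hc, h = 1)

# Paste the members of each group into one comma-separated entry.
result <- data.frame(
  ID = unname(vapply(split(ids, grp), paste, character(1), collapse = ", "))
)
result
```

Single linkage chains neighbours together, so with a generous cutoff unrelated names can end up in the same group; on larger data sets it is worth checking the groups by eye.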
