user6678274 - 10 months ago 53
R Question

# The best way to compare strings

In R I have `data` like this:

``````
ID
Peter
peter
peterr
john
johN
JOhn
...
``````

I simply want to group the entries that refer to the same person. For example, all entries that look like Peter should be collected together, so my new data set would look like this:

``````
ID
Peter, peter, peterr
john, johN, JOhn
...
``````

So I want to write code that takes `peter, Peter, peterr` and collects them into one group, and I want to do this for all the names.

What is the best way to do this?

Answer

The function `adist()` calculates the Levenshtein distance between strings.

``````
df1 <- data.frame(ID = c("Peter", "peter", "peterr", "john", "johN", "JOhn"))
adist(df1$ID)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    0    1    2    5    5    5
# [2,]    1    0    1    5    5    5
# [3,]    2    1    0    6    6    6
# [4,]    5    5    6    0    1    2
# [5,]    5    5    6    1    0    3
# [6,]    5    5    6    2    3    0
``````

Smaller distance values indicate greater similarity. Row i and column i of the matrix both correspond to the i-th string in the vector `df1$ID`, so the entry in row 1, column 2 is the distance between "Peter" and "peter".
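Note that the comparison above is case-sensitive, which is why "Peter" and "peter" are one edit apart. If differences in case should not count at all (an assumption about the data, not something stated in the question), `adist()` accepts an `ignore.case` argument. A minimal sketch, where the vector name `ids` and the matrix name `d_ci` are made up for illustration:

```r
# Same six strings as in the example above
ids <- c("Peter", "peter", "peterr", "john", "johN", "JOhn")

# Case-insensitive Levenshtein distances
d_ci <- adist(ids, ignore.case = TRUE)

# Label rows and columns with the strings to make the matrix easier to read
dimnames(d_ci) <- list(ids, ids)
d_ci
```

With this setting, strings that differ only in case have distance 0, and "peterr" is the only variant within its group that still has a nonzero distance.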

The programming task then consists in identifying the pairs which have a small distance. Here is one possibility:

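``````
dm <- adist(df1$ID)
# Keep only the upper triangle so each pair is reported once
dm <- dm * upper.tri(dm)
# Pairs whose distance is exactly 1 (greater than 0 and less than 2)
which(dm > 0 & dm < 2, arr.ind = TRUE)
#     row col
#[1,]   1   2
#[2,]   2   3
#[3,]   4   5
``````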

These three pairs (1,2), (2,3) and (4,5) give the index numbers of the strings that can be considered very similar. Those are: "Peter" and "peter", "peter" and "peterr", as well as "john" and "johN".
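To read the result without looking up the indices by hand, the index pairs can be mapped back to the strings. This is only a sketch; the names `ids`, `idx` and `pairs` are chosen for illustration:

```r
ids <- c("Peter", "peter", "peterr", "john", "johN", "JOhn")

dm <- adist(ids)
dm <- dm * upper.tri(dm)                       # keep each pair once
idx <- which(dm > 0 & dm < 2, arr.ind = TRUE)  # columns are named "row" and "col"

# Translate the index pairs into the strings themselves
pairs <- data.frame(
  first    = ids[idx[, "row"]],
  second   = ids[idx[, "col"]],
  distance = dm[idx],          # matrix indexing with a two-column index matrix
  stringsAsFactors = FALSE
)
pairs
#   first second distance
# 1 Peter  peter        1
# 2 peter peterr        1
# 3  john   johN        1
```

Because of the `dm > 0` condition, exact duplicates (distance 0) are not reported as pairs. That does not matter for this data, but it would if the same spelling occurred twice.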

The similarity requirement can be relaxed by raising the distance cutoff, e.g., `which(dm > 0 & dm < 3, arr.ind=TRUE)`. This also accepts pairs that are two edits apart and therefore results in a larger number of similar pairs.
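The pairs still have to be merged into groups to get the layout asked for in the question. One possible way, which goes beyond the pairwise approach above, is to cluster on the distance matrix with `hclust()` and cut the tree at a chosen distance with `cutree()`. The following is a sketch under assumptions: single linkage and a cutoff of 2 are illustrative choices that happen to suit this small example, and the names `ids`, `hc`, `groups` and `collected` are made up here.

```r
ids <- c("Peter", "peter", "peterr", "john", "johN", "JOhn")

# Full (symmetric) distance matrix; do not zero out the lower triangle here
dm <- adist(ids)

# Single linkage: a string joins a group if it is close to any member of it
hc <- hclust(as.dist(dm), method = "single")

# Cut the tree so that merges at distance 2 or less stay together
groups <- cutree(hc, h = 2)

# Collapse each group into one comma-separated string
collected <- unname(vapply(split(ids, groups), paste, character(1), collapse = ", "))
data.frame(ID = collected)
#                     ID
# 1 Peter, peter, peterr
# 2     john, johN, JOhn
```

Single linkage can chain unrelated names together when the data contains many short or similar strings, so the cutoff and linkage method should be checked against real data before relying on the groups.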

Source (Stackoverflow)