Dennix - 1 year ago 129
R Question

# Calculate Jaccard similarity between each pair of words in 2 vectors

I need to calculate the Jaccard similarity between the words in 2 vectors, comparing every word against every word, and then extract the most similar word.

Here is my bad, slow code:

``````
library(stringdist)

txt1 <- c('The quick brown fox jumps over the lazy dog')
txt2 <- c('Te quick foks jump ovar lazzy dogg')

words <- strsplit(as.character(txt1), " ")
words.p <- strsplit(as.character(txt2), " ")

r <- length(words[[1]])
c <- length(words.p[[1]])

# m[i, j] is the distance between word i of txt1 and word j of txt2
m <- matrix(nrow=r, ncol=c)
for (i in 1:r){
  for (j in 1:c){
    m[i,j] = stringdist(tolower(words.p[[1]][j]), tolower(words[[1]][i]), method='jaccard', q=2)
  }
}

# row index (word of txt1) of every cell holding the minimum distance
ind <- which(m == min(m), arr.ind=TRUE)[, "row"]
words[[1]][ind]
``````

Please help me improve and tidy this code so that it works on a large data frame.

Preparation (added `tolower` here):

``````
library(stringdist)

txt1 <- c('The quick brown fox jumps over the lazy dog')
txt2 <- c('Te quick foks jump ovar lazzy dogg')

words <- unlist(strsplit(tolower(as.character(txt1)), " "))
words.p <- unlist(strsplit(tolower(as.character(txt2)), " "))
``````

Get distances for each word:

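``````
# one vector of distances to all of words.p per element of words
# (q is left at its default of 1 here, not q=2 as in the question)
dists <- sapply(words, Map, f=stringdist, list(words.p), method="jaccard")
``````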
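As an alternative sketch, the `stringdist` package also provides `stringdistmatrix`, which computes all pairwise distances in a single vectorised call. This is not the approach used above, just another way to get the same distances that may scale better to long vectors. The word vectors are re-declared so the snippet runs on its own, and `d` and `closest` are illustrative names.

```r
library(stringdist)

words   <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
words.p <- c("te", "quick", "foks", "jump", "ovar", "lazzy", "dogg")

# Rows correspond to `words`, columns to `words.p`.
d <- stringdistmatrix(words, words.p, method = "jaccard", q = 1)

# Index of the smallest distance in each row picks the closest candidate.
closest <- words.p[apply(d, 1, which.min)]
```

Passing `q = 2` instead would compare character bigrams, as in the question's original loop.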

For each word in `words` find the closest word from `words.p`:

``````
matches <- words.p[sapply(dists, which.min)]

cbind(words, matches)
     words   matches
[1,] "the"   "te"
[2,] "quick" "quick"
[3,] "brown" "ovar"
[4,] "fox"   "foks"
[5,] "jumps" "jump"
[6,] "over"  "ovar"
[7,] "the"   "te"
[8,] "lazy"  "lazzy"
[9,] "dog"   "dogg"
``````
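To see why "brown" ends up paired with "ovar", it may help to compute the Jaccard distance by hand. The following is a minimal base-R sketch of the set-based definition, 1 - |X ∩ Y| / |X ∪ Y| over the sets of q-grams. The helpers `qgrams` and `jaccard_dist` are hypothetical names introduced for illustration; they are not part of `stringdist` and make no attempt to reproduce its edge-case handling.

```r
# Hypothetical helper: the set of distinct q-grams of a single string.
qgrams <- function(s, q = 1) {
  n <- nchar(s)
  if (n < q) return(character(0))
  unique(substring(s, 1:(n - q + 1), q:n))
}

# Hypothetical helper: Jaccard distance between two q-gram sets.
jaccard_dist <- function(a, b, q = 1) {
  x <- qgrams(a, q)
  y <- qgrams(b, q)
  u <- length(union(x, y))
  if (u == 0) return(0)  # assumption: two empty sets count as identical
  1 - length(intersect(x, y)) / u
}

jaccard_dist("brown", "ovar")         # shares "o" and "r" out of 7 letters: 1 - 2/7
jaccard_dist("lazy", "lazzy")         # 0, both reduce to the same set of letters
jaccard_dist("brown", "ovar", q = 2)  # 1, no bigram in common
```

With single characters (`q = 1`) no candidate shares more than two letters with "brown", so "ovar" wins despite being a poor match. With bigrams (`q = 2`), "brown" has distance 1 to "ovar", which makes such weak matches easier to spot and filter out with a threshold.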