Martin Petri Bagger Martin Petri Bagger - 3 months ago 23
R Question

Levenshtein distance with weight/penalty for adjacency

I am using the string-edit distance (Levenshtein-distance) to compare scan paths from an eye tracking experiment. (Right now I am using the

stringdist
package in R)

Basically the letters of the strings refer to (gaze) position in a 6x4 matrix. The matrix is configured as follows:

[,1] [,2] [,3] [,4]
[1,] 'a' 'g' 'm' 's'
[2,] 'b' 'h' 'n' 't'
[3,] 'c' 'i' 'o' 'u'
[4,] 'd' 'j' 'p' 'v'
[5,] 'e' 'k' 'q' 'w'
[6,] 'f' 'l' 'r' 'x'


If I use the basic Levenshtein distance to compare strings, the comparison of
a
and
g
in a string gives the same estimate as comparicon of
a
and
x
.

E.g.:

'abc' compared to 'agc' -> 1
'abc' compared to 'axc' -> 1


This means that the strings are equally (dis)similar

I would like to be able to put weights on the string comparison in a way that incorporates adjacency in the matrix. E.g. the distance between
a
and
x
should be weighted as larger then that between
a
and
g
.

One way could be to calculate the "walk" (horizontal and vertial steps) from one letter to the other in the matrix and divide by the max "walk"-distance (i.e. from
a
to
x
). E.g. the "walk"-distance from
a
to
g
would be 1 and from
a
to
x
it would be 8 resulting in a weight of 1/8 and 1 respectively.

Is there a way to implement this (in either R or python)?

Answer

You need a version of the Wagner-Fisher algorithm that uses non-unit cost in its inner loop. I.e. where the usual algorithm has +1, use +del_cost(a[i]), etc. and define del_cost, ins_cost and sub_cost as functions taking one or two symbols (probably just table lookups).