Martin Petri Bagger - 4 years ago 183
R Question

# Levenshtein distance with weight/penalty for adjacency

I am using the string-edit distance (Levenshtein-distance) to compare scan paths from an eye tracking experiment. (Right now I am using the

`stringdist`
package in R)

Basically the letters of the strings refer to (gaze) position in a 6x4 matrix. The matrix is configured as follows:

``````     [,1] [,2] [,3] [,4]
[1,]  'a'  'g'  'm'  's'
[2,]  'b'  'h'  'n'  't'
[3,]  'c'  'i'  'o'  'u'
[4,]  'd'  'j'  'p'  'v'
[5,]  'e'  'k'  'q'  'w'
[6,]  'f'  'l'  'r'  'x'
``````

If I use the basic Levenshtein distance to compare strings, the comparison of
`a`
and
`g`
in a string gives the same estimate as comparicon of
`a`
and
`x`
.

E.g.:

``````'abc' compared to 'agc' -> 1
'abc' compared to 'axc' -> 1
``````

This means that the strings are equally (dis)similar

I would like to be able to put weights on the string comparison in a way that incorporates adjacency in the matrix. E.g. the distance between
`a`
and
`x`
should be weighted as larger then that between
`a`
and
`g`
.

One way could be to calculate the "walk" (horizontal and vertial steps) from one letter to the other in the matrix and divide by the max "walk"-distance (i.e. from
`a`
to
`x`
). E.g. the "walk"-distance from
`a`
to
`g`
would be 1 and from
`a`
to
`x`
it would be 8 resulting in a weight of 1/8 and 1 respectively.

Is there a way to implement this (in either R or python)?

You need a version of the Wagner-Fisher algorithm that uses non-unit cost in its inner loop. I.e. where the usual algorithm has `+1`, use `+del_cost(a[i])`, etc. and define `del_cost`, `ins_cost` and `sub_cost` as functions taking one or two symbols (probably just table lookups).