petrbel petrbel - 1 year ago 104
Python Question

Token-based edit distance in Python?

I'm familiar with python's

module, which is commonly used to compute the edit distance between two strings.

I am interested in a function that computes this distance token-wise rather than character-wise. By that I mean that you can only replace, add, or delete whole tokens (instead of single characters).

Example of regular edit distance and my desired tokenized version:

> char_dist("aa bbbb cc",
            "aa b cc")
3  # add the 'b' character three times

> token_dist("aa bbbb cc",
             "aa b cc")
1  # replace the 'bbbb' token with the 'b' token

Is there an existing function that can compute this token-wise distance in Python? I'd rather use something already implemented and tested than write my own code. Thanks for any tips.


First, install the following:

pip install editdistance

Then the following will give you the token-wise edit distance:

import editdistance
# list1 and list2 are lists of tokens, e.g. produced by str.split()
editdistance.eval(list1, list2)


Example:

import editdistance
tokens1 = ['aa', 'bb', 'cc']
tokens2 = ['a', 'bb', 'cc']
editdistance.eval(tokens1, tokens2)
Out[4]: 1
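
If you want to see what a token-wise distance does under the hood, or cannot install an extra package, here is a minimal pure-Python sketch of the standard dynamic-programming Levenshtein distance applied to tokens. The function name `token_dist` (borrowed from the question) and the `tokenize` parameter are my own illustrative choices, not part of any library, and splitting on whitespace is an assumption about what counts as a token:

```python
def token_dist(a, b, tokenize=str.split):
    """Levenshtein distance over tokens instead of characters.

    `tokenize` is a hypothetical hook: by default the strings are split
    on whitespace; pass `list` to fall back to a char-wise distance.
    """
    s, t = tokenize(a), tokenize(b)
    # prev[j] = distance between the tokens of s processed so far and t[:j]
    prev = list(range(len(t) + 1))
    for i, tok_s in enumerate(s, start=1):
        curr = [i]  # turning i tokens into an empty sequence costs i deletions
        for j, tok_t in enumerate(t, start=1):
            cost = 0 if tok_s == tok_t else 1
            curr.append(min(prev[j] + 1,          # delete tok_s
                            curr[j - 1] + 1,      # insert tok_t
                            prev[j - 1] + cost))  # replace, or keep if equal
        prev = curr
    return prev[-1]


print(token_dist("aa bbbb cc", "aa b cc"))        # 1 (token-wise)
print(token_dist("aa bbbb cc", "aa b cc", list))  # 3 (char-wise)
```

This runs in O(len(s) * len(t)) time, so for long inputs a tested, compiled package such as the one above is probably still the better choice.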

For more information, please refer to: