![]() |
|
|
| |
|
||||
In information theory, the Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution. It is named after the Russian scientist Vladimir Levenshtein, who considered this distance in 1965. It is useful in applications that need to determine how similar two strings are, such as spell checkers. For example, the Levenshtein distance between "kitten" and "sitting" is 3, since these three edits change one into the other, and there is no way to do it with less than three edits:
It can be considered a generalization of the Hamming distance, which is used for strings of the same length and only considers substitution edits. There are also further generalizations of the Levenshtein distance that consider, for example, exchanging two characters as an operation.
The algorithmA commonly-used algorithm for computing the Levenshtein distance involves the use of an (n + 1) × (m + 1) matrix, where n and m are the lengths of the two strings. Here is pseudocode for a function LevenshteinDistance that takes two strings, str1 of length lenStr1, and str2 of length lenStr2, and computes the Levenshtein distance between them: int LevenshteinDistance ( char str1[ 1..lenStr1 ], char str2[ 1..lenStr2 ] )
// d is a table with lenStr1+1 rows and lenStr2+1 columns
declare int d[ 0..lenStr1, 0..lenStr2 ]
// i1 and i2 are used to iterate over str1 and str2
declare int i1, i2, cost
for i1 from 0 to lenStr1
d[ i1, 0 ] := i1
for i2 from 0 to lenStr2
d[ 0, i2 ] := i2
for i1 from 1 to lenStr1
for i2 from 1 to lenStr2
if str1[ i1 ] = str2[ i2 ] then cost := 0
else cost := 1
d[ i1, i2 ] = minimum(
d[ i1 - 1, i2 ] + 1, // insertion
d[ i1 , i2 - 1 ] + 1, // deletion
d[ i1 - 1, i2 - 1 ] + cost // substitution
)
return d[ lenStr1, lenStr2 ]
Possible improvementsPossible improvements to this algorithm include:
Proof of correctnessThe invariant is that we can transform the initial segment
This proof fails to validate that the number placed in Upper and lower boundsThe Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:
Sample implementationsHaskell
min3 :: Int->Int->Int->Int
min3 x y z = min x (min y z)
cost :: Char->Char->Int
cost a b
| a == b = 0
| otherwise = 1
sumCosts :: String->Char->Int
sumCosts [] a = 0
sumCosts (s:ss) a = cost s a + sumCosts ss a
editDistance :: String->String->Int
editDistance [] [] = 0
editDistance s [] = sumCosts s '-'
editDistance [] t = sumCosts t '-'
editDistance (s:ss) (t:ts) = min3 (cost s t + editDistance ss ts)
(cost s '-' + editDistance ss (t:ts))
(cost '-' t + editDistance (s:ss) ts)
SchemeUses srfi-25 and srfi-42
(define add1 (lambda (x) (+ x 1)))
(define levenshtein-distance
(lambda (s1 s2)
(let* ((width (add1 (string-length s1)))
(height (add1 (string-length s2)))
(d (do ((d (make-array (shape 0 height 0 width) 0))
(x 0 (add1 x)))
((= x width)
(do ((y 0 (add1 y)))
((= y height))
(array-set! d y 0 y))
d)
(array-set! d 0 x x))))
(do-ec (: x (string-length s1)) (: y (string-length s2))
(begin
(array-set!
d (add1 y) (add1 x)
(min
(add1 (array-ref d y (add1 x)))
(add1 (array-ref d (add1 y) x))
(+ (array-ref d y x)
(if (eqv? (string-ref s1 x)
(string-ref s2 y))
0
1))))))
(array-ref d (sub1 height) (sub1 width)))))
External links
|
|
Copyright 2009 wordIQ.com - Privacy Policy
::
Terms of Use
:: Contact Us
:: About Us This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Levenshtein distance". |