SYNOPSIS

DESCRIPTION

LEVENSHTEIN DISTANCES

DAMERAU DISTANCE

NEEDLEMAN-WUNSCH DISTANCE

HAMMING DISTANCES

BLOOM FILTER DISTANCES

JACCARD DISTANCE

MINKOWSKI DISTANCE

RETURN VALUES

BUGS

AUTHOR

SEE ALSO

DISTANCE ( 3 ) OpenBSD Programmer's Manual DISTANCE ( 3 )

` distance` - compute the distance between two pieces of data

(levenshtein_d *const void , d1 , size_t len1 *const void ,d2);size_t len2
(damerau_d *const void , d1 , size_t len1 *const void , d2);size_t len2
(needleman_wunsch_d *const void , d1 , size_t len1 *const void ,d2 , size_t len2 *matrix);m
(hamming_d *const void , d1 , size_t len1 *const void , d2);size_t len2
(bloom_create *const void , data , size_t len *const void ,digest);size_t digest_len
(bloom_d *const void , digest1 *const void , digest2);size_t digest_len
(jaccard_d *const void , d1 , size_t len1 *const void , d2);size_t len2
(minkowski_d *const void , d1 , size_t len1 *const void , d2 ,size_t len2);int power
*const void , d1 , size_t len1 *const void , d2);size_t len2
*const void , d1 , size_t len1 *const void , d2);size_t len2 |

The Inputs are operated on pairwise as * and the edit distance value is returned.t |

Levenshtein distance (LD) is a measure of the similarity between two inputs, which we will refer to as the

source ` s` and the target input

quired to transform

If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are al-

ready identical.

If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to

transform s into t.

The greater the Levenshtein distance, the more different the inputs are.

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algo-

rithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit dis-

tance.

OpenBSD February 14, 2004 1

DISTANCE ( 3 ) OpenBSD Programmer's Manual DISTANCE ( 3 )

The Damerau distance is almost identical to the Levenshtein distance but can tolerate adjascent charac- ters that have been swapped, a common spelling mistake or typographic error. For example, if s is "abcd" and t is "acbd", then DD(s,t) is 0 because of the transposition of "b" and "c". Other costs found in the Lev- enshtein distance are identical. |

The Levenshtein distance algorithm assumes that the cost of all insertions or conversions is equal. Howev- er, in some scenarios this may not be desirable and may mask the acceptable distances between inputs. A modified Levenshtein distance algorithm, also known as the Needleman-Wunsch distance algorithm, accepts a cost matrix as an extra input. This matrix structure contains two cost matricies for each pair of characters to convert from and to. The costs of inserting this character and converting between charac- ters is listed in this matrix. The structure of the matrix is as follows:
These values should be assigned before the distance algorithm is used. |

The Hamming distance H is defined only for inputs of the same length. For two inputs ` s` and

the number of places in which the two string differ, i.e., have different characters.

A Bloom Filter (BF) is a method of encoding a variable length piece of input data to a fixed length signa- ture. The basic idea is to map attributes of the data to specific positions in a binary vector. First, the result vector is initialized to all 0's. Next, a series of hash functions are applied to the data to map some part of the input data to a position in the result vector. Thus, if a is the input data, and m is length of the result vec- tor, then hash(a) produces a value in the range 0...m-1. A similar method encoding data into a fixed length result vector can be accomplished using a crytographic digest. A digest function like MD5 is designed such that the difference between any two digests does not in any way correlate to similar differences in input messages. This is an important property in certain ap- plications however it may also be desirable that the difference between digests corresponds to differences in the input messages. Because a Bloom filter is derived from attributes of the input message, the differ- ence between any two Bloom Filter digests corresponds to differences in the input data. This Bloom filter implementation is meant take any byte sequence as input so a generic hash function is used. The input stream is divided into chunks and hashed using a 32 bit adler modulo m. In order to get the most complete coverage, the chunk N+1 includes all but one of the bytes in chunk N and one byte not included in chunk N. (i.e. the chunk window slides to right one byte per hash.) The difference between any two Bloom filter digests is computed by taking the logical and of the two di- gests and computing the number of bit of similarity. The similar count is then divided by maximum of the total number of bit in either digest. This acts to normalize the bit count for large and small signatures. The similarity is then turned into a distance by take 1 - normalized bit count. |

The Jaccard similarity between two strings is the ratio of the size of their intersection to the size of their

union. This distance is 1 minus this similarity sscore. Indentical strings have a Jacard distance of 0, com-

pletely dissimilar strings have a score of 1. The two inputs must be of the same size.

OpenBSD February 14, 2004 2

DISTANCE ( 3 ) OpenBSD Programmer's Manual DISTANCE ( 3 )

The Minkowski distance between two strings is the geometric distance between two inputs and uses a variable scaling factor, is 2 it yields the Euclidian distance between two strings. When this value grows, the value of the Minkowski distance tends towards a Chebyshev result. Therefore by inceasing power, one can place more numerical value on the largest distance (in terms of elements in the two vectors in question). Two macros are provided for these modifications to the Minkowski distance, power( ) and EUCLID_D().MANHATTAN_DA disadvantage of the Minkowski method is that if one element in the vectors has a wider range than the other elements then that large range may 'dilute' the distances of the small-range elements. |

Each fuction returns the calculated distance between the two inputs. A distance of 0 indicates that the

strings are the same. A distance less than 0 indicates that an error has occurred. For the Hamming and

Jaccard distances, this is because the two inputs are of different sizes.

The ` minkowski_d`( ) implementation is possibly broken, as are the

Lorenzo Seidenari wrote the Levenshtein distance implementation. Jose Nazario <jose@monkey.org> wrote the implementations of the Damerau, Hamming and Jaccard distance algorithms along with the Needleman-Wunsch Levenshtein distance implementation here. Evan Cooke adapted the adler32 routine from Jean-loup Gailly and Mark Adler used in the |

V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals", *Doklady
Akademii Nauk SSSR*, 4, 163, 845-848, 1965.

Fred J. Damerau, "A technique for computer detection and correction of spelling A0", *Communications of*

S. Burton
R. P. |
S. B. Needleman, and C. D. Wunsch, "A general method applicable to the search for similarities A0 0A1 Burton Bloom, "Space/time trade-offs in hash coding with allowable errors", R. W. Hamming, "Error Detecting and Error Correcting Codes", P. Jaccard, "Etude comparative de la distribution florale dans une portion Etude0 0Etude1 1Etude2 |
3 |

generated using groff -S -Wall -Mtty-char -Thtml -mandoc distance.3