
Thursday, March 24, 2011

Fuzzy string search

Intro

Fuzzy search algorithms (also known as similarity search algorithms) are the basis of spell-checkers and of full-fledged search engines like Google or Yandex, where they are used, for example, to provide the "Did you mean ..." function.

In this article I'll discuss the following concepts, methods and algorithms:
  • Levenshtein distance
  • Damerau-Levenshtein distance
  • Bitap algorithm with modifications by Wu and Manber
  • Spell-checker method
  • N-gram method
  • Signature hashing method
  • BK-trees
I'll also perform a comparative test of the quality and efficiency of algorithms.

So...

Fuzzy search is a very useful feature of any search engine. However, its effective implementation is much more complicated than implementing a simple search for an exact match.

The problem of fuzzy string searching can be formulated as follows:
"Find in the text or dictionary of size n all the words that match the given word (or start with the given word), taking into account k possible differences (errors)."

For example, if you search for "machine" allowing two possible errors, you would find the words "marine", "lachine", "martine", and so on.

Fuzzy search algorithms are characterized by a metric: a function of the distance between two words that provides a measure of their similarity. A strict mathematical definition of a metric includes the requirement to satisfy the triangle inequality (X is a set of words, p is the metric):

p(x, z) ≤ p(x, y) + p(y, z) for any x, y, z ∈ X

Meanwhile, in most cases "metric" is understood as a more general concept that does not have to meet the condition above; such a concept may also be called a distance.

Among the most well-known metrics are Hamming, Levenshtein and Damerau-Levenshtein distances. Note that the Hamming distance is a metric only on a set of words of equal length, and that greatly limits the scope of its application.

In practice, however, the Hamming distance is nearly useless, yielding to metrics that are more natural from the human point of view, which will be discussed below.

Levenshtein distance

The Levenshtein distance, also known as "edit distance", is the most commonly used metric, and algorithms for computing it can be found at every turn.
Nevertheless, it is necessary to make some comments about the most popular algorithm of calculation, the Wagner-Fischer method.
The original version of this algorithm has time complexity O(mn) and consumes O(mn) memory, where m and n are the lengths of the compared strings. The whole process can be represented by the following matrix:

Levenshtein distance matrix

If you look at the algorithm's work process, it is easy to see that at each step only the last two rows of the matrix are used, so memory consumption can be reduced to O(min(m, n)).

Levenshtein algorithm's work process

But that's not all: the algorithm can be optimized further if no more than k differences need to be found. In this case it suffices to calculate only a diagonal band of width 2k+1 in the matrix (the Ukkonen cut-off), which reduces the time complexity to O(k·min(m, n)).
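
To make these optimizations concrete, here is a minimal Java sketch (an illustration, not the code from the repository linked below): it keeps only two rows and, instead of the full diagonal band, uses a simpler row-minimum cutoff that returns k + 1 as soon as the distance is known to exceed k.

    // Levenshtein distance with the two-row optimization and a cutoff at k:
    // returns the exact distance if it is <= k, otherwise k + 1.
    static int levenshtein(String a, String b, int k) {
        if (Math.abs(a.length() - b.length()) > k) return k + 1;
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            int rowMin = cur[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1,  // insertion
                                           prev[j] + 1),    // deletion
                                  prev[j - 1] + cost);      // substitution or match
                rowMin = Math.min(rowMin, cur[j]);
            }
            if (rowMin > k) return k + 1; // cutoff: values only grow from here on
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }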

Prefix distance
Usually it is necessary to calculate the distance between a prefix pattern and a string, i.e. to find the distance between the specified prefix and the nearest prefix of the string. In this case, you must take the smallest of the distances from the prefix pattern to all prefixes of the string. Obviously, the prefix distance cannot be considered a metric in the strict mathematical sense, which limits its application.

Often, the specific value of the distance matters less than the fact of whether it exceeds a certain value.

Damerau-Levenshtein distance

This variation adds one more rule to the definition of the Levenshtein distance: a transposition of two adjacent letters is also counted as one operation, along with insertions, deletions, and substitutions.
In his time, Frederick Damerau showed that most typing errors are transpositions. Therefore, this metric gives the best results in practice.

Damerau-Levenshtein algorithm's work process

To calculate this distance, it suffices to slightly modify the regular Levenshtein algorithm as follows: hold not two, but the last three rows, and add an appropriate additional condition: in the case of a transposition, take its cost into account.
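
For illustration, a minimal sketch of this modification (the restricted "optimal string alignment" variant commonly used in practice; for brevity it keeps the whole matrix rather than just three rows):

    static int damerauLevenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
                // the additional condition: adjacent transposition as one operation
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1))
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
            }
        return d[a.length()][b.length()];
    }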

In addition to the above, there are many other distances that are sometimes used in practice, such as the Jaro-Winkler metric; many of them are available in the SimMetrics and SecondString libraries.

Fuzzy search algorithms without indexing (Online)

These algorithms are designed to search against previously unknown text and can be used, for example, in a text editor, a document viewer, or a web browser to search within a page. They do not require text pre-processing and can operate on a continuous stream of data.

Linear search

A simple one-by-one computation of the metric (e.g., the Levenshtein metric) for the words of the input text. When a limited version of the metric is used, this method achieves optimal speed.
But at the same time, the greater k is, the more the running time grows. The asymptotic time complexity is O(kn).
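
As a sketch, using the limited levenshtein(a, b, k) from the section above (which returns k + 1 on overflow):

    static java.util.List<String> linearSearch(Iterable<String> words, String query, int k) {
        java.util.List<String> result = new java.util.ArrayList<>();
        for (String w : words)
            if (levenshtein(w, query, k) <= k) // metric limitation gives the early exit
                result.add(w);
        return result;
    }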

Bitap (also known as Shift-Or or Baeza-Yates-Gonnet, and its modifications by Wu and Manber)

The Bitap algorithm and its various modifications are most often used for fuzzy search without indexing. A variation of it is used, for example, in the Unix utility agrep, which functions like the standard grep but supports errors in the search query and even provides a limited ability to use regular expressions.

The idea of this algorithm was first proposed by Ricardo Baeza-Yates and Gaston Gonnet; the relevant article was published in 1992.
The original version of the algorithm deals only with letter substitutions and, in fact, computes the Hamming distance. But later Sun Wu and Udi Manber suggested a modification of this algorithm for computing the Levenshtein distance, i.e. added support for insertions and deletions, and developed the first version of the agrep utility.

For exact matching (k = 0), each step of the algorithm reduces to a bitshift and applying the letter mask:

    R[0][j] = ((R[0][j-1] << 1) | 1) & s(x[j])

For matching with at most k errors, the result vector combines the bitshift term with terms that account for insertions, deletions and substitutions:

    Bitshift (match): ((R[k][j-1] << 1) | 1) & s(x[j])
    Insertions:       R[k-1][j-1]
    Deletions:        R[k-1][j] << 1
    Substitutions:    R[k-1][j-1] << 1
    Result value:     R[k][j] = Bitshift | Insertions | Deletions | Substitutions

Here k is the error count, j is the index of the current letter of the text, and s(x) is the letter mask (the mask has one-bits at positions corresponding to the positions of this letter in the query).
A match or mismatch of the query is determined by the last bit of the result vector R.

The high speed of this algorithm is ensured by bit parallelism: calculations can be performed on 32 or more bits in a single operation.
A trivial implementation therefore supports searching for words shorter than 32 symbols; this restriction is caused by the width of the standard int type (on 32-bit architectures). Wider types can be used, but their usage may slow the algorithm down.

Despite the fact that the asymptotic time O(kn) of this algorithm equals that of the linear method, it is much faster when the query is long and the number of errors k is more than 2.
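
A minimal Java sketch of Bitap with the Wu-Manber extensions, written in the shift-and convention (one-bits mark matching prefixes, consistent with the mask description above); this is an illustration rather than the article's code, and it reports the end positions of matches with at most k errors:

    import java.util.HashMap;
    import java.util.Map;

    class Bitap {
        // Prints end positions in `text` where `pattern` matches with <= k errors.
        // A long gives room for patterns of up to 63 letters.
        static void search(String text, String pattern, int k) {
            int m = pattern.length();
            if (m == 0 || m > 63) throw new IllegalArgumentException();

            // Letter masks: bit i is set if pattern.charAt(i) == c.
            Map<Character, Long> masks = new HashMap<>();
            for (int i = 0; i < m; i++)
                masks.merge(pattern.charAt(i), 1L << i, (x, y) -> x | y);

            // R[d]: bit i set <=> pattern prefix of length i+1 matches with <= d errors.
            // Initially a prefix of length d matches the empty string via d deletions.
            long[] R = new long[k + 1];
            for (int d = 0; d <= k; d++) R[d] = (1L << d) - 1;

            long accept = 1L << (m - 1);
            for (int j = 0; j < text.length(); j++) {
                long mask = masks.getOrDefault(text.charAt(j), 0L);
                long prevBelow = R[0];                   // old R[d-1]
                R[0] = ((R[0] << 1) | 1) & mask;         // exact row: bitshift only
                for (int d = 1; d <= k; d++) {
                    long old = R[d];
                    R[d] = (((old << 1) | 1) & mask)     // bitshift (match)
                         | prevBelow                     // insertion
                         | (prevBelow << 1)              // substitution
                         | (R[d - 1] << 1)               // deletion
                         | 1;
                    prevBelow = old;
                }
                if ((R[k] & accept) != 0)
                    System.out.println("match ending at index " + j);
            }
        }
    }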

Testing

Testing was performed on a text of 3.2 million words, with an average word length of 10.

Exact search
Search time: 3562 ms

Linear search using Levenshtein metric
Search time with k = 2: 5728 ms
Search time with k = 5: 8385 ms

Bitap with Wu-Manber modifications search
Search time with k = 2: 5499 ms
Search time with k = 5: 5928 ms

It is obvious that a simple scan using the metric, in contrast to the Bitap algorithm, depends strongly on the number of errors k.

At the same time, if we need to search repeatedly in a constant large text, the search time can be greatly reduced by preprocessing the text (indexing).

Fuzzy search algorithms with indexing (Offline)

A common feature of all fuzzy search algorithms with indexing is that the index is built over a dictionary compiled from the original text or from a list of records in a database.

These algorithms use different approaches to solve the problem: some reduce it to exact search, others use properties of the metric to construct various spatial structures, and so on.

At the first step, we construct from the original text a dictionary that contains words and their positions in the text. It is also possible to compute frequencies of words and phrases to improve search results.

It is assumed that the index, as well as the dictionary, is entirely loaded into memory.

Dictionary specifications:
  • Source text — 8.2 Gb of Moshkow's library (lib.ru), 680 million words;
  • Dictionary size — 65 Mb;
  • Word count — 3.2 million;
  • Average word length — 9.5 letters;
  • Root-mean-square word length — 10.0 letters;
  • Alphabet — Russian (32 letters); words that contain characters not in the alphabet are not included in the dictionary.
The dependence of dictionary size on text size is not strictly linear: up to a certain volume a base word set is formed, ranging from 15% at 500 thousand words to 5% at 5 million words; then the dependence approaches linear, slowly decreasing and reaching 0.5% at 680 million words. The subsequent growth is provided by rare words.

Growth of dictionary size

Spell-checker method

As the name implies, this algorithm is often used in spelling correction systems (i.e., spell-checker systems), when the size of the dictionary is small, or when speed is not the main criterion.
It is based on reducing the problem of fuzzy search to a problem of exact search.

A set of "mistaken" words is built from the original query. Then every word from this set is looked up in the dictionary for an exact match.

Spell-checker method

Its time complexity strongly depends on the number of errors k and on the alphabet size |A|; with a binary search over the dictionary it is roughly:

    O((m |A|)^k · log n)

where m is the length of the query word.

For example, for error count k = 1 and a word of length 7 over the English alphabet, the set of misspelled words contains about 370 words, so we need to make about 370 dictionary queries, which is quite acceptable.
But already at k = 2 the size of this set exceeds 75,000 words, which corresponds to a full scan of a small dictionary, so the running time becomes rather large. And we should not forget that for each such word an exact-match search in the dictionary is still required.
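
To make the idea concrete, a sketch of generating the candidate set for k = 1 (the Damerau-Levenshtein neighborhood); each generated word is then looked up in the dictionary exactly:

    import java.util.HashSet;
    import java.util.Set;

    // All variants of a word at Damerau-Levenshtein distance 1.
    static Set<String> edits1(String w, char[] alphabet) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i < w.length(); i++)            // deletions
            result.add(w.substring(0, i) + w.substring(i + 1));
        for (int i = 0; i + 1 < w.length(); i++)        // transpositions
            result.add(w.substring(0, i) + w.charAt(i + 1)
                     + w.charAt(i) + w.substring(i + 2));
        for (int i = 0; i < w.length(); i++)            // substitutions
            for (char c : alphabet)
                result.add(w.substring(0, i) + c + w.substring(i + 1));
        for (int i = 0; i <= w.length(); i++)           // insertions
            for (char c : alphabet)
                result.add(w.substring(0, i) + c + w.substring(i));
        return result;
    }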

Properties:
The algorithm can easily be modified to generate "mistaken" words using arbitrary rules, and, moreover, it does not require any dictionary preprocessing or additional memory.

Possible improvements:
We can generate not the whole set of "mistaken" words, but only those that are most likely to occur in real situations, e.g. words with common spelling mistakes or typos.

N-gram method

This method was invented long ago and is the most widely used, since its implementation is relatively simple and it provides reasonably good performance. The algorithm is based on the following principle:

"If the word A matches with the word B considering several errors, then they will most likely have at least one common substring of length N".

These substrings of length N are called "N-grams".
At indexing time, a word is partitioned into N-grams and added to the lists corresponding to each of these N-grams. At search time, the query is also partitioned into N-grams, and for each of them the corresponding lists are scanned using the metric.

N-gram method

The most frequently used in practice are trigrams: substrings of length 3. Choosing a larger value of N leads to a limitation on the minimum word length at which error detection is still possible.

Properties:
The N-gram algorithm does not find all possible spelling errors. Take, for instance, the word VOTKA (which must be corrected to VODKA, of course) and divide it into trigrams: VOTKA → VOT OTK TKA. All of these trigrams contain the error T. Thus the word VODKA will not be found, since it shares none of the query's trigrams and therefore does not occur in their lists. The shorter the word and the more errors it contains, the higher the chance that it will be absent from the lists corresponding to the query's N-grams and will not appear in the result set.

Meanwhile, the N-gram method leaves ample scope for using custom metrics with arbitrary properties and complexity, but there remains the need to brute-force about 15% of the dictionary.

We can separate the N-gram hash tables by the position of the N-gram in the word (first modification, M1). Since the lengths of a word and of the query cannot differ by more than k, the positions of an N-gram in them cannot differ by more than k either. Thus we only need to check the table corresponding to the N-gram's position in the query, plus the k tables to the left and the k tables to the right, 2k+1 neighboring tables in total.

First modification of N-gram method

We can reduce the size of the scanned set even a bit further by separating the tables by word length and, similarly, scanning only the neighboring 2k+1 tables (second modification, M2).
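
A minimal sketch of a trigram index (names are illustrative; the modifications above would additionally key the tables by N-gram position or by word length):

    import java.util.*;

    class TrigramIndex {
        private final Map<String, List<String>> lists = new HashMap<>();

        void add(String word) {                  // indexing step
            for (String gram : grams(word))
                lists.computeIfAbsent(gram, g -> new ArrayList<>()).add(word);
        }

        // Candidates sharing at least one trigram with the query;
        // each candidate still has to be verified with the metric.
        Set<String> candidates(String query) {
            Set<String> result = new HashSet<>();
            for (String gram : grams(query))
                result.addAll(lists.getOrDefault(gram, Collections.emptyList()));
            return result;
        }

        private static List<String> grams(String w) {
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + 3 <= w.length(); i++)
                grams.add(w.substring(i, i + 3));
            return grams;
        }
    }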

Signature hashing

This algorithm is described in L. M. Boytsov's article "Signature hashing". It is based on a fairly obvious representation of the "structure" of the word as a bit word, used as a hash (signature) in the hash table.

At indexing time, such a hash is calculated for each word, and the word is added to the corresponding hash table row. During a search, the hash is computed for the query, and the set of adjacent hashes that differ from it in no more than k bits is generated. For each of these hashes, we iterate over the corresponding list of words using the metric.

The hash is computed as follows: each bit of the hash is matched to a group of characters of the alphabet. Bit 1 at position i in the hash means that the word contains a character of the i-th group of the alphabet. The order of the letters in the word does not matter at all.

Signature hashing
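
A sketch of this hash computation for a Latin alphabet (the assignment of letters to groups here, round-robin, is an arbitrary illustrative choice; the partition of the alphabet is a tunable parameter of the method):

    static final int BITS = 16; // hash width, an illustrative choice

    // Bit i is set if the word contains any letter of group i.
    static int signature(String word) {
        int hash = 0;
        for (char c : word.toCharArray())
            hash |= 1 << ((Character.toLowerCase(c) - 'a') % BITS);
        return hash;
    }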

Deleting a single character either does not change the hash value (if the word still contains characters of the same alphabet group), or resets the bit of the corresponding group to 0. An insertion, similarly, either sets a bit to 1 or changes nothing. A single character substitution is a bit more complicated: the hash either remains unchanged or changes in 1 or 2 bits. Transpositions do not change the hash at all, because the order of symbols does not matter for its construction, as noted earlier. Thus, to fully cover k errors, we need to vary at least 2k bits of the hash.

List of hashes with one error

The average time complexity with k "incomplete" errors (insertions, deletions and transpositions, as well as part of the substitutions) is:
Signature hashing time complexity

Properties:
Since the substitution of one character can change two bits of the hash at once, an algorithm that, for example, enumerates hashes differing in no more than 2 bits will not in reality return the full result set: a significant share (depending on the ratio of hash size to alphabet size) of the words with two substitutions will be missing. And the wider the hash, the more often a substitution changes two bits at once, and the more incomplete the returned set will be. In addition, this algorithm does not allow prefix search.

BK-trees

Burkhard-Keller trees are metric trees; algorithms for constructing such trees are based on the ability of the metric to satisfy the triangle inequality:

p(x, z) ≤ p(x, y) + p(y, z)

This property allows metrics to form metric spaces of arbitrary dimension. These metric spaces are not necessarily Euclidean; for example, the Levenshtein and Damerau-Levenshtein metrics form non-Euclidean spaces. Based on these properties, we can construct a data structure for searching in a metric space: the Burkhard-Keller tree.

BK-tree
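
A minimal sketch of a BK-tree with insertion and search (a compact Levenshtein function is inlined to keep the sketch self-contained):

    import java.util.*;

    class BkTree {
        private final String word;
        private final Map<Integer, BkTree> children = new HashMap<>();

        BkTree(String word) { this.word = word; }

        void add(String w) {
            int d = distance(word, w);
            if (d == 0) return;                      // already present
            BkTree child = children.get(d);
            if (child == null) children.put(d, new BkTree(w));
            else child.add(w);
        }

        // The triangle inequality allows visiting only edges in [d - k, d + k].
        void search(String query, int k, List<String> result) {
            int d = distance(word, query);
            if (d <= k) result.add(word);
            for (int i = Math.max(1, d - k); i <= d + k; i++) {
                BkTree child = children.get(i);
                if (child != null) child.search(query, k, result);
            }
        }

        private static int distance(String a, String b) {   // plain Levenshtein
            int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                cur[0] = i;
                for (int j = 1; j <= b.length(); j++)
                    cur[j] = Math.min(Math.min(prev[j] + 1, cur[j - 1] + 1),
                            prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
                int[] t = prev; prev = cur; cur = t;
            }
            return prev[b.length()];
        }
    }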

Improvements:
We can exploit the ability of some metrics to compute a distance with a cutoff: we set an upper limit equal to the sum of the maximum distance to the node's descendants and the resulting distance, which speeds up the process a bit:

BK-trees metric limitations

Testing

Testing was performed on a laptop with Intel Core Duo T2500 (2GHz/667MHz FSB/2MB), 2Gb RAM, OS — Ubuntu 10.10 Desktop i686, JRE — OpenJDK 6 Update 20.

Time comparison

Testing was performed using the Damerau-Levenshtein distance with error count k = 2. The index size is specified including the dictionary size (65 Mb).

Spell-checker method
Index size: 65 Mb
Search time: 320 ms / 330 ms
Result recall: 100%

N-gram (original)
Index size: 170 Mb
Index creation time: 32 s
Search time: 71 ms / 110 ms
Result recall: 65%

N-gram (first modification)
Index size: 170 Mb
Index creation time: 32 s
Search time: 39 ms / 46 ms
Result recall: 63%

N-gram (second modification)
Index size: 170 Mb
Index creation time: 32 s
Search time: 37 ms / 45 ms
Result recall: 62%

Signature hashing
Index size: 85 Mb
Index creation time: 0.6 s
Search time: 55 ms
Result recall: 56.5%

BK-trees
Index size: 150 Mb
Index creation time: 120 s
Search time: 540 ms
Result recall: 63%

Conclusion

Most fuzzy search algorithms with indexing are not truly sublinear (i.e., do not have an asymptotic time of O(log n) or below), and their running speed usually depends on n. Nevertheless, multiple enhancements and improvements make it possible to achieve sufficiently small running times even with very large dictionaries.

There is also a set of various, mostly inefficient, methods based on adapting techniques from other subject areas to the current one. Among them is the adaptation of the prefix tree (trie) to fuzzy search problems, which I left aside due to its low efficiency. But there are also algorithms based on original approaches, for example the Maass-Novak algorithm: it has sublinear asymptotic time, but is highly inefficient in practice because of the huge constants hidden behind its asymptotic estimate, which manifest themselves in a huge index.

The use of fuzzy search algorithms in real search engines is closely related to phonetic algorithms and lexical stemming algorithms, which extract the base part of different forms of the same word (such functionality is provided by Snowball, for example), as well as to statistics-based ranking and the use of complex, sophisticated metrics.

This link http://code.google.com/p/fuzzy-search-tools takes you to my Java implementations of the following stuff:
  • Levenshtein Distance (with cutoff and prefix version);
  • Damerau-Levenshtein Distance (with cutoff and prefix version);
  • Bitap (Shift-Or with Wu-Manber modifications);
  • Spell-checker Method;
  • N-Gram Method (with some modifications);
  • Signature Hashing Method;
  • Burkhard-Keller (BK) Trees.
I tried to make this code easy to understand, but effective enough for practical application. So enjoy.

It is worth noting that in the course of researching this subject I've done some work of my own, which allows the search time to be significantly reduced at the cost of a moderate increase in index size and some restrictions on the freedom of the metrics used. But that's another cool story.

Links:

  1. Java source codes for this article. http://code.google.com/p/fuzzy-search-tools
  2. Levenshtein distance. http://en.wikipedia.org/wiki/Levenshtein_distance
  3. Damerau-Levenshtein distance. http://en.wikipedia.org/wiki/Damerau–Levenshtein_distance
  4. Shift-Or description with Wu-Manber modifications, in German. http://de.wikipedia.org/wiki/Baeza-Yates-Gonnet-Algorithmus
  5. N-gram method. http://www.cs.helsinki.fi/u/ukkonen/TCS92.pdf
  6. Signature hashing. http://www.springerlink.com/content/1aahunb7n41j32ne/
  7. Signature hashing in Russian. http://itman.narod.ru/articles/rtf/confart.zip
  8. Information retrieval and fuzzy string searching. http://itman.narod.ru/english/index.htm
  9. Shift-Or and some other algorithms implementation. http://johannburkard.de/software/stringsearch/
  10. Fast Text Searching with Agrep (Wu & Manber). http://www.at.php.net/utils/admin-tools/agrep/agrep.ps.1
  11. Damn Cool Algorithms - Levenshtein automata, BK-tree, and some others. http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
  12. BK-tree Java implementation. http://code.google.com/p/java-bk-tree/
  13. Maass-Novak algorithm. http://yury.name/internet/09ia-seminar.ppt
  14. SimMetrics metric library. http://staffwww.dcs.shef.ac.uk/people/S.Chapman/simmetrics.html
  15. SecondString metric library. http://sourceforge.net/projects/secondstring/

Wednesday, March 23, 2011

Phonetic algorithms

Intro

A phonetic algorithm matches two words with similar pronunciation to the same code, which allows comparing and indexing sets of words based on their phonetic similarity.

It is often quite difficult to find an atypical name (or surname) in a database, for example:
— Hey, John, look for Adolf Schwarzenegger.
— Adolf Shwardseneger? There is no such person!
In this case, the use of phonetic algorithms (especially in combination with fuzzy matching algorithms) can significantly simplify the problem.

These algorithms are very useful for searching in lists of people in databases, as well as for use in spell checkers. Combined with fuzzy search algorithms, they provide users with a handy search by name and surname in databases, lists of people and so on.

In this article I will discuss the most well-known algorithms, such as Soundex, Daitch-Mokotoff Soundex, NYSIIS, Metaphone, Double Metaphone, Caverphone.

Soundex

One of the first such algorithms was Soundex, invented in the 1910s by Robert Russell. This algorithm (in its American version) maps words to a code like A126. Its working principle is based on partitioning the consonants into groups with ordinal numbers, from which the resulting value is compiled. Several improvements were suggested later.

1 — B, F, P, V
2 — C, G, J, K, Q, S, X, Z
3 — D, T
4 — L
5 — M, N
6 — R

The first letter is kept; subsequent letters are mapped to digits by the table above. Letters that are not listed in the table (all the vowels and some consonants) are ignored. Adjacent letters from the same group, as well as letters from the same group separated by H or W, are coded as one letter. The result is truncated to 4 characters; missing positions are filled with zeros. It is easy to see that after all these procedures only about 7,000 different code values remain, so many unlike words end up with the same Soundex code. Thus, in most cases the result will contain a large number of "false positive" values.
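
A minimal sketch of the original Soundex (it omits the H/W special case for brevity; see the Apache Commons Codec implementation linked below for a complete version):

    // Digit codes for A..Z; '0' marks vowels and ignored consonants.
    static final String CODES = "01230120022455012623010202";

    static String soundex(String name) {
        String s = name.toUpperCase();
        StringBuilder sb = new StringBuilder().append(s.charAt(0));
        char prev = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && sb.length() < 4; i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') continue;
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != prev) sb.append(code); // collapse duplicates
            prev = code;                                      // vowels separate duplicates
        }
        while (sb.length() < 4) sb.append('0');               // pad with zeros
        return sb.toString();
    }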

1 — B, P
2 — F, V
3 — C, K, S
4 — G, J
5 — Q, X, Z
6 — D, T
7 — L
8 — M, N
9 — R

As you can see, in the refined Soundex the letters are divided into more groups. In addition, there are no special cases for H and W; they are simply ignored. The result is also not truncated, so the code does not have a fixed length.

Examples
Original Soundex:
F234 → Fusedale.
G535 → Genthner, Gentner, Gianettini, Gunton.
G640 → Garlee, Garley, Garwell, Garwill, Gerrell, Gerrill, Giral, Gorelli, Gorioli, Gourlay, Gourley, Gourlie, Graal, Grahl, Grayley, Grealey, Greally, Grealy, Grioli, Groll, Grolle, Guerola, Gurley.
H326 → Hadcroft, Hadgraft, Hatchard, Hatcher, Hatzar, Hedger, Hitscher, Hodcroft, Hutchcraft.
P630 → Parade, Pardew, Pardey, Pardi, Pardie, Pardoe, Pardue, Pardy, Parradye, Parratt, Parrett, Parrot, Parrott, Pearde, Peart, Peaurt, Peert, Perdue, Peret, Perett, Perot, Perott, Perotti, Perrat, Perrett, Perritt, Perrot, Perrott, Pert, Perutto, Pirdue, Pirdy, Pirot, Pirouet, Pirt, Porrett, Porritt, Port, Porte, Portt, Prate, Prati, Pratt, Pratte, Pratty, Preddy, Preedy, Preto, Pretti, Pretty, Prewett, Priddey, Priddie, Priddy, Pride, Pridie, Pritty, Prott, Proud, Prout, Pryde, Prydie, Purdey, Purdie, Purdy.

Refined Soundex:
B1905 → Braz, Broz
C30908 → Caren, Caron, Carren, Charon, Corain, Coram, Corran, Corrin, Corwin, Curran, Curreen, Currin, Currom, Currum, Curwen
H093 → Hairs, Hark, Hars, Hayers, Heers, Hiers
L7081096 → Lambard, Lambart, Lambert, Lambird, Lampaert, Lampard, Lampart, Lamperd, Lampert, Lamport, Limbert, Lombard
N807608 → Nolton, Noulton

On average, 21 surnames are matched to one Soundex code value. With the refined version of Soundex, 2-3 surnames match the same code.
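
The phonetic algorithms discussed in this article are also available ready-made in the Apache Commons Codec library (linked in the references below); a usage sketch, with the expected outputs taken from the examples in this article:

    import org.apache.commons.codec.language.DoubleMetaphone;
    import org.apache.commons.codec.language.Metaphone;
    import org.apache.commons.codec.language.RefinedSoundex;
    import org.apache.commons.codec.language.Soundex;

    public class PhoneticDemo {
        public static void main(String[] args) {
            System.out.println(new Soundex().soundex("Gunton"));        // G535
            System.out.println(new RefinedSoundex().soundex("Braz"));   // B1905
            System.out.println(new Metaphone().metaphone("Fishpool"));  // FXPL
            DoubleMetaphone dm = new DoubleMetaphone();
            System.out.println(dm.doubleMetaphone("Slaughter"));        // SLFT
            System.out.println(dm.doubleMetaphone("Slaughter", true));  // XLFT (alternate)
        }
    }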

NYSIIS

Developed in 1970 as part of the "New York State Identification and Intelligence System", this algorithm gives better results than the original Soundex by using more sophisticated rules for transforming the original word into the result code. The algorithm is designed to work specifically with American names.

NYSIIS computation algorithm
  1. Transform the beginning of the word using the following rules:
    MAC → MCC
    KN → N
    K → C
    PH, PF → FF
    SCH → SSS
  2. Transform the ending of the word using the following rules:
    EE → Y
    IE → Y
    DT, RT, RD, NT, ND → D
  3. Transform all letters except first using the rules below:
    EV → AF
    A, E, I, O, U → A
    Q → G
    Z → S
    M → N
    KN → N
    K → C
    SCH → SSS
    PH → FF
    After a vowel: remove H and transform W → A
  4. Remove S at the end
  5. Transform AY at the end → Y
  6. Remove A at the end
  7. Truncate word to 6 letters (optional).
Examples
DAGAL → Diggell, Dougal, Doughill, Dougill, Dowgill, Dugall, Dugall.
GLAND → Glinde.
PLANRAG → Plumridge.
SANAC → Chinnick, Chinnock, Chinnock, Chomicki, Chomicz, Schimek, Shimuk, Simak, Simek, Simic, Sinnock, Sinnocke, Sunnex, Sunnucks, Sunock.
WABARLY → Webberley, Wibberley.

NYSIIS matches a bit more than two surnames to the same code on average.

Daitch-Mokotoff Soundex

This algorithm was developed by two genealogists, Gary Mokotoff and Randy Daitch, in 1985. They tried to achieve the best results with Eastern European (including Russian and Jewish) surnames.
This algorithm has little in common with the original Soundex, except that the result is still a sequence of digits; however, the first letter is now also encoded as a digit.

It has much more complicated conversion rules: the calculation of the resulting code involves not only single characters, but also groups of characters. In addition, a result of the form 023689 provides about 600,000 different code values. This, coupled with the complex rules, reduces the number of "false positive" words in the result set.

Transformations are performed using the table below; the order of application is the order of the letter combinations in the table ("—" means the combination is not coded in that position).

Letter combination | At the start | Before a vowel | Other
AI, AJ, AY, EI, EY, EJ, OI, OJ, OY, UI, UJ, UY | 0 | 1 | —
AU | 0 | 7 | —
IA, IE, IO, IU | 1 | — | —
EU | 1 | 1 | —
A, UE, E, I, O, U, Y | 0 | — | —
J | 1 | 1 | 1
SCHTSCH, SCHTSH, SCHTCH, SHTCH, SHCH, SHTSH, STCH, STSCH, STRZ, STRS, STSH, SZCZ, SZCS | 2 | 4 | 4
SHT, SCHT, SCHD, ST, SZT, SHD, SZD, SD | 2 | 43 | 43
CSZ, CZS, CS, CZ, DRZ, DRS, DSH, DS, DZH, DZS, DZ, TRZ, TRS, TRCH, TSH, TTSZ, TTZ, TZS, TSZ, SZ, TTCH, TCH, TTSCH, ZSCH, ZHSH, SCH, SH, TTS, TC, TS, TZ, ZH, ZS | 4 | 4 | 4
SC | 2 | 4 | 4
DT, D, TH, T | 3 | 3 | 3
CHS, KS, X | 5 | 54 | 54
S, Z | 4 | 4 | 4
CH, CK, C, G, KH, K, Q | 5 | 5 | 5
MN, NM | 66 | 66 | 66
M, N | 6 | 6 | 6
FB, B, PH, PF, F, P, V, W | 7 | 7 | 7
H | 5 | 5 | —
L | 8 | 8 | 8
R | 9 | 9 | 9
"Alternative" variants of letter combinations (used for multiple alternative codes generation):
CH → TCH
CK → TSK
C → TZ
J → DZH

Examples
147740 → Iozefovich, Jacobovitch, Jacobovitz, Jacobowicz, Jacobowits, Jacobowitz, Josefovic, Josefowicz, Josifovic, Josifovitz, Josipovic, Josipovitz, Josofovitz, Jozefowicz, Yezafovich.
147750 → Iozefovich, Josefovic, Josifovic, Josipovic, Yezafovich.
345750 → Dashkovich, Djakovic, Djekovic, Djokovic.
783940 → Baldrick, Fielders, Walters, Weldrick, Wolters.
783950 → Baldrick, Weldrake, Weldrick, Welldrake.
964660 → Runchman, Runcieman, Runciman.
965660 → Runchman, Runcieman, Runciman.
596490 → Grancher, Greenacre.
596590 → Gehringer, Grainger, Grancher, Granger, Grangier, Greenacre, Grunguer.

This algorithm maps about 5 surnames to the same code.

Subsequently, Alexander Beider and Stephen Morse developed the Beider-Morse Name Matching Algorithm, aimed at reducing the number of "false positive" values that Daitch-Mokotoff Soundex produces with Jewish (Ashkenazic) surnames.

Metaphone

The Metaphone (1990) algorithm is somewhat more effective. It takes a different approach to encoding: it transforms the original word using English pronunciation rules, so its conversion rules are much more complicated. The algorithm loses much less information, since the letters are not divided into groups. The final code is a set of characters from 0BFHJKLMNPRSTWXY; at the beginning of a word, vowels from the set AEIOU can also appear.

Metaphone code computation algorithm
  1. Remove all repeating neighboring letters except letter C.
  2. The beginning of the word should be transformed using the following rules:
    KN → N
    GN → N
    PN → N
    AE → E
    WR → R
  3. Remove B letter at the end, if it is after M letter.
  4. Replace C using the rules below:
    With X: CIA → XIA, SCH → SKH, CH → XH
    With S: CI → SI, CE → SE, CY → SY
    With K: C → K
  5. Replace D using the following rules:
    With J: DGE → JGE, DGY → JGY, DGI → JGI
    With T: D → T
  6. Replace GH → H, except when it is at the end or before a vowel.
  7. Replace GN → N and GNED → NED, if they are at the end.
  8. Replace G using the following rules
    With J: GI → JI, GE → JE, GY → JY
    With K: G → K
  9. Remove all H after a vowel but not before a vowel.
  10. Perform following transformations using the rules below:
    CK → K
    PH → F
    Q → K
    V → F
    Z → S
  11. Replace S with X:
    SH → XH
    SIO → XIO
    SIA → XIA
  12. Replace T using the following rules
    With X: TIA → XIA, TIO → XIO
    With 0: TH → 0
    Remove: TCH → CH
  13. Transform WH → W at the beginning. Remove W if there is no vowel after it.
  14. If X is at the beginning, then replace X → S, else replace X → KS
  15. Remove all Y which are not before a vowel.
  16. Remove all vowels except vowel at the start of the word.
Examples
FXPL → Fishpool, Fishpoole.
JLTL → Gellately, Gelletly.
LWRS → Lowers, Lowerson.
MLBR → Mallabar, Melbert, Melbourn, Melbourne, Melburg, Melbury, Milberry, Milborn, Milbourn, Milbourne, Milburn, Milburne, Millberg, Mulberry, Mulbery, Mulbry.
SP → Saipy, Sapey, Sapp, Sappy, Sepey, Seppey, Sopp, Zoppie, Zoppo, Zupa, Zupo, Zuppa.

On average, the same code is matched to 6 surnames.

Double Metaphone

Double Metaphone (2000) differs a bit from the other phonetic algorithms: it generates two code values from the original word (both up to 4 characters long), one reflecting the basic version of the word's pronunciation, the other an alternative version. It has a large number of rules that take into account the origin of words, with attention to Eastern European, Italian, Chinese and other words. The transformation rules are numerous enough that anyone interested should read about this algorithm in the Dr Dobbs article.

Examples
APLF → Abelevitz, Abelov, Abelovitz, Abilowitz, Appleford.
APLT → Abelwhite, Abilowitz, Ablett, Ablewhite, Ablitt, Ablott, Appleton, Applewhaite, Applewhite, Epelett, Epilet, Eplate, Eplett, Euplate, Ipplett.
LPS → Labbez, Labes, Libbis, Llopis, Lopes, Lopez.
MKFX → Mackiewicz.
MKTS → Mackiewicz, Mccuthais, Mecozzi.
MTF → Mateev, Mathew, Mattevi, Mattheeuw, Matthew, Middiff.
M0 → Maith, Mathe, Mathew, Mathey, Mathie, Mathieu, Mathou, Mathy, Matthai, Mattheeuw, Matthew, Matthiae, Meth, Moth, Mouth.
SLFT → Salvador, Salvadore, Salvadori, Salvati, Salvatore, Slaughter.
XLFT → Slaughter.

Double Metaphone matches 8 or 9 surnames on average to the same code.

Caverphone

The Caverphone algorithm was developed in 2002 as part of a New Zealand project to match data between old and new electoral lists; it is therefore most focused on New Zealand pronunciation, but for foreign surnames it gives quite acceptable results.

Caverphone code computation algorithm
  1. Convert the first or last name to lowercase (the algorithm is case sensitive).
  2. Remove the letter e at the end.
  3. Transform the beginning of the word using the table below (this is relevant only for New Zealand names). Here the digit 2 is a temporary placeholder for a consonant, which will subsequently be removed.
    cough → cou2f, rough → rou2f, tough → tou2f, enough → enou2f, gn → 2n, mb → m2
  4. Replace the letters using the table below:
    cq → 2q, ci → si, ce → se, cy → sy, tch → 2ch, c → k, q → k, x → k, v → f,
    dg → 2g, tio → sio, tia → sia, d → t, ph → fh, b → p, sh → s2, z → s
  5. Replace a vowel at the beginning of the word with A; replace all other vowels with 3. The digit 3 is thus a temporary placeholder for a vowel, which is used in the subsequent transformations and then removed. Next, apply the replacements from the following tables (legend: s+ - a group of consecutive identical letters, ^h - a letter at the start of the word, w$ - a letter at the end of the word):
    j → y, ^y3 → Y3, ^y → A, y → 3, 3gh3 → 3kh3, gh → 22, g → k,
    s+ → S, t+ → T, p+ → P, k+ → K, f+ → F, m+ → M,
    n+ → N, w3 → W3, wh3 → Wh3, w$ → 3, w → 2, ^h → A, h → 2,
    r3 → R3, r$ → 3, r → 2, l3 → L3, l$ → 3, l → 2
  6. Remove all digits 2. If there is a digit 3 at the end, replace it with A; after that remove all digits 3.
  7. Truncate the word to 10 letters or fill it to 10 letters with digit 1.
Examples
ANRKSN1111 → Henrichsen, Henricsson, Henriksson, Hinrichsen.
ASKKA11111 → Izchaki.
MKLFTA1111 → Maclaverty, Mccleverty, Mcclifferty, Mclafferty, Mclaverty.
SLKMP11111 → Slocomb, Slocombe, Slocumb.
WTLM111111 → Whitlam.

Caverphone matches about 4-5 surnames to the same code.

Conclusion

Most of these algorithms are implemented in a variety of languages, including C, C++, Java, C# and PHP. Some of them, like Soundex and Metaphone, are integrated into or available as plug-ins for many popular databases, and are also used in full-fledged search engines (for example, Apache Lucene). Their application is rather specific, because they are mostly suitable for surnames.

Links

  1. Java source code for this article. Fuzzy-search-tools project on Google Code
  2. Soundex, Refined Soundex, Metaphone, Double Metaphone and Caverphone Java implementations.
    Apache Commons Codec
  3. NYSIIS Java implementation. Egothor project
  4. Daitch-Mokotoff Soundex Java implementation. http://joshualevy.tripod.com/genealogy/dmsoundex/dmsoundex.zip
  5. Soundex description. http://en.wikipedia.org/wiki/Soundex
  6. Daitch-Mokotoff Soundex description. http://www.jewishgen.org/infofiles/soundex.html
  7. NYSIIS description.
    http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
  8. Metaphone description. http://en.wikipedia.org/wiki/Metaphone
  9. Double Metaphone description. http://www.drdobbs.com/article/184401251
  10. Russian Metaphone description. http://forum.aeroion.ru/topic461.html
  11. Caverphone description. http://en.wikipedia.org/wiki/Caverphone
  12. Soundex online demo. http://www.gedpage.com/soundex.html
  13. NYSIIS online demo. http://www.dropby.com/NYSIIS.html
  14. Daitch-Mokotoff Soundex online demo. http://stevemorse.org/census/soundex.html
  15. Metaphone online demo. http://www.searchforancestors.com/utility/metaphone.php
 