Indistinct Comparisonby: G.E. Ozz Nixon Jr.
Published: October 2007
Copyright 2009 by Friends of FPC
Data Searching is a very complex art in itself. Ignoring the long history of Internet "search engines" growing from Archie in 1992 to today's Google engine. Original techniques to provide users with "full text" search required a solution designed for common mistakes, one of which is called Indistinct Matching.
In the mathematical discipline of graph theory a matching or edge-independent set in a graph is a set of edges without common vertices. When applied to words, this theory is capable of calculating words which sound a like but are spelled differently. Unfortunately, it is on a rare occasion you will get the results with words that you would expect when running this theory against graphs.
Code snippet from Brain Patchwork DX, LLC.
Type TRetCount = packed record lngSubRows : Word; lngCountLike : Word; end; function IndistinctMatching(MaxMatching : Integer; strInputMatching: PChar; strInputStandart: PChar): Integer; Var gret,tret:TRetCount; lngCurLen:Integer; begin If (MaxMatching = 0) Or (strInputMatching^ = #0) Or (strInputStandart^ = #0) Then begin IndistinctMatching:= 0; exit; end; gret.lngCountLike:= 0; gret.lngSubRows := 0; For lngCurLen:= 1 To MaxMatching do begin tret:= Matching(strInputMatching, strInputStandart, lngCurLen); gret.lngCountLike := gret.lngCountLike + tret.lngCountLike; gret.lngSubRows := gret.lngSubRows + tret.lngSubRows; tret:= Matching(strInputStandart, strInputMatching, lngCurLen); gret.lngCountLike := gret.lngCountLike + tret.lngCountLike; gret.lngSubRows := gret.lngSubRows + tret.lngSubRows; end; If gret.lngSubRows = 0 Then begin IndistinctMatching:= 0; exit; end; IndistinctMatching:= Trunc((gret.lngCountLike / gret.lngSubRows) * 100); end;
Using this algorithm, both "Robert" and "Rupert" return "42%" the same, while "Rubin" and "Robin" yields "49%". "Ashcroft" and "Ashcraft" yields "57%". However, "George" and "Jorge" yields "60%", and "Dave" and "David" yields "49%" the same.
In this sample, we see another way of viewing the differences between words. In this technique, there is a higher penalty for words that sound a like but are not spelled a like. However, notice that George and Jorge scored the highest due to the fact that 4 of the 5 "sides" are identical, only the first "side/sides" were different. Yet, for a pattern that differs in the middle or even the end - the score drops. So, this theory is useful for pattern matches that probably start differently. Please continue reading the other algorithms I use in my search engines.
G.E. Ozz Nixon Jr.
These are popular related words: