Jump to content

String metric: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
(5 intermediate revisions by 5 users not shown)
Line 1: Line 1:
{{redirect|String distance|the distance between strings and the fingerboard in musical instruments|Action (music)}}
{{redirect|String distance|the distance between strings and the fingerboard in musical instruments|Action (music)}}


In [[mathematics]] and [[computer science]], a '''string metric''' (also known as a '''string similarity metric''' or '''string distance function''') is a [[metric (mathematics)|metric]] that measures [[distance]] ("inverse similarity") between two [[string (computer science)|text strings]] for [[approximate string matching]] or comparison and in [[approximate string matching|fuzzy string searching]]. A requirement for a string ''metric'' (e.g. in contrast to [[string matching]]) is fulfillment of the [[triangle inequality]]. For example, the strings "Sam" and "Samuel" can be considered to be close.<ref>{{cite journal
In [[mathematics]] and [[computer science]], a '''string metric''' (also known as a '''string similarity metric''' or '''string distance function''') is a [[metric (mathematics)|metric]] that measures [[distance]] ("inverse similarity") between two [[string (computer science)|text strings]] for [[approximate string matching]] or comparison and in [[approximate string matching|fuzzy string searching]]. A requirement for a string ''metric'' (e.g. in contrast to [[string matching]]) is fulfillment of the [[triangle inequality]]. For example, the strings "Sam" and "Samuel" can be considered to be close.<ref>{{cite book
| last = Lu
| last = Lu
| first = Jiaheng |display-authors=etal
| first = Jiaheng | title = Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data | chapter = String similarity measures and joins with synonyms |display-authors=etal
| title = String similarity measures and joins with synonyms
| journal = Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
| year=2013
| year=2013
| pages=373–384
| pages=373–384
| url=https://dl.acm.org/citation.cfm?id=2465313| doi = 10.1145/2463676.2465313
| chapter-url=https://dl.acm.org/citation.cfm?id=2465313| doi = 10.1145/2463676.2465313
| isbn = 9781450320375
| isbn = 9781450320375
| s2cid = 2091942 }}</ref> A string metric provides a number indicating an algorithm-specific indication of distance.
| s2cid = 2091942 }}</ref> A string metric provides a number indicating an algorithm-specific indication of distance.
Line 24: Line 22:
}}</ref> It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as [[Levenshtein distance]] have expanded to include phonetic, [[token (parser)|token]], grammatical and character-based methods of statistical comparisons.
}}</ref> It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as [[Levenshtein distance]] have expanded to include phonetic, [[token (parser)|token]], grammatical and character-based methods of statistical comparisons.


String metrics are used heavily in [[information integration]] and are currently used in areas including [[Data analysis techniques for fraud detection|fraud detection]], [[fingerprint analysis]], [[plagiarism detection]], [[ontology merging]], [[DNA analysis]], RNA analysis, [[image analysis]], evidence-based [[machine learning]], [[database]] [[data deduplication]], [[data mining]], [[incremental search]], [[data integration]], Malware Detection, <ref>{{cite journal |author1=[[Shlomi Dolev]] | last2=Mohammad |first2=Ghanayim |last3=Alexander |first3=Binun |last4=Sergey |first4=Frenkel |last5=Yeali |first5=S. Sun |title=Relationship of Jaccard and edit distance in malware clustering and online identification |journal=16th IEEE International Symposium on Network Computing and Applications |date=2017 |pages=369–373}}</ref> and semantic [[knowledge integration]].
String metrics are used heavily in [[information integration]] and are currently used in areas including [[Data analysis techniques for fraud detection|fraud detection]], [[fingerprint analysis]], [[plagiarism detection]], [[ontology merging]], [[DNA analysis]], RNA analysis, [[image analysis]], evidence-based [[machine learning]], [[database]] [[data deduplication]], [[data mining]], [[incremental search]], [[data integration]], [[Malware#Detection|malware detection]],<ref>{{cite journal |author1=[[Shlomi Dolev]] | last2=Mohammad |first2=Ghanayim |last3=Alexander |first3=Binun |last4=Sergey |first4=Frenkel |last5=Yeali |first5=S. Sun |title=Relationship of Jaccard and edit distance in malware clustering and online identification |journal=16th IEEE International Symposium on Network Computing and Applications |date=2017 |pages=369–373}}</ref> and semantic [[knowledge integration]].


==List of string metrics==
==List of string metrics==
Line 64: Line 62:
|-
|-
|[[Levenshtein distance]] and [[Damerau–Levenshtein distance]]
|[[Levenshtein distance]] and [[Damerau–Levenshtein distance]]
| Generalisation of Hamming distance that allows for different length strings, and (with Damerau) for transpositions
| Generalization of Hamming distance that allows for different length strings, and (with Damerau) for transpositions
| {{mono|'''k'''itt'''e'''n}} and {{mono|'''s'''itt'''i'''n'''g'''}} have a distance of 3.
| {{mono|'''k'''itt'''e'''n}} and {{mono|'''s'''itt'''i'''n'''g'''}} have a distance of 3.
# {{mono|'''k'''itten}} → {{mono|'''s'''itten}} (substitution of "s" for "k")
# {{mono|'''k'''itten}} → {{mono|'''s'''itten}} (substitution of "s" for "k")
Line 122: Line 120:


{{Reflist}}
{{Reflist}}



==External links==
==External links==

Revision as of 23:12, 14 May 2024

In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close.[1] A string metric provides a number indicating an algorithm-specific indication of distance.

The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance).[2] It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.

String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, incremental search, data integration, malware detection,[3] and semantic knowledge integration.

List of string metrics

There also exist functions which measure a dissimilarity between strings, but do not necessarily fulfill the triangle inequality, and as such are not metrics in the mathematical sense. An example of such function is the Jaro–Winkler distance.

Selected string measures examples

Name Description Example
Hamming distance Only for strings of the same length. Number of changed characters. "karolin" and "kathrin" is 3.
Levenshtein distance and Damerau–Levenshtein distance Generalization of Hamming distance that allows for different length strings, and (with Damerau) for transpositions kitten and sitting have a distance of 3.
  1. kittensitten (substitution of "s" for "k")
  2. sittensittin (substitution of "i" for "e")
  3. sittinsitting (insertion of "g" at the end).
Jaro–Winkler distance JaroWinklerDist("MARTHA","MARHTA") =
  • is the number of matching characters;
  • is half the number of transpositions("MARTHA"[3]!=H, "MARHTA"[3]!=T).
Most frequent k characters MostFreqKeySimilarity('research', 'seeking', 2) = 2

References

  1. ^ Lu, Jiaheng; et al. (2013). "String similarity measures and joins with synonyms". Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. pp. 373–384. doi:10.1145/2463676.2465313. ISBN 9781450320375. S2CID 2091942.
  2. ^ Navarro, Gonzalo (2001). "A guided tour to approximate string matching". ACM Computing Surveys. 33 (1): 31–88. doi:10.1145/375360.375365. hdl:10533/172862. S2CID 207551224.
  3. ^ Shlomi Dolev; Mohammad, Ghanayim; Alexander, Binun; Sergey, Frenkel; Yeali, S. Sun (2017). "Relationship of Jaccard and edit distance in malware clustering and online identification". 16th IEEE International Symposium on Network Computing and Applications: 369–373.
  4. ^ a b c d e Sam's String Metrics - Computational Linguistics and Phonetics
  5. ^ Russell, David J., et al. "A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences." BMC bioinformatics 11.1 (2010): 1-14.
  6. ^ Cohen, William; Ravikumar, Pradeep; Fienberg, Stephen (2003-08-01). "A Comparison of String Distance Metrics for Name-Matching Tasks": 73–78. {{cite journal}}: Cite journal requires |journal= (help)