diff --git a/docs/comparison.md b/docs/functions.md similarity index 51% rename from docs/comparison.md rename to docs/functions.md index 280c37f..2454200 100644 --- a/docs/comparison.md +++ b/docs/functions.md @@ -1,10 +1,12 @@ -String Comparison -================= +# Functions -These methods are all measures of the difference (aka `edit distance`) between two strings. +Jellyfish provides a variety of functions for string comparison, phonetic encoding, and stemming. -Levenshtein Distance --------------------- +## String Comparison + +These methods are all measures of the difference (aka edit distance) between two strings. + +### Levenshtein Distance ``` python def levenshtein_distance(s1: str, s2: str) @@ -18,8 +20,7 @@ For example: ``levenshtein_distance('berne', 'born') == 2`` representing the tra See the [Levenshtein distance article at Wikipedia](http://en.wikipedia.org/wiki/Levenshtein_distance) for more details. -Damerau-Levenshtein Distance ----------------------------- +### Damerau-Levenshtein Distance ``` python def damerau_levenshtein_distance(s1: str, s2: str) @@ -34,8 +35,7 @@ though ``damerau_levenshtein_distance('fish', 'ifsh') == 1`` as this counts as a See the [Damerau-Levenshtein distance article at Wikipedia](http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance) for more details. -Hamming Distance ----------------- +### Hamming Distance ``` python def hamming_distance(s1: str, s2: str) @@ -50,8 +50,7 @@ considers extra characters as differing. For example ``hamming_distance('abc', See the [Hamming distance article at Wikipedia](http://en.wikipedia.org/wiki/Hamming_distance) for more details. -Jaro Similarity ----------------- +### Jaro Similarity ``` python def jaro_similarity(s1: str, s2: str) @@ -66,8 +65,7 @@ Jaro distance is a string-edit distance that gives a floating point response in Prior to 0.8.1 this function was named jaro_distance. That name is still available, but is no longer recommended. 
It will be replaced in 1.0 with a correct version. -Jaro-Winkler Similarity ------------------------ +### Jaro-Winkler Similarity ``` python def jaro_winkler_similarity(s1: str, s2: str) @@ -84,8 +82,7 @@ Jaro-Winkler is a modification/improvement to Jaro distance, like Jaro it gives See the [Jaro-Winkler distance article at Wikipedia](http://en.wikipedia.org/wiki/Jaro-Winkler_distance) for more details. -Match Rating Approach (comparison) ----------------------------------- +### Match Rating Approach (comparison) ``` python def match_rating_comparison(s1, s2) @@ -97,3 +94,85 @@ The Match rating approach algorithm is an algorithm for determining whether or n pronounced similarly. Strings are first encoded using :py:func:`match_rating_codex` then compared according to the MRA algorithm. See the [Match Rating Approach article at Wikipedia](http://en.wikipedia.org/wiki/Match_rating_approach) for more details. + +## Phonetic Encoding + +These algorithms convert a string to a normalized phonetic encoding, converting a word to a representation of its pronunciation. Each takes a single string and returns a coded representation. + + +### American Soundex + +``` python +def soundex(s: str) +``` + +Calculate the American Soundex of the string s. + +Soundex is an algorithm to convert a word (typically a name) to a four digit code in the form +'A123' where 'A' is the first letter of the name and the digits represent similar sounds. + +For example ``soundex('Ann') == soundex('Anne') == 'A500'`` and +``soundex('Rupert') == soundex('Robert') == 'R163'``. + +See the [Soundex article at Wikipedia](http://en.wikipedia.org/wiki/Soundex) for more details. + + +### Metaphone + +``` python +def metaphone(s: str) +``` + +Calculate the metaphone code for the string s. + +The metaphone algorithm was designed as an improvement on Soundex. It transforms a word into a +string consisting of '0BFHJKLMNPRSTWXY' where '0' is pronounced 'th' and 'X' is a '[sc]h' sound. 
+ +For example ``metaphone('Klumpz') == metaphone('Clumps') == 'KLMPS'``. + +See the [Metaphone article at Wikipedia](http://en.wikipedia.org/wiki/Metaphone) for more details. + + +### NYSIIS + +``` python +def nysiis(s: str) +``` + +Calculate the NYSIIS code for the string s. + +The NYSIIS algorithm is an algorithm developed by the New York State Identification and Intelligence System. It transforms a word into a phonetic code. Like soundex and metaphone it is primarily intended for use on names (as they would be pronounced in English). + +For example ``nysiis('John') == nysiis('Jan') == 'JAN'``. + +See the [NYSIIS article at Wikipedia](http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System) for more details. + +### Match Rating Approach (codex) + +``` python +def match_rating_codex(s: str) +``` + +Calculate the match rating approach value (also called PNI) for the string s. + +The Match rating approach algorithm is an algorithm for determining whether or not two names are +pronounced similarly. The algorithm consists of an encoding function (similar to soundex or nysiis) +which is implemented here as well as :py:func:`match_rating_comparison` which does the actual comparison. + +See the [Match Rating Approach article at Wikipedia](http://en.wikipedia.org/wiki/Match_rating_approach) for more details. + +## Stemming + +### Porter Stemmer + +``` python +def porter_stem(s: str) +``` + +Reduce the string s to its stem using the common Porter stemmer. + +Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'. + +Martin Porter's algorithm is a common algorithm used for stemming that works for many purposes. + +See the [official homepage for the Porter Stemming Algorithm](http://tartarus.org/martin/PorterStemmer/) for more details. 
diff --git a/docs/phonetic.md b/docs/phonetic.md deleted file mode 100644 index 22d228e..0000000 --- a/docs/phonetic.md +++ /dev/null @@ -1,70 +0,0 @@ -Phonetic Encoding -================= - -These algorithms convert a string to a normalized phonetic encoding, converting a word to a representation of its pronunciation. Each takes a single string and returns a coded representation. - - -American Soundex ----------------- - -``` python -def soundex(s: str) -``` - -Calculate the American Soundex of the string s. - -Soundex is an algorithm to convert a word (typically a name) to a four digit code in the form -'A123' where 'A' is the first letter of the name and the digits represent similar sounds. - -For example ``soundex('Ann') == soundex('Anne') == 'A500'`` and -``soundex('Rupert') == soundex('Robert') == 'R163'``. - -See the [Soundex article at Wikipedia](http://en.wikipedia.org/wiki/Soundex) for more details. - - -Metaphone ---------- - -``` python -def metaphone(s: str) -``` - -Calculate the metaphone code for the string s. - -The metaphone algorithm was designed as an improvement on Soundex. It transforms a word into a -string consisting of '0BFHJKLMNPRSTWXY' where '0' is pronounced 'th' and 'X' is a '[sc]h' sound. - -For example ``metaphone('Klumpz') == metaphone('Clumps') == 'KLMPS'``. - -See the [Metaphone article at Wikipedia](http://en.wikipedia.org/wiki/Metaphone) for more details. - - -NYSIIS ------- - -``` python -def nysiis(s: str) -``` - -Calculate the NYSIIS code for the string s. - -The NYSIIS algorithm is an algorithm developed by the New York State Identification and Intelligence System. It transforms a word into a phonetic code. Like soundex and metaphone it is primarily intended for use on names (as they would be pronounced in English). - -For example ``nysiis('John') == nysiis('Jan') == JAN``. - -See the [NYSIIS article at Wikipedia](http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System) for more details. 
- -Match Rating Approach (codex) ------------------------------ - -``` python -def match_rating_codex(s: str) -``` - -Calculate the match rating approach value (also called PNI) for the string s. - -The Match rating approach algorithm is an algorithm for determining whether or not two names are -pronounced similarly. The algorithm consists of an encoding function (similar to soundex or nysiis) -which is implemented here as well as :py:func:`match_rating_comparison` which does the actual comparison. - -See the [Match Rating Approach article at Wikipedia](http://en.wikipedia.org/wiki/Match_rating_approach) for more details. diff --git a/docs/stemming.md b/docs/stemming.md deleted file mode 100644 index 1cb2b1a..0000000 --- a/docs/stemming.md +++ /dev/null @@ -1,17 +0,0 @@ -Stemming -======== - -Porter Stemmer --------------- - -``` python -def porter_stem(s: str) -``` - -Reduce the string s to its stem using the common Porter stemmer. - -Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'. - -Martin Porter's algorithm is a common algorithm used for stemming that works for many purposes. - -See the [official homepage for the Porter Stemming Algorithm](http://tartarus.org/martin/PorterStemmer/) for more details. diff --git a/mkdocs.yml b/mkdocs.yml index 2a78702..23ed88c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -46,7 +46,5 @@ extra_css: - assets/extra.css nav: - 'index.md' - - 'phonetic.md' - - 'comparison.md' - - 'stemming.md' + - 'functions.md' - 'changelog.md'
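
Note on the examples quoted in the merged `docs/functions.md`: the Levenshtein and Soundex values can be sanity-checked with a small pure-Python sketch. This is only an illustration under stated assumptions, not Jellyfish's own code. The names `levenshtein_sketch` and `soundex_sketch` are hypothetical, and the Soundex rules are the usual American Soundex rules as commonly described, which may differ from the library in edge cases.

``` python
# Hypothetical helpers for illustration only; not part of Jellyfish.

def levenshtein_sketch(s1: str, s2: str) -> int:
    """Count the insertions, deletions and substitutions needed to turn s1 into s2."""
    # previous[j] is the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


# Consonant groups used by American Soundex (assumed standard table).
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex_sketch(name: str) -> str:
    """Return a letter followed by three digits, e.g. 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    last = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _SOUNDEX_CODES.get(ch, "")
        # Emit a digit only when it differs from the previous code.
        if code and code != last:
            result += code
        # H and W do not separate equal codes; vowels reset the previous code.
        if ch not in "HW":
            last = code
    # Pad with zeros and truncate to four characters.
    return (result + "000")[:4]


# Values quoted in the documentation above.
assert levenshtein_sketch("berne", "born") == 2
assert soundex_sketch("Ann") == soundex_sketch("Anne") == "A500"
assert soundex_sketch("Rupert") == soundex_sketch("Robert") == "R163"
```

The sketch keeps only two rows of the edit-distance table at a time, which is enough to return the distance but not the edit sequence itself.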