From 2e6a2ca21be682173487dfb2717f48b78278df64 Mon Sep 17 00:00:00 2001
From: James Turk
Date: Wed, 25 Feb 2015 15:49:26 -0500
Subject: [PATCH] needs review, but all methods documented

---
 docs/comparison.rst                   | 76 +++++++++++++++++++++++++++
 docs/index.rst                        | 38 +++++---------
 docs/{algorithms.rst => phonetic.rst} | 28 +++-------
 docs/stemming.rst                     | 15 ++++++
 4 files changed, 111 insertions(+), 46 deletions(-)
 create mode 100644 docs/comparison.rst
 rename docs/{algorithms.rst => phonetic.rst} (76%)
 create mode 100644 docs/stemming.rst

diff --git a/docs/comparison.rst b/docs/comparison.rst
new file mode 100644
index 0000000..03aa9fc
--- /dev/null
+++ b/docs/comparison.rst
@@ -0,0 +1,76 @@
+String Comparison
+=================
+
+Levenshtein Distance
+--------------------
+
+.. py:function:: jellyfish.levenshtein_distance(s1, s2)
+
+    Compute the Levenshtein distance between s1 and s2.
+
+Levenshtein distance represents the number of insertions, deletions, and substitutions required
+to change one word to another.
+
+For example: ``levenshtein_distance('berne', 'born') == 2`` representing the transformation of the
+first e to o and the deletion of the second e.
+
+See the `Levenshtein distance article at Wikipedia `_ for more details.
+
+Damerau-Levenshtein Distance
+----------------------------
+
+.. py:function:: jellyfish.damerau_levenshtein_distance(s1, s2)
+
+    Compute the Damerau-Levenshtein distance between s1 and s2.
+
+A modification of Levenshtein distance, Damerau-Levenshtein distance counts transpositions (such as ifsh for fish) as a single edit.
+
+For example, ``levenshtein_distance('fish', 'ifsh') == 2`` because it requires a deletion and an insertion,
+whereas ``damerau_levenshtein_distance('fish', 'ifsh') == 1`` because the swap counts as a single transposition.
+
+See the `Damerau-Levenshtein distance article at Wikipedia `_ for more details.
+
+Hamming Distance
+----------------
+
+.. py:function:: jellyfish.hamming(s1, s2)
+
+    Compute the Hamming distance between s1 and s2.
+
+(TODO: fill this part in once we're sure Hamming works)
+
+See the `Hamming distance article at Wikipedia `_ for more details.
+
+Jaro Distance
+-------------
+
+.. py:function:: jellyfish.jaro_distance(s1, s2)
+
+    Compute the Jaro distance between s1 and s2.
+
+Jaro distance is a string-edit distance that gives a floating point response in [0,1] where 0 represents
+two completely dissimilar strings and 1 represents identical strings.
+
+Jaro-Winkler Distance
+---------------------
+
+.. py:function:: jellyfish.jaro_winkler(s1, s2)
+
+    Compute the Jaro-Winkler distance between s1 and s2.
+
+Jaro-Winkler is a modification of Jaro distance. Like Jaro, it gives a floating point response in [0,1] where 0 represents two completely dissimilar strings and 1 represents identical strings.
+
+See the `Jaro-Winkler distance article at Wikipedia `_ for more details.
+
+Match Rating Approach (comparison)
+----------------------------------
+
+.. py:function:: jellyfish.match_rating_comparison(s1, s2)
+
+    Compare s1 and s2 using the match rating approach algorithm. Returns ``True`` if the strings are considered equivalent or ``False`` if not. Can also return ``None`` if s1 and s2 are not comparable (length differs by more than 3).
+
+The Match rating approach algorithm is an algorithm for determining whether or not two names are
+pronounced similarly. Strings are first encoded using ``match_rating_codex`` then compared according
+to the MRA algorithm.
+
+See the `Match Rating Approach `_ for more details.
diff --git a/docs/index.rst b/docs/index.rst
index aedd7d1..e5214e3 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -6,32 +6,22 @@ Overview
 
 jellyfish is a library of functions for approximate and phonetic matching of strings.
 
-
-Included Algorithms
-~~~~~~~~~~~~~~~~~~~
-
-String Comparison:
-
-  * Levenshtein Distance
-  * Damerau-Levenshtein Distance
-  * Jaro Distance
-  * Jaro-Winkler Distance
-  * Match Rating Approach Comparison
-  * Hamming Distance
-
-Phonetic Encoding:
-
-  * American Soundex
-  * Metaphone
-  * NYSIIS (New York State Identification and Intelligence System)
-  * Match Rating Codex
-
-
-Contents
---------
+The library provides implementations of the following algorithms:
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 3
+
+   phonetic
+   stemming
+   comparison
+
+Implementation
+--------------
+
+Each algorithm has C and Python implementations; on a typical CPython install the C implementation will be used.
+
+The Python versions are available for PyPy and systems where compiling the CPython extension is not possible.
+
 
 Indices and tables
 ==================
diff --git a/docs/algorithms.rst b/docs/phonetic.rst
similarity index 76%
rename from docs/algorithms.rst
rename to docs/phonetic.rst
index 9b65f43..166d9cf 100644
--- a/docs/algorithms.rst
+++ b/docs/phonetic.rst
@@ -1,16 +1,13 @@
-Algorithms
-==========
-
 Phonetic Encoding
-~~~~~~~~~~~~~~~~~
+=================
 
 These algorithms convert a string to a normalized phonetic encoding, converting a word to a
 representation of its pronunciation.
 
 Each algorithm takes a single string and returns a coded representation.
 
-soundex (American Soundex)
---------------------------
+American Soundex
+----------------
 
 .. py:function:: jellyfish.soundex(s)
 
@@ -25,7 +22,7 @@ For example ``soundex('Ann') == soundex('Anne') == 'A500'`` and
 
 See the `Soundex article at Wikipedia `_ for more details.
 
-metaphone
+Metaphone
 ---------
 
 .. py:function:: jellyfish.metaphone(s)
@@ -40,7 +37,7 @@ For example ``metaphone('Klump') == metaphone('Clump') == 'KLMP'``.
 
 See the `Metaphone article at Wikipedia `_ for more details.
 
-nysiis
+NYSIIS
 ------
 
 .. py:function:: jellyfish.nysiis(s)
@@ -62,21 +59,8 @@ Match Rating Approach (codex)
 
 The Match rating approach algorithm is an algorithm for determining whether or not two names are
 pronounced similarly. The algorithm consists of an encoding function (similar to soundex or nysiis)
-as well as a special comparison algorithm that can be used to compare names for equality using
-the generated encoding.
+which is implemented here, as well as ``match_rating_comparison``, which does the actual comparison.
 
 See the `Match Rating Approach `_ for more details.
 
 
-porter_stem
------------
-.. py:function:: jellyfish.porter_stem(s)
-
-    Reduce the string s to its stem using the common Porter stemmer.
-
-Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'.
-
-Martin Porter's algorithm is a common algorithm used for stemming that works for many purposes.
-
-See the `official homepage for the Porter Stemming Algorithm ` for more details.
-:
diff --git a/docs/stemming.rst b/docs/stemming.rst
new file mode 100644
index 0000000..c2adf32
--- /dev/null
+++ b/docs/stemming.rst
@@ -0,0 +1,15 @@
+Stemming
+========
+
+Porter Stemmer
+--------------
+
+.. py:function:: jellyfish.porter_stem(s)
+
+    Reduce the string s to its stem using the common Porter stemmer.
+
+Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'.
+
+Martin Porter's algorithm is a common algorithm used for stemming that works for many purposes.
+
+See the `official homepage for the Porter Stemming Algorithm `_ for more details.