needs review, but all methods documented

2015-02-25 15:49:26 -05:00 · 2015-02-25 15:49:26 -05:00 · 2e6a2ca21b
parent 2c7234092e
commit 2e6a2ca21b
4 changed files with 111 additions and 46 deletions
--- a/docs/comparison.rst
+++ b/docs/comparison.rst
@ -0,0 +1,76 @@
+String Comparison
+=================
+
+Levenshtein Distance
+--------------------
+
+.. py:function:: jellyfish.levenshtein_distance(s1, s2)
+
+    Compute the Levenshtein distance between s1 and s2.
+
+Levenshtein distance is the minimum number of single-character insertions, deletions, and
+substitutions required to change one word into another.
+
+For example, ``levenshtein_distance('berne', 'born') == 2``: the first e is changed to o and
+the second e is deleted.
+
+See the `Levenshtein distance article at Wikipedia <http://en.wikipedia.org/wiki/Levenshtein_distance>`_ for more details.
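+
+The definition above can be sketched in pure Python. This is a minimal textbook
+dynamic-programming version for illustration only; the name ``levenshtein`` is hypothetical
+and this is not presented as the code jellyfish itself uses.
+
```python
def levenshtein(s1, s2):
    """Textbook Levenshtein distance (illustrative sketch, not jellyfish's code)."""
    # previous[j] is the distance between the prefix of s1 handled so far
    # (one character shorter than the current row) and the first j chars of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into '' takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein('berne', 'born'))
```
+
+Only two rows of the table are kept at a time, so memory use grows with the length of ``s2``
+rather than with the product of both lengths.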
+
+Damerau-Levenshtein Distance
+----------------------------
+
+.. py:function:: jellyfish.damerau_levenshtein_distance(s1, s2)
+
+    Compute the Damerau-Levenshtein distance between s1 and s2.
+
+A modification of Levenshtein distance, Damerau-Levenshtein distance counts transpositions of adjacent characters (such as ifsh for fish) as a single edit.
+
+For example, ``levenshtein_distance('fish', 'ifsh') == 2`` because it requires a deletion and an insertion,
+whereas ``damerau_levenshtein_distance('fish', 'ifsh') == 1`` because the swap counts as a single transposition.
+
+See the `Damerau-Levenshtein distance article at Wikipedia <http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance>`_ for more details.
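+
+The effect of counting a transposition as one edit can be sketched as follows. This sketch
+uses the simpler restricted variant (often called optimal string alignment), which is an
+assumption made for brevity: it can differ from the unrestricted Damerau-Levenshtein
+distance on some inputs, and the name ``osa_distance`` is hypothetical.
+
```python
def osa_distance(s1, s2):
    """Restricted Damerau-Levenshtein (optimal string alignment) sketch."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] is the distance between the first i chars of s1
    # and the first j chars of s2.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (free on a match)
            )
            # Adjacent characters swapped: count the swap as one edit.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance('fish', 'ifsh'))
```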
+
+Hamming Distance
+----------------
+
+.. py:function:: jellyfish.hamming(s1, s2)
+
+    Compute the Hamming distance between s1 and s2.
+
+(TODO: fill this part in once we're sure Hamming works)
+
+See the `Hamming distance article at Wikipedia <http://en.wikipedia.org/wiki/Hamming_distance>`_ for more details.
+
+Jaro Distance
+-------------
+
+.. py:function:: jellyfish.jaro_distance(s1, s2)
+
+    Compute the Jaro distance between s1 and s2.
+
+Jaro distance is a string-edit distance that returns a floating point value in [0, 1], where 0 represents
+two completely dissimilar strings and 1 represents identical strings.
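+
+To illustrate the [0, 1] range, here is a minimal sketch of the commonly published Jaro
+formula. The function name ``jaro`` and the handling of empty strings are assumptions made
+for this example, and jellyfish's own implementation may differ in details.
+
```python
def jaro(s1, s2):
    """Textbook Jaro similarity (illustrative sketch, not jellyfish's code)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order are transpositions;
    # each out-of-order pair counts once, hence the division by two.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


print(jaro('abc', 'abc'), jaro('abc', 'xyz'))
```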
+
+Jaro-Winkler Distance
+---------------------
+
+.. py:function:: jellyfish.jaro_winkler(s1, s2)
+
+    Compute the Jaro-Winkler distance between s1 and s2.
+
+Jaro-Winkler is a modification of Jaro distance. Like Jaro, it returns a floating point value in [0, 1], where 0 represents two completely dissimilar strings and 1 represents identical strings.
+
+See the `Jaro-Winkler distance article at Wikipedia <http://en.wikipedia.org/wiki/Jaro-Winkler_distance>`_ for more details.
+
+Match Rating Approach (comparison)
+----------------------------------
+
+.. py:function:: jellyfish.match_rating_comparison(s1, s2)
+
+    Compare s1 and s2 using the match rating approach algorithm. Returns ``True`` if the strings are considered equivalent, ``False`` if they are not, or ``None`` if s1 and s2 are not comparable (length differs by more than 3).
+
+The Match rating approach algorithm is an algorithm for determining whether or not two names are
+pronounced similarly.  Strings are first encoded using ``match_rating_codex`` then compared according
+to the MRA algorithm.
+
+See the `Match Rating Approach <http://en.wikipedia.org/wiki/Match_rating_approach>`_ for more details.
--- a/docs/index.rst
+++ b/docs/index.rst
@ -6,32 +6,22 @@ Overview

 jellyfish is a library of functions for approximate and phonetic matching of strings.

-
-Included Algorithms
-~~~~~~~~~~~~~~~~~~~
-
-String Comparison:
-
-    * Levenshtein Distance
-    * Damerau-Levenshtein Distance
-    * Jaro Distance
-    * Jaro-Winkler Distance
-    * Match Rating Approach Comparison
-    * Hamming Distance
-
-Phonetic Encoding:
-
-    * American Soundex
-    * Metaphone
-    * NYSIIS (New York State Identification and Intelligence System)
-    * Match Rating Codex
-
-
-Contents
--------
+The library provides implementations of the following algorithms:

 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 3
+
+   phonetic
+   stemming
+   comparison
+
+Implementation
+--------------
+
+Each algorithm has C and Python implementations. On a typical CPython install the C implementation will be used.
+
+The Python versions are available for PyPy and systems where compiling the CPython extension is not possible.
+

 Indices and tables
 ==================
--- a/docs/algorithms.rst
+++ b/docs/algorithms.rst
@ -1,16 +1,13 @@
-Algorithms
-==========
-
 Phonetic Encoding
-~~~~~~~~~~~~~~~~~
+=================

 These algorithms convert a string to a normalized phonetic encoding, converting a word to a
 representation of its pronunciation.  Each algorithm takes a single string and returns a coded
 representation.


-soundex (American Soundex)
--------------------------
+American Soundex
+----------------

 .. py:function:: jellyfish.soundex(s)

@ -25,7 +22,7 @@ For example ``soundex('Ann') == soundex('Anne') == 'A500'`` and
 See the `Soundex article at Wikipedia <http://en.wikipedia.org/wiki/Soundex>`_ for more details.


-metaphone
+Metaphone
 ---------

 .. py:function:: jellyfish.metaphone(s)
@ -40,7 +37,7 @@ For example ``metaphone('Klump') == metaphone('Clump') == 'KLMP'``.
 See the `Metaphone article at Wikipedia <http://en.wikipedia.org/wiki/Metaphone>`_ for more details.


-nysiis
+NYSIIS
 ------

 .. py:function:: jellyfish.nysiis(s)
@ -62,21 +59,8 @@ Match Rating Approach (codex)

 The Match rating approach algorithm is an algorithm for determining whether or not two names are
 pronounced similarly.  The algorithm consists of an encoding function (similar to soundex or nysiis)
-as well as a special comparison algorithm that can be used to compare names for equality using
-the generated encoding.
+which is implemented here, as well as ``match_rating_comparison``, which does the actual comparison.

 See the `Match Rating Approach <http://en.wikipedia.org/wiki/Match_rating_approach>`_ for more details.

-porter_stem
-----------

-.. py:function:: jellyfish.porter_stem(s)
-
-    Reduce the string s to its stem using the common Porter stemmer.
-
-Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'.
-
-Martin Porter's algorithm is a common algorithm used for stemming that works for many purposes.
-
-See the `official homepage for the Porter Stemming Algorithm <http://tartarus.org/martin/PorterStemmer/>` for more details.
-:
--- a/docs/stemming.rst
+++ b/docs/stemming.rst
@ -0,0 +1,15 @@
+Stemming
+========
+
+Porter Stemmer
+--------------
+
+.. py:function:: jellyfish.porter_stem(s)
+
+    Reduce the string s to its stem using the common Porter stemmer.
+
+Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'.
+
+Martin Porter's algorithm is a common algorithm used for stemming that works for many purposes.
+
+See the `official homepage for the Porter Stemming Algorithm <http://tartarus.org/martin/PorterStemmer/>`_ for more details.