needs review, but all methods documented

James Turk 2015-02-25 15:49:26 -05:00
parent 2c7234092e
commit 2e6a2ca21b
4 changed files with 111 additions and 46 deletions

76
docs/comparison.rst Normal file

@ -0,0 +1,76 @@
String Comparison
=================
Levenshtein Distance
--------------------
.. py:function:: jellyfish.levenshtein_distance(s1, s2)
Compute the Levenshtein distance between s1 and s2.
Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions
required to change one string into another.
For example, ``levenshtein_distance('berne', 'born') == 2``, representing the substitution of the
first e with o and the deletion of the second e.
See the `Levenshtein distance article at Wikipedia <http://en.wikipedia.org/wiki/Levenshtein_distance>`_ for more details.
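To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach. The name ``levenshtein_sketch`` is made up for illustration; this is not the library's implementation and is only meant to show how the edit count is derived.

```python
def levenshtein_sketch(s1, s2):
    """Illustrative edit distance: insertions, deletions, substitutions."""
    # previous[j] is the distance between the prefix of s1 processed so far
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into an empty string costs i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 with c2 (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch('berne', 'born'))  # 2
```

Only two rows of the table are kept at a time, so memory use grows with the length of ``s2`` rather than with the product of both lengths.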
Damerau-Levenshtein Distance
----------------------------
.. py:function:: jellyfish.damerau_levenshtein_distance(s1, s2)
Compute the Damerau-Levenshtein distance between s1 and s2.
A modification of Levenshtein distance, Damerau-Levenshtein distance counts transpositions of adjacent characters (such as ifsh for fish) as a single edit.
``levenshtein_distance('fish', 'ifsh') == 2``, since plain Levenshtein distance requires a deletion and an insertion,
whereas ``damerau_levenshtein_distance('fish', 'ifsh') == 1``, since the swap counts as a single transposition.
See the `Damerau-Levenshtein distance article at Wikipedia <http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance>`_ for more details.
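The effect of the transposition rule can be sketched in pure Python. Note the assumption: the code below implements the restricted variant (often called optimal string alignment), which is simpler than the unrestricted algorithm and may give different results from the library for some inputs. The name ``osa_distance_sketch`` is hypothetical.

```python
def osa_distance_sketch(s1, s2):
    """Illustrative restricted Damerau-Levenshtein (optimal string alignment)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] is the distance between s1[:i] and s2[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent characters swapped: count the swap as one edit.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance_sketch('fish', 'ifsh'))  # 1
```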
Hamming Distance
----------------
.. py:function:: jellyfish.hamming(s1, s2)
Compute the Hamming distance between s1 and s2.
(TODO: fill this part in once we're sure Hamming works)
See the `Hamming distance article at Wikipedia <http://en.wikipedia.org/wiki/Hamming_distance>`_ for more details.
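Until the description above is filled in, the classic textbook definition can be sketched as follows. This is an assumption-laden illustration, not a statement about the library: it uses the hypothetical name ``hamming_sketch`` and rejects strings of unequal length, whereas how ``jellyfish.hamming`` treats such input is not specified here.

```python
def hamming_sketch(s1, s2):
    """Illustrative Hamming distance: count of positions that differ."""
    # Assumption: the classic definition only covers equal-length strings.
    if len(s1) != len(s2):
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_sketch('karolin', 'kathrin'))  # 3
```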
Jaro Distance
-------------
.. py:function:: jellyfish.jaro_distance(s1, s2)
Compute the Jaro distance between s1 and s2.
Jaro distance is a string-edit distance that gives a floating point response in [0,1] where 0 represents
two completely dissimilar strings and 1 represents identical strings. Despite the name, higher values mean
the strings are more alike.
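As a rough illustration of how a score in [0,1] arises, the sketch below follows the commonly published formulation of the Jaro measure: count characters that match within a sliding window, count how many of those are out of order, and average three ratios. The name ``jaro_sketch`` is hypothetical and the code is not taken from the library, so edge cases may differ.

```python
def jaro_sketch(s1, s2):
    """Illustrative Jaro measure: 0.0 = nothing in common, 1.0 = identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        start = max(0, i - window)
        stop = min(len2, i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Compare matched characters in order; each out-of-order pair is
    # half a transposition.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro_sketch('MARTHA', 'MARHTA'), 4))  # 0.9444
```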
Jaro-Winkler Distance
---------------------
.. py:function:: jellyfish.jaro_winkler(s1, s2)
Compute the Jaro-Winkler distance between s1 and s2.
Jaro-Winkler is a modification of Jaro distance that gives extra weight to strings sharing a common prefix. Like Jaro, it gives a floating point response in [0,1] where 0 represents two completely dissimilar strings and 1 represents identical strings.
See the `Jaro-Winkler distance article at Wikipedia <http://en.wikipedia.org/wiki/Jaro-Winkler_distance>`_ for more details.
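The Winkler adjustment itself is small enough to sketch on its own. The hypothetical helper below takes an already computed Jaro score and boosts it according to the length of the shared prefix. The scaling factor of 0.1 and the prefix cap of 4 are the commonly cited defaults, assumed here rather than taken from the library; some implementations also apply the boost only above a score threshold, which this sketch omits.

```python
def winkler_adjust_sketch(jaro_score, s1, s2, scaling=0.1, max_prefix=4):
    """Illustrative Winkler boost applied to an existing Jaro score."""
    # Count leading characters the two strings share, up to max_prefix.
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    # Move the score toward 1.0 in proportion to the shared prefix length.
    return jaro_score + prefix * scaling * (1.0 - jaro_score)


# 17/18 is the Jaro score usually quoted for this pair.
print(round(winkler_adjust_sketch(17 / 18, 'MARTHA', 'MARHTA'), 4))  # 0.9611
```

Because the boost is a fraction of the remaining gap to 1.0, the result stays inside [0,1] whenever the input score does.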
Match Rating Approach (comparison)
----------------------------------
.. py:function:: jellyfish.match_rating_comparison(s1, s2)
Compare s1 and s2 using the match rating approach algorithm. Returns ``True`` if the strings are considered equivalent, ``False`` if not, or ``None`` if s1 and s2 are not comparable (length differs by more than 3).
The Match Rating Approach is an algorithm for determining whether two names are
pronounced similarly. Strings are first encoded using ``match_rating_codex``, then compared according
to the MRA algorithm.
See the `Match Rating Approach article at Wikipedia <http://en.wikipedia.org/wiki/Match_rating_approach>`_ for more details.


@ -6,32 +6,22 @@ Overview
jellyfish is a library of functions for approximate and phonetic matching of strings.
Included Algorithms
~~~~~~~~~~~~~~~~~~~
String Comparison:
* Levenshtein Distance
* Damerau-Levenshtein Distance
* Jaro Distance
* Jaro-Winkler Distance
* Match Rating Approach Comparison
* Hamming Distance
Phonetic Encoding:
* American Soundex
* Metaphone
* NYSIIS (New York State Identification and Intelligence System)
* Match Rating Codex
Contents
--------
The library provides implementations of the following algorithms:
.. toctree::
:maxdepth: 2
:maxdepth: 3
phonetic
stemming
comparison
Implementation
--------------
Each algorithm has C and Python implementations; on a typical CPython install the C implementation will be used.
The Python versions are available for PyPy and systems where compiling the CPython extension is not possible.
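A common way to arrange this kind of fallback is a guarded import, sketched below. Everything here is an assumption for illustration: ``_hypothetical_cext`` is a made-up module name, ``compare`` is a placeholder function, and the library's actual module layout is not described in this document.

```python
try:
    # Hypothetical compiled extension; the name is a placeholder, not the
    # library's real module.
    from _hypothetical_cext import compare
    IMPLEMENTATION = "c"
except ImportError:
    # Pure-Python fallback with the same call signature.
    def compare(s1, s2):
        """Placeholder comparison: True when the strings are equal."""
        return s1 == s2

    IMPLEMENTATION = "python"

print(IMPLEMENTATION)
```

Callers use ``compare`` without needing to know which version was loaded, which is what lets the same code run on PyPy or on systems without a compiler.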
Indices and tables
==================


@ -1,16 +1,13 @@
Algorithms
==========
Phonetic Encoding
~~~~~~~~~~~~~~~~~
=================
These algorithms convert a string to a normalized phonetic encoding, converting a word to a
representation of its pronunciation. Each algorithm takes a single string and returns a coded
representation.
soundex (American Soundex)
--------------------------
American Soundex
----------------
.. py:function:: jellyfish.soundex(s)
@ -25,7 +22,7 @@ For example ``soundex('Ann') == soundex('Anne') == 'A500'`` and
See the `Soundex article at Wikipedia <http://en.wikipedia.org/wiki/Soundex>`_ for more details.
metaphone
Metaphone
---------
.. py:function:: jellyfish.metaphone(s)
@ -40,7 +37,7 @@ For example ``metaphone('Klump') == metaphone('Clump') == 'KLMP'``.
See the `Metaphone article at Wikipedia <http://en.wikipedia.org/wiki/Metaphone>`_ for more details.
nysiis
NYSIIS
------
.. py:function:: jellyfish.nysiis(s)
@ -62,21 +59,8 @@ Match Rating Approach (codex)
The Match rating approach algorithm is an algorithm for determining whether or not two names are
pronounced similarly. The algorithm consists of an encoding function (similar to soundex or nysiis)
as well as a special comparison algorithm that can be used to compare names for equality using
the generated encoding.
which is implemented here as well as ``match_rating_comparison`` which does the actual comparison.
See the `Match Rating Approach <http://en.wikipedia.org/wiki/Match_rating_approach>`_ for more details.
porter_stem
-----------
.. py:function:: jellyfish.porter_stem(s)
Reduce the string s to its stem using the common Porter stemmer.
Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'.
Martin Porter's algorithm is a common algorithm used for stemming that works for many purposes.
See the `official homepage for the Porter Stemming Algorithm <http://tartarus.org/martin/PorterStemmer/>` for more details.

15
docs/stemming.rst Normal file

@ -0,0 +1,15 @@
Stemming
========
Porter Stemmer
--------------
.. py:function:: jellyfish.porter_stem(s)
Reduce the string s to its stem using the common Porter stemmer.
Stemming is the process of reducing a word to its root form, for example reducing 'stemmed' to 'stem'.
Martin Porter's algorithm is a widely used stemming algorithm that works for many purposes.
See the `official homepage for the Porter Stemming Algorithm <http://tartarus.org/martin/PorterStemmer/>`_ for more details.