needs review, but all methods documented
parent 2c7234092e
commit 2e6a2ca21b

@@ -0,0 +1,76 @@
String Comparison
=================

Levenshtein Distance
--------------------

.. py:function:: jellyfish.levenshtein_distance(s1, s2)

    Compute the Levenshtein distance between s1 and s2.

Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions required
to change one word into another.

For example, ``levenshtein_distance('berne', 'born') == 2``, representing the substitution of o for the
first e and the deletion of the second e.

See the `Levenshtein distance article at Wikipedia <http://en.wikipedia.org/wiki/Levenshtein_distance>`_ for more details.
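
The edit counting described above can be illustrated with a small dynamic-programming sketch in pure Python. The name ``levenshtein_sketch`` is hypothetical and this is not jellyfish's own implementation; it only shows the textbook recurrence under the assumption that each insertion, deletion, and substitution costs 1.

```python
def levenshtein_sketch(s1, s2):
    """Hypothetical helper: textbook Levenshtein distance, row by row."""
    # previous[j] is the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into an empty prefix takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch('berne', 'born'))
```

Keeping only two rows of the table is enough here because each cell depends only on the current and the previous row.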

Damerau-Levenshtein Distance
----------------------------

.. py:function:: jellyfish.damerau_levenshtein_distance(s1, s2)

    Compute the Damerau-Levenshtein distance between s1 and s2.

A modification of Levenshtein distance, Damerau-Levenshtein distance counts transpositions of adjacent characters (such as ifsh for fish) as a single edit.

For example, ``levenshtein_distance('fish', 'ifsh') == 2``, as it would require a deletion and an insertion,
whereas ``damerau_levenshtein_distance('fish', 'ifsh') == 1``, as the swap counts as a single transposition.

See the `Damerau-Levenshtein distance article at Wikipedia <http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance>`_ for more details.
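
To make the effect of the transposition rule concrete, here is a minimal sketch of the optimal string alignment variant, a simpler, restricted relative of Damerau-Levenshtein distance. The name ``osa_distance_sketch`` is hypothetical, and the sketch is not claimed to match jellyfish's implementation; the restricted variant can give larger results than the unrestricted distance on some inputs.

```python
def osa_distance_sketch(s1, s2):
    """Hypothetical helper: optimal string alignment (restricted Damerau-Levenshtein)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of the s1 prefix
    for j in range(cols):
        d[0][j] = j  # insert every character of the s2 prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped: count the swap as one edit.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance_sketch('fish', 'ifsh'))
```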

Hamming Distance
----------------

.. py:function:: jellyfish.hamming(s1, s2)

    Compute the Hamming distance between s1 and s2.

(TODO: fill this part in once we're sure Hamming works)

See the `Hamming distance article at Wikipedia <http://en.wikipedia.org/wiki/Hamming_distance>`_ for more details.
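
For orientation, the textbook definition of Hamming distance counts the positions at which two equal-length strings differ. The sketch below uses a hypothetical name, ``hamming_sketch``, and assumes the strict equal-length definition; how ``jellyfish.hamming`` treats strings of different lengths is not stated in this section.

```python
def hamming_sketch(s1, s2):
    """Hypothetical helper: textbook Hamming distance for equal-length strings."""
    if len(s1) != len(s2):
        # Assumption: follow the strict definition and reject unequal lengths.
        raise ValueError("textbook Hamming distance needs equal-length strings")
    # Count positions where the paired characters differ.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_sketch('karolin', 'kathrin'))
```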

Jaro Distance
-------------

.. py:function:: jellyfish.jaro_distance(s1, s2)

    Compute the Jaro distance between s1 and s2.

Jaro distance is a string-edit distance that returns a floating-point value in [0, 1], where 0 represents
two completely dissimilar strings and 1 represents identical strings.
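
As a rough illustration of where the [0, 1] value comes from, the following sketch follows the commonly published Jaro formulation: characters match if they are equal and close enough in position, and matched characters that appear in a different order count as transpositions. The name ``jaro_sketch`` is hypothetical, and the sketch is not claimed to reproduce jellyfish's results in every edge case.

```python
def jaro_sketch(s1, s2):
    """Hypothetical helper: Jaro score as commonly published."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters may match only within this many positions of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters, in order of appearance in each string.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    # Each out-of-order pair is half a transposition.
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro_sketch('MARTHA', 'MARHTA'), 4))
```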

Jaro-Winkler Distance
---------------------

.. py:function:: jellyfish.jaro_winkler(s1, s2)

    Compute the Jaro-Winkler distance between s1 and s2.

Jaro-Winkler is a modification of Jaro distance. Like Jaro, it returns a floating-point value in [0, 1], where 0 represents two completely dissimilar strings and 1 represents identical strings.

See the `Jaro-Winkler distance article at Wikipedia <http://en.wikipedia.org/wiki/Jaro-Winkler_distance>`_ for more details.
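
In the commonly published formulation, Winkler's modification raises a Jaro score when the two strings share a common prefix. The sketch below applies that adjustment to an already computed Jaro score. The function name ``winkler_adjust_sketch`` and the parameter names are hypothetical; the scaling factor of 0.1 and the prefix cap of 4 are the usual textbook defaults, not values confirmed for jellyfish.

```python
def winkler_adjust_sketch(jaro_score, s1, s2, scaling=0.1, max_prefix=4):
    """Hypothetical helper: boost a Jaro score for a shared prefix."""
    # Length of the common prefix, capped at max_prefix characters.
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    # Move the score toward 1 in proportion to the shared prefix length.
    return jaro_score + prefix * scaling * (1.0 - jaro_score)


# 17/18 is the Jaro score of 'MARTHA' and 'MARHTA' under the usual formulation.
print(round(winkler_adjust_sketch(17 / 18, 'MARTHA', 'MARHTA'), 4))
```

With the default scaling and prefix cap the adjusted value cannot exceed 1, so the result stays inside the same [0, 1] range as the Jaro score.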

Match Rating Approach (comparison)
----------------------------------

.. py:function:: jellyfish.match_rating_comparison(s1, s2)

    Compare s1 and s2 using the match rating approach algorithm. Returns ``True`` if the strings are considered equivalent or ``False`` if not. Returns ``None`` if s1 and s2 are not comparable (length differs by more than 3).

The match rating approach (MRA) is an algorithm for determining whether or not two names are
pronounced similarly. Strings are first encoded using ``match_rating_codex``, then compared according
to the MRA algorithm.

See the `Match Rating Approach article at Wikipedia <http://en.wikipedia.org/wiki/Match_rating_approach>`_ for more details.

@@ -6,32 +6,22 @@ Overview

jellyfish is a library of functions for approximate and phonetic matching of strings.


Included Algorithms
~~~~~~~~~~~~~~~~~~~

String Comparison:

* Levenshtein Distance
* Damerau-Levenshtein Distance
* Jaro Distance
* Jaro-Winkler Distance
* Match Rating Approach Comparison
* Hamming Distance

Phonetic Encoding:

* American Soundex
* Metaphone
* NYSIIS (New York State Identification and Intelligence System)
* Match Rating Codex


Contents
--------
The library provides implementations of the following algorithms:

.. toctree::
   :maxdepth: 2
   :maxdepth: 3

   phonetic
   stemming
   comparison

Implementation
--------------

Each algorithm has C and Python implementations. On a typical CPython install, the C implementation will be used.

The Python versions are available for PyPy and systems where compiling the CPython extension is not possible.


Indices and tables
==================

@@ -1,16 +1,13 @@
Algorithms
==========

Phonetic Encoding
~~~~~~~~~~~~~~~~~
=================

These algorithms convert a string to a normalized phonetic encoding, that is, a
representation of the word's pronunciation. Each algorithm takes a single string and returns a coded
representation.


soundex (American Soundex)
--------------------------
American Soundex
----------------

.. py:function:: jellyfish.soundex(s)

@@ -25,7 +22,7 @@ For example ``soundex('Ann') == soundex('Anne') == 'A500'`` and
See the `Soundex article at Wikipedia <http://en.wikipedia.org/wiki/Soundex>`_ for more details.
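
To show what "a coded representation" means in practice, here is a minimal pure-Python sketch of the commonly published American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse adjacent repeats, and pad to four characters. The name ``soundex_sketch`` is hypothetical and the code is not jellyfish's implementation; non-ASCII letters are simply treated like vowels here, which is an assumption of this sketch.

```python
# Digit assigned to each consonant group in American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex_sketch(name):
    """Hypothetical helper: four-character American Soundex code."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")  # vowels and other letters have no code
        if code and code != previous:
            result += code
        previous = code
    # Pad with zeros or truncate to one letter plus three digits.
    return (result + "000")[:4]


print(soundex_sketch('Ann'), soundex_sketch('Anne'))
```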


metaphone
Metaphone
---------

.. py:function:: jellyfish.metaphone(s)

@@ -40,7 +37,7 @@ For example ``metaphone('Klump') == metaphone('Clump') == 'KLMP'``.
See the `Metaphone article at Wikipedia <http://en.wikipedia.org/wiki/Metaphone>`_ for more details.


nysiis
NYSIIS
------

.. py:function:: jellyfish.nysiis(s)

@@ -62,21 +59,8 @@ Match Rating Approach (codex)

The Match rating approach algorithm is an algorithm for determining whether or not two names are
pronounced similarly. The algorithm consists of an encoding function (similar to soundex or nysiis)
as well as a special comparison algorithm that can be used to compare names for equality using
the generated encoding.
which is implemented here as well as ``match_rating_comparison`` which does the actual comparison.

See the `Match Rating Approach <http://en.wikipedia.org/wiki/Match_rating_approach>`_ for more details.

porter_stem
-----------

.. py:function:: jellyfish.porter_stem(s)

    Reduce the string s to its stem using the common Porter stemmer.

Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'.

Martin Porter's algorithm is a common algorithm used for stemming that works for many purposes.

See the `official homepage for the Porter Stemming Algorithm <http://tartarus.org/martin/PorterStemmer/>`_ for more details.

@@ -0,0 +1,15 @@
Stemming
========

Porter Stemmer
--------------

.. py:function:: jellyfish.porter_stem(s)

    Reduce the string s to its stem using the common Porter stemmer.

Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'.

Martin Porter's algorithm is a common algorithm used for stemming that works for many purposes.

See the `official homepage for the Porter Stemming Algorithm <http://tartarus.org/martin/PorterStemmer/>`_ for more details.