Indel
-----

Functions
^^^^^^^^^

distance
~~~~~~~~
.. autofunction:: rapidfuzz.distance.Indel.distance

normalized_distance
~~~~~~~~~~~~~~~~~~~
.. autofunction:: rapidfuzz.distance.Indel.normalized_distance

similarity
~~~~~~~~~~
.. autofunction:: rapidfuzz.distance.Indel.similarity

normalized_similarity
~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: rapidfuzz.distance.Indel.normalized_similarity

editops
~~~~~~~
.. autofunction:: rapidfuzz.distance.Indel.editops

opcodes
~~~~~~~
.. autofunction:: rapidfuzz.distance.Indel.opcodes

Performance
^^^^^^^^^^^
Since the Levenshtein module uses different implementations based on the weights
used, the performance characteristics differ between them. The following section
shows the performance for the Indel distance.

Indel
~~~~~
The following image shows a benchmark of the Indel distance in RapidFuzz
and python-Levenshtein. As with the normal Levenshtein distance,
python-Levenshtein uses an implementation with a time complexity of ``O(NM)``,
while RapidFuzz has a time complexity of ``O([N/64]M)``.

.. image:: img/indel_levenshtein.svg
    :align: center

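
To make the ``O(NM)`` baseline concrete, the following is a minimal sketch of
the textbook dynamic-programming approach. It relies on the standard identity
that the Indel distance equals ``len(s1) + len(s2) - 2 * LCS(s1, s2)``. The
function name ``indel_distance_dp`` is hypothetical and this is not the code
used by either library.

.. code-block:: python

    def indel_distance_dp(s1: str, s2: str) -> int:
        """Indel distance via the classic LCS table, O(N*M) time."""
        # prev holds the previous row of the LCS table
        prev = [0] * (len(s2) + 1)
        for ch1 in s1:
            cur = [0]
            for j, ch2 in enumerate(s2, start=1):
                if ch1 == ch2:
                    # extend the common subsequence by one character
                    cur.append(prev[j - 1] + 1)
                else:
                    # otherwise carry over the best value seen so far
                    cur.append(max(prev[j], cur[j - 1]))
            prev = cur
        lcs_len = prev[-1]
        # every character outside the LCS is either inserted or deleted
        return len(s1) + len(s2) - 2 * lcs_len


    print(indel_distance_dp("lewenstein", "levenshtein"))

Every cell of the table is visited once, which is where the ``O(NM)`` cost
comes from.
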
Implementation Notes
^^^^^^^^^^^^^^^^^^^^

The following implementation is used with a worst-case performance of ``O([N/64]M)``.

- If max is 0, the similarity can be calculated using a direct comparison,
  since no difference between the strings is allowed. The time complexity of
  this algorithm is ``O(N)``.

- If max is 1 and the two strings have the same length, the similarity can be
  calculated using a direct comparison as well, since a substitution would cause
  an edit distance higher than max. The time complexity of this algorithm
  is ``O(N)``.

- A common prefix/suffix of the two compared strings does not affect
  the Indel distance, so the affix is removed before calculating the
  similarity.

- If max is ≤ 4, the mbleven algorithm is used. This algorithm
  checks all edit operations that are possible under
  the threshold `max`. In contrast to the normal Levenshtein distance, this
  algorithm can be used up to a threshold of 4 here, since the higher weight
  of substitutions decreases the number of possible edit operations.
  The time complexity of this algorithm is ``O(N)``.

- If the length of the shorter string is ≤ 64 after removing the common affix,
  Hyyrö's LCS algorithm is used, which calculates the Indel distance in
  parallel. The algorithm is described by :cite:t:`2004:hyrroe` and is extended with support
  for UTF-32 in this implementation. The time complexity of this
  algorithm is ``O(N)``.

- If the length of the shorter string is > 64 after removing the common affix,
  a blockwise implementation of Hyyrö's LCS algorithm is used, which calculates
  the Indel distance in parallel (64 characters at a time).
  The algorithm is described by :cite:t:`2004:hyrroe`. The time complexity of this
  algorithm is ``O([N/64]M)``.

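
Several of the steps listed above can be sketched in plain Python. The example
below is illustrative only and is not the RapidFuzz source: all function names
and the ``max_dist`` parameter are hypothetical, the mbleven step is omitted,
and returning ``max_dist + 1`` when the threshold is exceeded is a convention
assumed for this sketch. The bit-parallel part follows the commonly cited
formulation of the LCS recurrence, ``S = (S + (S & M)) | (S - (S & M))``.

.. code-block:: python

    from typing import Optional, Tuple


    def remove_common_affix(s1: str, s2: str) -> Tuple[str, str]:
        """Strip the shared prefix and suffix, which do not change the distance."""
        start = 0
        while start < len(s1) and start < len(s2) and s1[start] == s2[start]:
            start += 1
        s1, s2 = s1[start:], s2[start:]
        end = 0
        while end < len(s1) and end < len(s2) and s1[-1 - end] == s2[-1 - end]:
            end += 1
        return s1[: len(s1) - end], s2[: len(s2) - end]


    def lcs_length_bitparallel(s1: str, s2: str) -> int:
        """LCS length using one bit per character of s1."""
        m = len(s1)
        mask = (1 << m) - 1
        # pattern-match vectors: bit i is set where s1[i] equals the character
        pm = {}
        for i, ch in enumerate(s1):
            pm[ch] = pm.get(ch, 0) | (1 << i)
        state = mask  # all ones: no matches recorded yet
        for ch in s2:
            u = state & pm.get(ch, 0)
            state = ((state + u) | (state - u)) & mask
        # each zero bit in the final state is one character of the LCS
        return m - bin(state).count("1")


    def indel_distance_sketch(s1: str, s2: str, max_dist: Optional[int] = None) -> int:
        # max == 0: only identical strings are within the threshold
        if max_dist == 0:
            return 0 if s1 == s2 else 1
        # max == 1 with equal lengths: any difference costs at least 2
        if max_dist == 1 and len(s1) == len(s2):
            return 0 if s1 == s2 else 2
        s1, s2 = remove_common_affix(s1, s2)
        dist = len(s1) + len(s2) - 2 * lcs_length_bitparallel(s1, s2)
        if max_dist is not None and dist > max_dist:
            return max_dist + 1  # assumed convention for "above threshold"
        return dist


    print(indel_distance_sketch("lewenstein", "levenshtein"))

Python integers have arbitrary precision, so this sketch does not distinguish
between the single-word case (≤ 64 characters) and the blockwise case. An
implementation working on fixed 64-bit words would have to propagate the carry
of the addition and the borrow of the subtraction from one block to the next.
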