API differences to fuzzywuzzy
RapidFuzz provides an API very similar to fuzzywuzzy/thefuzz, making it a drop-in replacement for many projects. However, there are some differences, which are listed below:
ratio implementation
fuzzywuzzy provides two implementations of the algorithm:
- a pure Python version implemented using difflib (Ratcliff and Obershelp algorithm)
- an accelerated version using the Indel similarity (similar to the Levenshtein distance but only allows for Insertions / Deletions)
This leads to different results depending on the version in use. RapidFuzz always uses the Indel similarity, both in the pure Python fallback implementation and in the C++ based implementation, to provide consistent matching results.
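To illustrate why the two fuzzywuzzy implementations can disagree, the following sketch compares difflib's ratio with a hand-written normalized Indel similarity. The helper names `lcs_length` and `indel_ratio` are hypothetical and exist only for this example. The Indel similarity is derived here from the longest common subsequence, which is a simple way to compute it and is not a claim about how RapidFuzz computes it internally.

```python
from difflib import SequenceMatcher


def lcs_length(s1, s2):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        cur = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_ratio(s1, s2):
    """Normalized Indel similarity in the range [0, 100]."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100.0
    # Indel distance = total - 2 * LCS, so similarity = 2 * LCS / total
    return 100.0 * 2 * lcs_length(s1, s2) / total


s1 = "a1b2cXY"
s2 = "XYa3b4c"

# difflib anchors on the longest common block "XY" and then finds nothing else
difflib_score = 100 * SequenceMatcher(None, s1, s2).ratio()
# the Indel similarity can use the longer common subsequence "abc" instead
indel_score = indel_ratio(s1, s2)

print(f"difflib: {difflib_score:.2f}, Indel: {indel_score:.2f}")
```

For this pair of strings the difflib based score is lower than the Indel based score, which is the kind of inconsistency described above.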
partial_ratio implementation
fuzzywuzzy searches for the optimal matching substring and then calculates the similarity using ratio. This substring is searched for using either:
- difflib.SequenceMatcher.get_matching_blocks (based on the Ratcliff and Obershelp algorithm)
- Levenshtein.matching_blocks (backtracks the Levenshtein matrix)
This implementation has a couple of issues:
- in the pure Python implementation the automatic junk heuristic of difflib is not deactivated. This heuristic improves the performance for long strings, but can lead to completely incorrect results.
- the accelerated version backtracks the Levenshtein matrix to find the same alignment found by the Python implementation. However the algorithm just uses one of multiple optimal alignments. There is no guarantee for this alignment to include the longest common substring.
- the optimal substring is assumed to start at one of these matching_blocks. However this is not guaranteed.
RapidFuzz uses a sliding window approach (with some optimizations to skip impossible alignments) to find the optimal alignment. This approach is guaranteed to find the optimal alignment.
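The idea of a sliding window search can be sketched as a brute-force loop. This is only a minimal illustration under stated assumptions, not RapidFuzz's implementation: it skips none of the impossible alignments, the function names are hypothetical, and the handling of empty strings is an arbitrary convention chosen for the sketch.

```python
def indel_ratio(s1, s2):
    """Normalized Indel similarity in [0, 100], computed via the LCS length."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100.0
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        cur = [0]
        for j, ch2 in enumerate(s2, start=1):
            cur.append(prev[j - 1] + 1 if ch1 == ch2 else max(prev[j], cur[j - 1]))
        prev = cur
    return 100.0 * 2 * prev[-1] / total


def partial_ratio_bruteforce(s1, s2):
    """Score the shorter string against every window of the longer string."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    m = len(shorter)
    if m == 0:
        # convention for this sketch only
        return 100.0 if not longer else 0.0
    best = 0.0
    # Negative start values produce truncated windows at the beginning of the
    # longer string; slicing truncates the windows at the end automatically.
    for start in range(1 - m, len(longer)):
        window = longer[max(start, 0):start + m]
        best = max(best, indel_ratio(shorter, window))
    return best


print(partial_ratio_bruteforce("world", "hello world!"))
```

Because every window is scored, the best alignment cannot be missed. The cost is one full similarity computation per window, which is why skipping impossible alignments matters in practice.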
differences in preprocessing
fuzzywuzzy provides the function utils.full_process to preprocess strings. This function is called utils.default_process in RapidFuzz. It behaves similarly, with the only exception being that it does not provide the optional argument force_ascii, which removes any non-ASCII characters from a string.
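If a project relies on force_ascii, the behavior can be approximated by removing non-ASCII characters before the actual processor runs. The sketch below uses only the standard library. The helpers `strip_non_ascii` and `with_force_ascii` are hypothetical names introduced for this example, and `str.lower` is just a stand-in processor.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def with_force_ascii(processor):
    """Wrap a processor so that non-ASCII characters are removed first."""
    def wrapped(text):
        return processor(strip_non_ascii(text))
    return wrapped


# Stand-in processor for the demo. With RapidFuzz installed,
# utils.default_process could be wrapped in the same way.
lowercase_ascii = with_force_ascii(str.lower)
result = lowercase_ascii("Café Zürich")
print(result)
```

The wrapped callable takes a single string and returns a string, so it should be usable wherever a processor function is expected.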
differences in scorers
fuzzywuzzy has the following scorers which preprocess strings by default:
fuzz.token_sort_ratio
fuzz.token_set_ratio
fuzz.partial_token_sort_ratio
fuzz.partial_token_set_ratio
fuzz.WRatio
fuzz.QRatio
fuzz.UWRatio
fuzz.UQRatio
With the exception of fuzz.UWRatio and fuzz.UQRatio, all of them have force_ascii enabled for the preprocessing function by default.
In RapidFuzz no scorer preprocesses strings by default, to keep the interface consistent. However, a preprocessing function can be provided using the processor argument. In addition, the functions fuzz.UWRatio and fuzz.UQRatio do not exist, since they are the same as fuzz.WRatio / fuzz.QRatio with force_ascii disabled. Since the force_ascii argument does not exist in RapidFuzz, these functions would not provide any value.
differences in processor functions
In fuzzywuzzy the process module includes the following functions:
- extractWithoutOrder (generator over unsorted results)
- extract (find the N best matches in a sorted list)
- extractBests (same as extract but with an additional score_cutoff parameter to filter bad matches)
- extractOne (find best match)
- dedupe (deduplicate list)
In RapidFuzz these functions are sometimes available under different names:
- extractWithoutOrder is called extract_iter
- extract / extractBests are a single function called extract, which provides the optional score_cutoff argument
- extractOne is available under the same name
- dedupe is not available
In addition, these functions do not preprocess strings by default. However, preprocessing can be enabled using the processor argument.