# API differences to `fuzzywuzzy`

`RapidFuzz` provides an API very similar to `fuzzywuzzy`/`thefuzz`, making it a drop-in replacement for many projects.
However, there are some differences, which are listed below:
## ratio implementation

`fuzzywuzzy` provides two implementations of the `ratio` algorithm:
1) a pure Python version implemented using difflib (Ratcliff and Obershelp algorithm)
2) an accelerated version using the Indel similarity (similar to the Levenshtein distance, but only insertions and deletions are allowed)

This leads to different results depending on the version in use. `RapidFuzz` always uses the Indel similarity, both in the pure Python
fallback implementation and in the C++ based implementation, to provide consistent matching results.
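
The gap between the two `fuzzywuzzy` back ends can be seen on a small input. The following is a minimal sketch using only the standard library. The helper names `lcs_length`, `indel_ratio`, and `difflib_ratio` are made up for this illustration, and the Indel similarity is computed through the longest common subsequence as `2 * LCS / (len(a) + len(b))`; it is not a copy of either library's code.

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized Indel similarity in the range 0..100."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # assumption for this sketch: two empty strings are identical
    return 100.0 * 2 * lcs_length(a, b) / total


def difflib_ratio(a: str, b: str) -> float:
    """Ratcliff and Obershelp similarity from difflib, scaled to 0..100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


print(difflib_ratio("tide", "diet"))  # 25.0
print(difflib_ratio("diet", "tide"))  # 50.0 (the difflib result depends on argument order)
print(indel_ratio("tide", "diet"))    # 50.0
print(indel_ratio("diet", "tide"))    # 50.0
```

The difflib based score changes when the arguments are swapped, while the Indel based score does not, which is one way the two back ends can disagree.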
## partial_ratio implementation

`fuzzywuzzy` searches for the optimal matching substring and then calculates the similarity using `ratio`. This substring is searched for using either:
1) `difflib.SequenceMatcher.get_matching_blocks` (based on the Ratcliff and Obershelp algorithm)
2) `Levenshtein.matching_blocks` (backtracks the Levenshtein matrix)

This implementation has a couple of issues:
1) in the pure Python implementation the automatic junk heuristic of difflib is not deactivated. This heuristic improves the performance for long strings,
   but can lead to completely incorrect results.
2) the accelerated version backtracks the Levenshtein matrix to find the same alignment as the one found by the Python implementation. However, the algorithm just uses
   one of multiple optimal alignments. There is no guarantee that this alignment includes the longest common substring.
3) the optimal substring is assumed to start at one of these `matching_blocks`. However, this is not guaranteed.

`RapidFuzz` uses a sliding window approach (with some optimizations to skip impossible alignments) to find the optimal alignment. This approach is guaranteed
to find the optimal alignment.
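
The sliding window idea can be sketched in a few lines of pure Python. This is a brute-force illustration, not the `RapidFuzz` implementation: it omits the optimizations mentioned above, the function name `partial_ratio_sketch` is hypothetical, and the choice to also score the truncated windows at both ends of the longer string is an assumption of this sketch.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized Indel similarity (0..100) for two non-empty strings."""
    return 100.0 * 2 * lcs_length(a, b) / (len(a) + len(b))


def partial_ratio_sketch(s1: str, s2: str) -> float:
    """Best Indel similarity between the shorter string and any window of the longer one."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    m, n = len(shorter), len(longer)
    if m == 0:
        return 0.0  # edge case chosen arbitrarily for this sketch

    best = 0.0
    # A negative start yields a window truncated at the left edge,
    # a start close to n yields a window truncated at the right edge.
    for start in range(1 - m, n):
        window = longer[max(start, 0):min(start + m, n)]
        best = max(best, indel_ratio(shorter, window))
        if best == 100.0:
            break  # a perfect match cannot be improved
    return best


print(partial_ratio_sketch("abc", "xxabcxx"))    # 100.0
print(partial_ratio_sketch("abcd", "xxabxdxx"))  # 75.0
```

Because every window is scored, the result does not depend on which alignment a backtracking step happens to pick, at the cost of more comparisons than a matching-block based search.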
## differences in preprocessing

`fuzzywuzzy` provides the function `utils.full_process` to preprocess strings. This function is called `utils.default_process` in `RapidFuzz`. It behaves similarly, with the only exception
that it does not provide the optional argument `force_ascii`, which removes any non-ASCII characters from a string.
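
If a project relied on `force_ascii`, the effect described above can be approximated by removing non-ASCII characters before the regular processor runs. The sketch below uses only the standard library; `ascii_only` and `with_force_ascii` are hypothetical helper names, and the result is an approximation of the `fuzzywuzzy` behavior rather than an exact reproduction.

```python
from typing import Callable


def ascii_only(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def with_force_ascii(processor: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a processor so that non-ASCII characters are removed first."""
    def wrapped(text: str) -> str:
        return processor(ascii_only(text))
    return wrapped


# `str.lower` is only a stand-in processor to keep the sketch self-contained.
lower_ascii = with_force_ascii(str.lower)
print(ascii_only("Zürich café"))  # Zrich caf
print(lower_ascii("ZÜRICH"))      # zrich
```

With `RapidFuzz` installed, `with_force_ascii(utils.default_process)` could then be passed wherever a `processor` argument is accepted.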
## differences in scorers

`fuzzywuzzy` has the following scorers which preprocess strings by default:
- `fuzz.token_sort_ratio`
- `fuzz.token_set_ratio`
- `fuzz.partial_token_sort_ratio`
- `fuzz.partial_token_set_ratio`
- `fuzz.WRatio`
- `fuzz.QRatio`
- `fuzz.UWRatio`
- `fuzz.UQRatio`

With the exception of `fuzz.UWRatio` and `fuzz.UQRatio`, all of them have `force_ascii` enabled for the preprocessing function by default.

In `RapidFuzz` no scorer preprocesses strings by default, which keeps the interface consistent. However, a preprocessing function can be provided using the `processor` argument. In addition, the functions `fuzz.UWRatio` and `fuzz.UQRatio` do not exist, since they are the same as `fuzz.WRatio` / `fuzz.QRatio` with `force_ascii` disabled. Since the `force_ascii` argument does not exist in `RapidFuzz`, these functions would not provide any value.
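
Code that depends on the old preprocessing default can bind a processor once and reuse the resulting scorer. The following is a minimal sketch of that pattern. To stay runnable without `RapidFuzz`, it uses a stand-in scorer (`exact_match`) and a stand-in processor (`simple_process`) that only follow the same `processor` keyword convention; both, as well as `with_default_processor`, are hypothetical names for this illustration.

```python
from functools import partial


def with_default_processor(scorer, processor):
    """Return `scorer` with `processor` bound as its default preprocessing step."""
    return partial(scorer, processor=processor)


def exact_match(s1, s2, *, processor=None):
    """Stand-in scorer: 100.0 for equal strings, 0.0 otherwise."""
    if processor is not None:
        s1, s2 = processor(s1), processor(s2)
    return 100.0 if s1 == s2 else 0.0


def simple_process(text):
    """Stand-in processor: lowercase and trim surrounding whitespace."""
    return text.lower().strip()


processed_match = with_default_processor(exact_match, simple_process)

print(exact_match("Hello ", "hello"))      # 0.0, no preprocessing by default
print(processed_match("Hello ", "hello"))  # 100.0, processor applied to both strings
```

With `RapidFuzz` installed, the same helper could be applied as `with_default_processor(fuzz.WRatio, utils.default_process)`. The bound processor can still be overridden per call by passing `processor=...` explicitly.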
## differences in processor functions

In `fuzzywuzzy` the process module includes the following functions:
- `extractWithoutOrder` (generator over unsorted results)
- `extract` (find the N best matches, returned as a sorted list)
- `extractBests` (same as `extract` but with an additional `score_cutoff` parameter to filter bad matches)
- `extractOne` (find best match)
- `dedupe` (deduplicate list)

In `RapidFuzz` these functions are sometimes available under different names:
- `extractWithoutOrder` is called `extract_iter`
- `extract` / `extractBests` are a single function called `extract` which provides the optional `score_cutoff` argument
- `extractOne` is available under the same name
- `dedupe` is not available

In addition, these functions do not preprocess strings by default. However, preprocessing can be enabled using the `processor` argument.
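
The name mapping above can be collected in a small compatibility layer. The sketch below is one possible shape for it, under the assumption that the `RapidFuzz` `process` module (or any object exposing `extract_iter`, `extract`, and `extractOne`) is passed in explicitly; the function name `fuzzywuzzy_style_process` is hypothetical. `dedupe` is left out because there is no counterpart to forward to.

```python
from functools import partial
from types import SimpleNamespace


def fuzzywuzzy_style_process(process, default_processor):
    """Expose fuzzywuzzy's function names on top of a RapidFuzz-like `process` module.

    Every function gets `default_processor` bound as its `processor`, which
    mimics fuzzywuzzy's habit of preprocessing by default.
    """
    extract = partial(process.extract, processor=default_processor)
    return SimpleNamespace(
        extractWithoutOrder=partial(process.extract_iter, processor=default_processor),
        extract=extract,
        extractBests=extract,  # same function, callers pass score_cutoff=...
        extractOne=partial(process.extractOne, processor=default_processor),
    )
```

With `RapidFuzz` installed this might be used as `compat = fuzzywuzzy_style_process(process, utils.default_process)`, followed by calls such as `compat.extractBests(query, choices, score_cutoff=80)`. Return values are passed through unchanged, so any difference in the result format between the two libraries is not hidden by this layer.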