Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
import importlib
|
|
|
|
import random
|
2022-10-02 08:06:27 +00:00
|
|
|
import string
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
from timeit import timeit
|
|
|
|
|
2022-10-02 08:06:27 +00:00
|
|
|
import matplotlib
|
|
|
|
import matplotlib.pyplot as plt
|
|
|
|
import numpy as np
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
|
|
|
|
random.seed(18)
|
|
|
|
|
2022-10-02 08:24:00 +00:00
|
|
|
plt.rc("font", size=13) # controls default text sizes
|
|
|
|
plt.rc("axes", titlesize=18) # fontsize of the axes title
|
|
|
|
plt.rc("axes", labelsize=15) # fontsize of the x and y labels
|
|
|
|
plt.rc("xtick", labelsize=15) # fontsize of the tick labels
|
|
|
|
plt.rc("ytick", labelsize=15) # fontsize of the tick labels
|
|
|
|
plt.rc("legend", fontsize=15) # legend fontsize
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
|
|
|
|
PROCESSOR = {
|
2022-10-02 08:24:00 +00:00
|
|
|
"ratio": False,
|
|
|
|
"partial_ratio": False,
|
|
|
|
"token_sort_ratio": True,
|
|
|
|
"token_set_ratio": True,
|
|
|
|
"partial_token_sort_ratio": True,
|
|
|
|
"partial_token_set_ratio": True,
|
|
|
|
"QRatio": True,
|
|
|
|
"WRatio": True,
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
LIBRARIES = (
|
|
|
|
"ratio",
|
|
|
|
"partial_ratio",
|
|
|
|
"token_sort_ratio",
|
|
|
|
"token_set_ratio",
|
|
|
|
"partial_token_sort_ratio",
|
|
|
|
"partial_token_set_ratio",
|
|
|
|
"QRatio",
|
|
|
|
"WRatio",
|
|
|
|
)
|
|
|
|
|
2022-10-02 08:24:00 +00:00
|
|
|
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
def load_func(target):
|
2022-10-02 08:24:00 +00:00
|
|
|
modname, funcname = target.rsplit(".", maxsplit=1)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
|
|
|
|
module = importlib.import_module(modname)
|
|
|
|
return getattr(module, funcname)
|
|
|
|
|
2022-10-02 08:24:00 +00:00
|
|
|
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
def get_platform():
|
|
|
|
import platform
|
2022-10-02 08:24:00 +00:00
|
|
|
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
uname = platform.uname()
|
|
|
|
pyver = platform.python_version()
|
2022-10-02 08:24:00 +00:00
|
|
|
return f"Python {pyver} on {uname.system} ({uname.machine})"
|
|
|
|
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
|
|
|
|
def benchmark():
|
2022-10-02 08:24:00 +00:00
|
|
|
words = [
|
|
|
|
"".join(random.choice(string.ascii_letters + string.digits) for _ in range(8))
|
|
|
|
for _ in range(10000)
|
|
|
|
]
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
sample_rate = len(words) // 100
|
|
|
|
sample = words[::sample_rate]
|
|
|
|
total = len(words) * len(sample)
|
|
|
|
|
2022-10-02 08:24:00 +00:00
|
|
|
print("System:", get_platform())
|
|
|
|
print("Words :", len(words))
|
|
|
|
print("Sample:", len(sample))
|
|
|
|
print("Total : %s calls\n" % total)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
|
2022-10-02 04:58:07 +00:00
|
|
|
def wrap_cdist(scorer, processor):
|
|
|
|
from rapidfuzz.process import cdist
|
2022-10-02 08:24:00 +00:00
|
|
|
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
def func():
|
2022-10-02 04:58:07 +00:00
|
|
|
cdist(sample, words, scorer=scorer, processor=processor)
|
2022-10-02 08:24:00 +00:00
|
|
|
|
2022-10-02 04:58:07 +00:00
|
|
|
return func
|
|
|
|
|
|
|
|
def wrap_iterate(scorer, processor):
|
|
|
|
def func():
|
|
|
|
for query in sample:
|
|
|
|
for choice in words:
|
|
|
|
scorer(query, choice)
|
2022-10-02 08:24:00 +00:00
|
|
|
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
return func
|
|
|
|
|
|
|
|
fuzz = []
|
|
|
|
rfuzz = []
|
|
|
|
|
2022-10-02 08:24:00 +00:00
|
|
|
header_list = ["Function", "RapidFuzz", "FuzzyWuzzy", "SpeedImprovement"]
|
|
|
|
row_format = "{:>25}" * len(header_list)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
print(row_format.format(*header_list))
|
|
|
|
for target in LIBRARIES:
|
|
|
|
scorer = load_func("fuzzywuzzy.fuzz." + target)
|
2022-10-02 08:24:00 +00:00
|
|
|
sec = timeit(
|
|
|
|
"func()",
|
|
|
|
globals={"func": wrap_iterate(scorer, PROCESSOR[target])},
|
|
|
|
number=1,
|
|
|
|
)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
calls = total / sec
|
|
|
|
fuzz.append(calls)
|
|
|
|
|
|
|
|
rscorer = load_func("rapidfuzz.fuzz." + target)
|
2022-10-02 08:24:00 +00:00
|
|
|
rsec = timeit(
|
|
|
|
"func()", globals={"func": wrap_cdist(rscorer, PROCESSOR[target])}, number=1
|
|
|
|
)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
rcalls = total / rsec
|
|
|
|
rfuzz.append(rcalls)
|
|
|
|
|
2022-10-02 08:24:00 +00:00
|
|
|
print(
|
|
|
|
row_format.format(
|
|
|
|
target, f"{rcalls//1000}k", f"{calls//1000}k", f"{int(100 * sec/rsec)}%"
|
|
|
|
)
|
|
|
|
)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
|
|
|
|
labels = LIBRARIES
|
|
|
|
|
|
|
|
x = np.arange(len(labels)) # the label locations
|
|
|
|
width = 0.35 # the width of the bars
|
|
|
|
|
2022-10-02 08:24:00 +00:00
|
|
|
fig, ax = plt.subplots(figsize=(17, 10))
|
|
|
|
rects1 = ax.bar(x - width / 2, fuzz, width, label="FuzzyWuzzy", color="xkcd:coral")
|
|
|
|
rects2 = ax.bar(x + width / 2, rfuzz, width, label="RapidFuzz", color="#6495ED")
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
|
|
|
|
# Add some text for labels, title and custom x-axis tick labels, etc.
|
2022-10-02 08:24:00 +00:00
|
|
|
ax.set_ylabel("evaluated word pairs [inputs/s]")
|
|
|
|
ax.set_xlabel("Scorer")
|
|
|
|
ax.set_title(
|
|
|
|
"The number of word pairs evaluated per second\n(the larger the better)"
|
|
|
|
)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
ax.set_xticks(x)
|
2022-10-02 08:24:00 +00:00
|
|
|
ax.set_xticklabels(labels, rotation=30)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
ax.get_yaxis().set_major_formatter(
|
2022-10-02 08:24:00 +00:00
|
|
|
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ","))
|
|
|
|
)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
ax.legend()
|
|
|
|
|
|
|
|
def autolabel(rects):
|
|
|
|
"""Attach a text label above each bar in *rects*, displaying its height."""
|
|
|
|
for rect in rects:
|
|
|
|
height = rect.get_height()
|
2022-10-02 08:24:00 +00:00
|
|
|
ax.annotate(
|
|
|
|
f"{int(height):,}",
|
|
|
|
xy=(rect.get_x() + rect.get_width() / 2, height),
|
|
|
|
xytext=(0, 3), # 3 points vertical offset
|
|
|
|
textcoords="offset points",
|
|
|
|
ha="center",
|
|
|
|
va="bottom",
|
|
|
|
)
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
|
|
|
|
autolabel(rects1)
|
|
|
|
autolabel(rects2)
|
|
|
|
fig.tight_layout()
|
|
|
|
plt.show()
|
|
|
|
|
|
|
|
|
2022-10-02 08:24:00 +00:00
|
|
|
if __name__ == "__main__":
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
benchmark()
|