2020-08-22 19:07:08 +00:00
|
|
|
#!/usr/bin/env python
|
|
|
|
# -*- coding: utf-8 -*-
|
|
|
|
|
2020-05-24 07:57:08 +00:00
|
|
|
import unittest
|
|
|
|
|
|
|
|
from rapidfuzz import process, fuzz, utils
|
|
|
|
|
|
|
|
class UtilsTest(unittest.TestCase):
|
|
|
|
def test_fullProcess(self):
|
|
|
|
mixed_strings = [
|
|
|
|
"Lorem Ipsum is simply dummy text of the printing and typesetting industry.",
|
|
|
|
"C'est la vie",
|
2020-08-22 19:07:08 +00:00
|
|
|
u"Ça va?",
|
|
|
|
u"Cães danados",
|
|
|
|
u"¬Camarões assados",
|
|
|
|
u"a¬ሴ€耀",
|
|
|
|
u"Á"
|
2020-05-24 07:57:08 +00:00
|
|
|
]
|
|
|
|
mixed_strings_proc = [
|
|
|
|
"lorem ipsum is simply dummy text of the printing and typesetting industry",
|
|
|
|
"c est la vie",
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
u"ça va",
|
2020-08-22 19:07:08 +00:00
|
|
|
u"cães danados",
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
u"camarões assados",
|
2021-02-21 18:42:36 +00:00
|
|
|
u"a ሴ 耀",
|
Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
- improved docstrings of functions
- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`
- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.
- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
- multiple bugs in extractOne when used with a scorer, thats not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 15:37:44 +00:00
|
|
|
u"á"
|
2020-05-24 07:57:08 +00:00
|
|
|
]
|
|
|
|
|
|
|
|
for string, proc_string in zip(mixed_strings, mixed_strings_proc):
|
|
|
|
self.assertEqual(
|
|
|
|
utils.default_process(string),
|
|
|
|
proc_string)
|
|
|
|
|
|
|
|
if __name__ == '__main__':
|
|
|
|
unittest.main()
|
|
|
|
|