Commit Graph

79 Commits

Author SHA1 Message Date
Max Bachmann 1d10dbc56a Fix Cython ABI stability 2021-03-30 00:14:55 +02:00
Max Bachmann 05f907bf2b
add distance support to process.*
## Changed
- added processor support to `levenshtein` and `hamming`
- added distance support to extract/extractOne/extract_iter

## Fixes
- incorrect results of `normalized_hamming` and `normalized_levenshtein` when used with `utils.default_process` as processor
2021-03-29 19:09:22 +02:00
Max Bachmann 853681f7cf fix bug in mbleven implementation 2021-03-20 12:04:12 +01:00
Max Bachmann 0d84a8b933 ignore some compiler warnings for cython 2021-03-20 06:35:35 +01:00
Max Bachmann 90cc67be00 fix bug in LCS implementation 2021-03-20 03:46:02 +01:00
Max Bachmann c31a2d96b5 fix some typos in normalized Levenshtein distance 2021-03-16 02:30:33 +01:00
Max Bachmann e124f4f32e improve performance of Levenshtein distance 2021-03-08 01:15:13 +01:00
Max Bachmann 53b8e3bd61 update build mechanism 2021-03-07 17:45:24 +01:00
Max Bachmann c6eebb70a5 fix incorrect ref counting 2021-03-03 16:08:42 +01:00
Max Bachmann e8102a4e87 Fix result conversion process.extract 2021-02-23 14:58:44 +01:00
Max Bachmann 5383d286b2
Release v1.1.0 (#75)
## Changed
- string_metric.normalized_levenshtein now supports all weights
- when different weights are used for Insertion and Deletion, the strings are no longer swapped inside the Levenshtein implementation, so different weights for Insertion and Deletion are now supported.
- replace C++ implementation with a Cython implementation. This has the following advantages:
  - The implementation is less error prone, since a lot of the complex things are done by Cython
  - slightly faster than the current implementation (up to 10% for some parts)
  - about 33% smaller binary size
  - reduced compile time
- Added `**kwargs` argument to process.extract/extractOne/extract_iter that is passed to the scorer
- Add max argument to hamming distance
- Add support for whole Unicode range to utils.default_process

## Performance
- replaced the Wagner-Fischer algorithm in the normal Levenshtein distance with a bitparallel implementation
2021-02-21 19:42:36 +01:00
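The v1.1.0 notes above mention two ideas that a small example may make concrete: insertion and deletion weights make the distance depend on the order of the strings, and the uniform-weight distance can be computed with bit operations. The following is a minimal Python sketch, not the project's Cython or C++ code. The function names are hypothetical, and the bit-vector recurrence is the widely published Myers/Hyyrö formulation, assumed here for illustration rather than confirmed by the commit.

```python
def weighted_levenshtein(s1, s2, weights=(1, 1, 1)):
    """Wagner-Fischer DP; weights = (insertion, deletion, substitution)."""
    ins, dele, sub = weights
    # prev[j] = cost of turning the processed prefix of s1 into s2[:j]
    prev = [j * ins for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        cur = [i * dele]
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(
                prev[j] + dele,                          # delete c1
                cur[j - 1] + ins,                        # insert c2
                prev[j - 1] + (0 if c1 == c2 else sub),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def bitparallel_levenshtein(s1, s2):
    """Uniform-weight (1, 1, 1) distance using Python ints as bit vectors."""
    m = len(s1)
    if m == 0:
        return len(s2)
    # One bit mask per character: bit i is set where s1[i] == character.
    pattern = {}
    for i, ch in enumerate(s1):
        pattern[ch] = pattern.get(ch, 0) | (1 << i)
    mask = (1 << m) - 1   # keep every vector at exactly m bits
    last = 1 << (m - 1)   # bit that corresponds to the last row
    vp, vn, dist = mask, 0, m
    for ch in s2:
        pm = pattern.get(ch, 0)
        d0 = ((((pm & vp) + vp) ^ vp) | pm | vn) & mask
        hp = (vn | ~(d0 | vp)) & mask
        hn = d0 & vp
        if hp & last:
            dist += 1
        if hn & last:
            dist -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return dist
```

With weights `(1, 3, 4)`, `weighted_levenshtein("a", "ab", (1, 3, 4))` costs one insertion while `weighted_levenshtein("ab", "a", (1, 3, 4))` costs one deletion, so swapping the arguments changes the result. For uniform weights both functions should agree.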
Max Bachmann 88a86a1028 deactivate bitparallel LCS
The algorithm to find the longest common subsequence after calculating it in bitparallel
appears to have a bug. Deactivate it until this bug is fixed
2021-02-19 15:20:31 +01:00
Max Bachmann 7139004214 fix uninitialized variable 2021-02-17 23:08:56 +01:00
Max Bachmann 375c13e436 Release v1.0.0 (#68)
- all normalized string_metrics can now be used as scorer for process.extract/extractOne
- Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
- increased test coverage, which already helped to fix some bugs and will help to prevent regressions in the future
- improved docstrings of functions

- Added bitparallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
- Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, which is even faster than the bitparallel implementation.
- Improved performance of `fuzz.partial_ratio`
-> Since `fuzz.ratio` and `fuzz.partial_ratio` are used in most scorers, this improves the overall performance.
- Improved performance of `process.extract` and `process.extractOne`

- the `rapidfuzz.levenshtein` module is now deprecated and will be removed in v2.0.0
  These functions are now placed in `rapidfuzz.string_metric`. `distance`, `normalized_distance`, `weighted_distance` and `weighted_normalized_distance` are combined into `levenshtein` and `normalized_levenshtein`.

- added normalized version of the hamming distance in `string_metric.normalized_hamming`
- added process.extract_iter, a generator that yields the similarity of all elements that have a similarity >= score_cutoff

- fixed multiple bugs in extractOne when used with a scorer that is not from RapidFuzz
- fixed bug in `token_ratio`
- fixed bug in result normalisation causing zero division
2021-02-12 16:48:10 +01:00
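The generator behaviour described for `process.extract_iter` in the v1.0.0 notes above can be illustrated with a short pure-Python sketch. This is not the library's implementation: `extract_iter_sketch` and `ratio_sketch` are hypothetical names, the `(choice, score, index)` tuple shape is an assumption, and `difflib.SequenceMatcher` stands in for a real scorer.

```python
from difflib import SequenceMatcher


def ratio_sketch(s1, s2):
    """Stand-in scorer on a 0-100 scale (not a RapidFuzz scorer)."""
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


def extract_iter_sketch(query, choices, scorer=ratio_sketch, score_cutoff=0):
    """Lazily yield (choice, score, index) for choices scoring >= score_cutoff."""
    for index, choice in enumerate(choices):
        score = scorer(query, choice)
        if score >= score_cutoff:
            yield choice, score, index


# Usage: results are produced one at a time, so the caller can stop early.
matches = list(extract_iter_sketch("apple", ["apple", "apply", "zzz"], score_cutoff=75))
```

Because results are yielded lazily, a caller that only needs the first acceptable match can stop iterating without scoring the remaining choices.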
Max Bachmann 2f4d0ed957 increment version number 2020-12-31 01:25:24 +01:00
Max Bachmann cc5fa23c32 fix custom processors in fuzz.* 2020-12-13 16:55:45 +01:00
Max Bachmann fa17ff98e8 Add hamming distance metric
Co-authored-by: jleem <jinwoo.leem@gmail.com>
2020-12-09 01:13:21 +01:00
Max Bachmann fb6824e849 increase performance of default_process
use lookup tables for the string conversions instead of many
branches, which greatly increases the performance
2020-12-05 16:10:43 +01:00
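The lookup-table idea from the commit above can be sketched in a few lines of Python. This is only an illustration of the technique, assuming that `default_process` lowercases text, replaces non-alphanumeric characters with spaces, and trims surrounding whitespace. The real change was made in the native code, and the names `_ASCII_TABLE` and `default_process_sketch` are hypothetical.

```python
# Built once at import time: every ASCII code point maps directly to its
# replacement, so processing a character is one table lookup instead of a
# chain of "is it a letter / digit / punctuation" branches.
_ASCII_TABLE = {
    code: (chr(code).lower() if chr(code).isalnum() else " ")
    for code in range(128)
}


def default_process_sketch(text):
    """Lowercase, blank out non-alphanumeric ASCII, and trim whitespace."""
    # Characters outside the table (non-ASCII) pass through unchanged here.
    return text.translate(_ASCII_TABLE).strip()
```

The table costs a little memory up front, and in exchange the per-character work no longer depends on how many character classes need to be distinguished.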
Max Bachmann 8f9a61e8c0
Add virtual destructor (see #65) 2020-11-30 18:19:17 +01:00
Max Bachmann 67b02ff967 add C++11 support 2020-11-21 18:25:47 +01:00
Max Bachmann b224dc27d9 fix wrong reference counting
The reference count was decreased too early
2020-11-21 10:16:30 +01:00
Max Bachmann 316303d858
Exit early when a score of 100 is found (#56) 2020-11-17 17:48:59 +01:00
Max Bachmann 426fbb24e9
implement process.extractOne in C++ (#53)
* start to simplify complexion

* start implementation

* add extractOne to C++

* fix a couple of bugs in the implementation

* start addressing performance issues
2020-11-15 20:18:46 +01:00
Max Bachmann b3af7641a4 fix documentation of process.extractOne 2020-10-26 18:46:56 +01:00
Max Bachmann 9b64ad2fee
add wheels for Python2.7 on Windows (#47) 2020-10-22 05:54:39 +02:00
Max Bachmann 06d4484d8a
increment Version 2020-10-09 10:06:01 +02:00
maxbachmann 865fbf0d8a
fix inconsistency towards fuzzywuzzy 2020-10-01 22:42:58 +02:00
maxbachmann 82e77dbb41
reduce template complexity for msvc 2020-09-30 18:02:34 +02:00
maxbachmann 789941dc40 replace difflib 2020-09-29 00:18:24 +02:00
maxbachmann 13a828ce1b
fix partial_token_set_ratio returning 100 all the time 2020-09-22 18:24:34 +02:00
maxbachmann 588f73c2ef
add version, author and license to __init__.py 2020-09-12 18:29:16 +02:00
maxbachmann 6efaf59dc1
do not auto junk in partial ratio 2020-09-01 02:46:38 +02:00
maxbachmann 10946dfac0 add python 2.7 support 2020-08-22 23:06:05 +02:00
maxbachmann 3d2cfe8b4a update rapidfuzz-cpp and add support for tuple scores in processors 2020-08-14 11:45:07 +02:00
maxbachmann eae941a647
further reduce tarball size 2020-06-27 12:49:32 +02:00
maxbachmann 293caa1242
fix inconsistency from #32 2020-05-24 08:19:28 +02:00
maxbachmann 3137df9e96
remove boost::optional dependency 2020-05-22 14:38:13 +02:00
maxbachmann f0f8247d02
allow any object with items 2020-05-21 08:39:13 +02:00
maxbachmann a4bfbeb2f5
exit early when exact match was found 2020-05-19 18:31:24 +02:00
maxbachmann 46cf20aa4e
remove intermediate python function to improve performance 2020-05-12 08:56:28 +02:00
maxbachmann 3e7c410c44
use common interface for all fuzzy ratios 2020-05-07 14:40:37 +02:00
maxbachmann d5995e2f18
help pylint find members better 2020-05-07 11:07:22 +02:00
maxbachmann 3121457e42
exit early in token_sort_ratio 2020-04-29 13:32:11 +02:00
maxbachmann 8f596ae7c7
manually check punctuation for #29 2020-04-28 09:27:56 +02:00
maxbachmann 96b5660720
increment version 2020-04-24 20:31:43 +02:00
maxbachmann 7fa6c88a3d
support choice: match_choice dict 2020-04-22 19:00:46 +02:00
maxbachmann 438be93cb6
fix #25 2020-04-16 10:13:01 +02:00
maxbachmann e4006839fc
add missing files to tarball 2020-04-15 23:17:35 +02:00
maxbachmann 66fbf1f574
increment version 2020-04-13 10:31:07 +02:00
maxbachmann cc872fdd3a
sort by largest 2020-04-09 16:26:08 +02:00