RapidFuzz/CHANGELOG.md at 9693b7da76784548d39f66dd1d5d7787b266d4cb

21 KiB

Raw Blame History

Changelog

[2.14.0] -

Fixed

improve handling of functions wrapped using functools.wraps

[2.13.2] - 2022-11-05

Fixed

fix incorrect results in Hamming.normalized_similarity
fix incorrect score_cutoff handling in pure python implementation of Postfix.normalized_distance and Prefix.normalized_distance
fix Levenshtein.normalized_similarity and Levenshtein.normalized_distance when used in combination with the process module
fuzz.partial_ratio was not always symmetric when len(s1) == len(s2)

[2.13.1] - 2022-11-02

Fixed

fix bug in normalized_similarity of most scorers, leading to incorrect results when used in combination with the process module
fix sse2 support
fix bug in JaroWinkler and Jaro when used in the pure python process module
forward kwargs in pure Python implementation of process.extract

[2.13.0] - 2022-10-30

Fixed

fix bug in Levenshtein.editops leading to crashes when used with score_hint

Changed

moved capi from rapidfuzz_capi into rapidfuzz, since it will always succeed the installation now that there is a pure Python mode
add score_hint argument to process module
add score_hint argument to Levenshtein module

[2.12.0] - 2022-10-24

Changed

drop support for Python 3.6

Added

added Prefix/Suffix similarity

Fixed

fixed packaging with pyinstaller

[2.11.1] - 2022-10-05

Fixed

Fix segmentation fault in process.cdist when used with an empty query sequence

[2.11.0] - 2022-10-02

Changes

move jarowinkler dependency into rapidfuzz to simplify maintenance

Performance

add SIMD implementation for fuzz.ratio/fuzz.QRatio/Levenshtein/Indel/LCSseq/OSA to improve performance for short strings in cdist

[2.10.3] - 2022-09-30

Fixed

use scikit-build=0.14.1 on Linux, since scikit-build=0.15.0 fails to find the Python Interpreter
workaround gcc in bug in template type deduction

[2.10.2] - 2022-09-27

Fixed

fix support for cmake versions below 3.17

[2.10.1] - 2022-09-25

Changed

modernize cmake build to fix most conda-forge builds

[2.10.0] - 2022-09-18

Added

add editops to hamming distance

Performance

strip common affix in osa distance

Fixed

ignore missing pandas in Python3.11 tests

[2.9.0] - 2022-09-16

Added

add optimal string alignment (OSA)

[2.8.0] - 2022-09-11

Fixed

fuzz.partial_ratio did not find the optimal alignment in some edge cases (#219)

Performance

improve performance of fuzz.partial_ratio

Changed

increased minimum C++ version to C++17 (see #255)

[2.7.0] - 2022-09-11

Performance

improve performance of Levenshtein.distance/Levenshtein.editops for long sequences.

Added

add score_hint parameter to Levenshtein.editops which allows the use of a faster implementation

Changed

all functions in the string_metric module do now raise a deprecation warning. They are now only wrappers for their replacement functions, which makes them slower when used with the process module

[2.6.1] - 2022-09-03

Fixed

fix incorrect results of partial_ratio for long needles (#257)

[2.6.0] - 2022-08-20

Fixed

fix hashing for custom classes

Added

add support for slicing in Editops.__getitem__/Editops.__delitem__
add DamerauLevenshtein module

[2.5.0] - 2022-08-14

Added

added support for KeyboardInterrupt in processor module It might still take a bit until the KeyboardInterrupt is registered, but no longer runs all text comparisons after pressing Ctrl + C

Fixed

fix default scorer used by cdist to use C++ implementation if possible

[2.4.4] - 2022-08-12

Changed

Added support for Python3.11

[2.4.3] - 2022-08-08

Fixed

fix value range of jaro_similarity/jaro_winkler_similarity in the pure Python mode for the string_metric module
fix missing atomix symbol on arm 32 bit

[2.4.2] - 2022-07-30

Fixed

add missing symbol to pure Python which made the usage impossible

[2.4.1] - 2022-07-29

Fixed

fix version number

[2.4.0] - 2022-07-29

Fixed

fix banded Levenshtein implementation

Performance

improve performance and memory usage of Levenshtein.editops
- memory usage is reduced from O(NM) to O(N)
- performance is improved for long sequences

[2.3.0] - 2022-07-23

Added

add as_matching_blocks to Editops/Opcodes
add support for deletions from Editops
add Editops.apply/Opcodes.apply
add Editops.remove_subsequence

Changed

merge adjacent similar blocks in Opcodes

Fixed

fix usage of eval(repr(Editop)), eval(repr(Editops)), eval(repr(Opcode)) and eval(repr(Opcodes))
fix opcode conversion for empty source sequence
fix validation for empty Opcode list passed into Opcodes.__init__

[2.2.0] - 2022-07-19

Changed

added in-tree build backend to install cmake and ninja only when it is not installed yet and only when wheels are available

[2.1.4] - 2022-07-17

Changed

changed internal implementation of cdist to remove build dependency to numpy

Added

added wheels for musllinux and manylinux ppc64le, s390x

[2.1.3] - 2022-07-09

Fixed

fix missing type stubs

[2.1.2] - 2022-07-04

Changed

change src layout to make package import from root directory possible

[2.1.1] - 2022-06-30

Changed

allow installation without the C++ extension if it fails to compile
allow selection of implementation via the environment variable RAPIDFUZZ_IMPLEMENTATION which can be set to "cpp" or "python"

[2.1.0] - 2022-06-29

Added

added pure python fallback for all implementations with the following exceptions:
- no support for sequences of hashables. Only strings supported so far
- *.editops / *.opcodes functions not implemented yet
- process.cdist does not support multithreading

Fixed

fuzz.partial_ratio_alignment ignored the score_cutoff
fix implementation of Hamming.normalized_similarity
fix default score_cutoff of Hamming.similarity
fix implementation of LCSseq.distance when used in the process module
treat hash for -1 and -2 as different

[2.0.15] - 2022-06-24

Fixed

fix integer wraparound in partial_ratio/partial_ratio_alignment

[2.0.14] - 2022-06-23

Fixed

fix unlimited recursion in LCSseq when used in combination with the process module

Changed

add fallback implementations of taskflow, rapidfuzz-cpp and jarowinkler-cpp back to wheel, since some package building systems like piwheels can't clone sources

[2.0.13] - 2022-06-22

Changed

use system version of cmake on arm platforms, since the cmake package fails to compile

[2.0.12] - 2022-06-22

Changed

add tests to sdist
remove cython dependency for sdist

[2.0.11] - 2022-04-23

Changed

relax version requirements of dependencies to simplify packaging

[2.0.10] - 2022-04-17

Fixed

Do not include installations of jaro_winkler in wheels (regression from 2.0.7)

Changed

Allow installation from system installed versions of rapidfuzz-cpp, jarowinkler-cpp and taskflow

Added

Added PyPy3.9 wheels on Linux

[2.0.9] - 2022-04-07

Fixed

Add missing Cython code in sdist
consider float imprecision in score_cutoff (see #210)

[2.0.8] - 2022-04-07

Fixed

fix incorrect score_cutoff handling in token_set_ratio and token_ratio

Added

add longest common subsequence

[2.0.7] - 2022-03-13

Fixed

Do not include installations of jaro_winkler and taskflow in wheels

[2.0.6] - 2022-03-06

Fixed

fix incorrect population of sys.modules which lead to submodules overshadowing other imports

Changed

moved JaroWinkler and Jaro into a separate package

[2.0.5] - 2022-02-25

Fixed

fix signed integer overflow inside hashmap implementation

[2.0.4] - 2022-02-21

Fixed

fix binary size increase due to debug symbols
fix segmentation fault in Levenshtein.editops

[2.0.3] - 2022-02-18

Added

Added fuzz.partial_ratio_alignment, which returns the result of fuzz.partial_ratio combined with the alignment this result stems from

Fixed

Fix Indel distance returning incorrect result when using score_cutoff=1, when the strings are not equal. This affected other scorers like fuzz.WRatio, which use the Indel distance as well.

[2.0.2] - 2022-02-12

Fixed

fix type hints
Add back transpiled cython files to the sdist to simplify builds in package builders like FreeBSD port build or conda-forge

[2.0.1] - 2022-02-11

Fixed

fix type hints
Indel.normalized_similarity mistakenly used the implementation of Indel.normalized_distance

[2.0.0] - 2022-02-09

Added

added C-Api which can be used to extend RapidFuzz from different Python modules using any programming language which allows the usage of C-Apis (C/C++/Rust)
added new scorers in rapidfuzz.distance.*
- port existing distances to this new api
- add Indel distance along with the corresponding editops function

Changed

when the result of string_metric.levenshtein or string_metric.hamming is below max they do now return max + 1 instead of -1
Build system moved from setuptools to scikit-build
Stop including all modules in __init__.py, since they significantly slowed down import time

Removed

remove the rapidfuzz.levenshtein module which was deprecated in v1.0.0 and scheduled for removal in v2.0.0
dropped support for Python2.7 and Python3.5

Deprecated

deprecate support to specify processor in form of a boolean (will be removed in v3.0.0)
- new functions will not get support for this in the first place
deprecate rapidfuzz.string_metric (will be removed in v3.0.0). Similar scorers are available in rapidfuzz.distance.*

Fixed

process.cdist did raise an exception when used with a pure python scorer

Performance

improve performance and memory usage of rapidfuzz.string_metric.levenshtein_editops
- memory usage is reduced by 33%
- performance is improved by around 10%-20%
significantly improve performance of rapidfuzz.string_metric.levenshtein for max <= 31 using a banded implementation

[1.9.1] - 2021-12-13

Fixed

fix bug in new editops implementation, causing it to SegFault on some inputs (see qurator-spk/dinglehopper#64)

[1.9.0] - 2021-12-11

Fixed

Fix some issues in the type annotations (see #163)

Performance

improve performance and memory usage of rapidfuzz.string_metric.levenshtein_editops
- memory usage is reduced by 10x
- performance is improved from O(N * M) to O([N / 64] * M)

[1.8.3] - 2021-11-19

Added

Added missing wheels for Python3.6 on MacOs and Windows (see #159)

[1.8.2] - 2021-10-27

Added

Add wheels for Python 3.10 on MacOs

[1.8.1] - 2021-10-22

Fixed

Fix incorrect editops results (See #148)

[1.8.0] - 2021-10-20

Changed

Add Wheels for Python3.10 on all platforms except MacOs (see #141)
Improve performance of string_metric.jaro_similarity and string_metric.jaro_winkler_similarity for strings with a length <= 64

[1.7.1] - 2021-10-02

Fixed

fixed incorrect results of fuzz.partial_ratio for long needles (see #138)

[1.7.0] - 2021-09-27

Changed

Added typing for process.cdist
Added multithreading support to cdist using the argument process.cdist
Add dtype argument to process.cdist to set the dtype of the result numpy array (see #132)
Use a better hash collision strategy in the internal hashmap, which improves the worst case performance

[1.6.2] - 2021-09-15

Changed

improved performance of fuzz.ratio
only import process.cdist when numpy is available

[1.6.1] - 2021-09-11

Changed

Add back wheels for Python2.7

[1.6.0] - 2021-09-10

Changed

fuzz.partial_ratio uses a new implementation for short needles (<= 64). This implementation is
- more accurate than the current implementation (it is guaranteed to find the optimal alignment)
- it is significantly faster
Add process.cdist to compare all elements of two lists (see #51)

[1.5.1] - 2021-09-01

Fixed

Fix out of bounds access in levenshtein_editops

[1.5.0] - 2021-08-21

Changed

all scorers do now support similarity/distance calculations between any sequence of hashables. So it is possible to calculate e.g. the WER as:

>>> string_metric.levenshtein(["word1", "word2"], ["word1", "word3"])
1

Added

Added type stub files for all functions
added jaro similarity in string_metric.jaro_similarity
added jaro winkler similarity in string_metric.jaro_winkler_similarity
added Levenshtein editops in string_metric.levenshtein_editops

Fixed

Fixed support for set objects in process.extract
Fixed inconsistent handling of empty strings

[1.4.1] - 2021-03-30

Performance

improved performance of result creation in process.extract

Fixed

Cython ABI stability issue (#95)
fix missing decref in case of exceptions in process.extract

[1.4.0] - 2021-03-29

Changed

added processor support to levenshtein and hamming
added distance support to extract/extractOne/extract_iter

Fixed

incorrect results of normalized_hamming and normalized_levenshtein when used with utils.default_process as processor

[1.3.3] - 2021-03-20

Fixed

Fix a bug in the mbleven implementation of the uniform Levenshtein distance and cover it with fuzz tests

[1.3.2] - 2021-03-20

Fixed

some of the newly activated warnings caused build failures in the conda-forge build

[1.3.1] - 2021-03-20

Fixed

Fixed issue in LCS calculation for partial_ratio (see #90)
Fixed incorrect results for normalized_hamming and normalized_levenshtein when the processor utils.default_process is used
Fix many compiler warnings

[1.3.0] - 2021-03-16

Changed

add wheels for a lot of new platforms
drop support for Python 2.7

Performance

use is instead of == to compare functions directly by address

Fixed

Fix another ref counting issue
Fix some issues in the Levenshtein distance algorithm (see #92)

[1.2.1] - 2021-03-08

Performance

further improve bitparallel implementation of uniform Levenshtein distance for strings with a length > 64 (in many cases more than 50% faster)

[1.2.0] - 2021-03-07

Changed

add more benchmarks to documentation

Performance

add bitparallel implementation to InDel Distance (Levenshtein with the weights 1,1,2) for strings with a length > 64
improve bitparallel implementation of uniform Levenshtein distance for strings with a length > 64
use the InDel Distance and uniform Levenshtein distance in more cases instead of the generic implementation
Directly use the Levenshtein implementation in C++ instead of using it through Python in process.*

[1.1.2] - 2021-03-03

Fixed

Fix reference counting in process.extract (see #81)

[1.1.1] - 2021-02-23

Fixed

Fix result conversion in process.extract (see #79)

[1.1.0] - 2021-02-21

Changed

string_metric.normalized_levenshtein supports now all weights
when different weights are used for Insertion and Deletion the strings are not swapped inside the Levenshtein implementation anymore. So different weights for Insertion and Deletion are now supported.
replace C++ implementation with a Cython implementation. This has the following advantages:
- The implementation is less error prone, since a lot of the complex things are done by Cython
- slightly faster than the current implementation (up to 10% for some parts)
- about 33% smaller binary size
- reduced compile time
Added **kwargs argument to process.extract/extractOne/extract_iter that is passed to the scorer
Add max argument to hamming distance
Add support for whole Unicode range to utils.default_process

Performance

replaced Wagner Fischer usage in the normal Levenshtein distance with a bitparallel implementation

[1.0.2] - 2021-02-19

Fixed

The bitparallel LCS algorithm in fuzz.partial_ratio did not find the longest common substring properly in some cases. The old algorithm is used again until this bug is fixed.

[1.0.1] - 2021-02-17

Changed

string_metric.normalized_levenshtein supports now the weights (1, 1, N) with N >= 1

Performance

The Levenshtein distance with the weights (1, 1, >2) do now use the same implementation as the weight (1, 1, 2), since Substitution > Insertion + Deletion has no effect

Fixed

fix uninitialized variable in bitparallel Levenshtein distance with the weight (1, 1, 1)

[1.0.0] - 2021-02-12

Changed

all normalized string_metrics can now be used as scorer for process.extract/extractOne
Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
improved docstrings of functions

Performance

Added bit-parallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bit-parallel implementation.
Improved performance of fuzz.partial_ratio -> Since fuzz.ratio and fuzz.partial_ratio are used in most scorers, this improves the overall performance.
Improved performance of process.extract and process.extractOne

Deprecated

the rapidfuzz.levenshtein module is now deprecated and will be removed in v2.0.0 These functions are now placed in rapidfuzz.string_metric. distance, normalized_distance, weighted_distance and weighted_normalized_distance are combined into levenshtein and normalized_levenshtein.

Added

added normalized version of the hamming distance in string_metric.normalized_hamming
process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff

Fixed

multiple bugs in extractOne when used with a scorer, that's not from RapidFuzz
fixed bug in token_ratio
fixed bug in result normalization causing zero division

[0.14.2] - 2020-12-31

Fixed

utf8 usage in the copyright header caused problems with python2.7 on some platforms (see #70)

[0.14.1] - 2020-12-13

Fixed

when a custom processor like lambda s: s was used with any of the methods inside fuzz.* it always returned a score of 100. This release fixes this and adds a better test coverage to prevent this bug in the future.

[0.14.0] - 2020-12-09

Added

added hamming distance metric in the levenshtein module

Performance

improved performance of default_process by using lookup table

[0.13.4] - 2020-11-30

Fixed

Add missing virtual destructor that caused a segmentation fault on Mac Os

[0.13.3] - 2020-11-21

Added

C++11 Support
manylinux wheels

[0.13.2] - 2020-11-21

Fixed

Levenshtein was not imported from __init__
The reference count of a Python Object inside process.extractOne was decremented to early

[0.13.1] - 2020-11-17

Performance

process.extractOne exits early when a score of 100 is found. This way the other strings do not have to be preprocessed anymore.

[0.13.0] - 2020-11-16

Fixed

string objects passed to scorers had to be strings even before preprocessing them. This was changed, so they only have to be strings after preprocessing similar to process.extract/process.extractOne

Performance

process.extractOne is now implemented in C++ making it a lot faster
When token_sort_ratio or partial_token_sort ratio is used inprocess.extractOne the words in the query are only sorted once to improve the runtime

Changed

process.extractOne/process.extract do now return the index of the match, when the choices are a list.

Removed

process.extractIndices got removed, since the indices are now already returned by process.extractOne/process.extract

[0.12.5] - 2020-10-26

Fixed

fix documentation of process.extractOne (see #48)

[0.12.4] - 2020-10-22

Added

Added wheels for
- CPython 2.7 on windows 64 bit
- CPython 2.7 on windows 32 bit
- PyPy 2.7 on windows 32 bit

[0.12.3] - 2020-10-09

Fixed

fix bug in partial_ratio (see #43)

[0.12.2] - 2020-10-01

Fixed

fix inconsistency with fuzzywuzzy in partial_ratio when using strings of equal length

[0.12.1] - 2020-09-30

Fixed

MSVC has a bug and therefore crashed on some of the templates used. This Release simplifies the templates so compiling on msvc works again

[0.12.0] - 2020-09-30

Performance

partial_ratio is using the Levenshtein distance now, which is a lot faster. Since many of the other algorithms use partial_ratio, this helps to improve the overall performance

[0.11.3] - 2020-09-22

Fixed

fix partial_token_set_ratio returning 100 all the time

[0.11.2] - 2020-09-12

Added

added rapidfuzz.__author__, rapidfuzz.__license__ and rapidfuzz.__version__

[0.11.1] - 2020-09-01

Fixed

do not use auto junk when searching the optimal alignment for partial_ratio

[0.11.0] - 2020-08-22

Changed

support for python 2.7 added #40
add wheels for python2.7 (both pypy and cpython) on MacOS and Linux

[0.10.0] - 2020-08-17

Changed

added wheels for Python3.9

Fixed

tuple scores in process.extractOne are now supported #39

21 KiB Raw Blame History

Changelog

[2.14.0] -

Fixed

[2.13.2] - 2022-11-05

Fixed

[2.13.1] - 2022-11-02

Fixed

[2.13.0] - 2022-10-30

Fixed

Changed

[2.12.0] - 2022-10-24

Changed

Added

Fixed

[2.11.1] - 2022-10-05

Fixed

[2.11.0] - 2022-10-02

Changes

Performance

[2.10.3] - 2022-09-30

Fixed

[2.10.2] - 2022-09-27

Fixed

[2.10.1] - 2022-09-25

Changed

[2.10.0] - 2022-09-18

Added

Performance

Fixed

[2.9.0] - 2022-09-16

Added

[2.8.0] - 2022-09-11

Fixed

Performance

Changed

[2.7.0] - 2022-09-11

Performance

Added

Changed

[2.6.1] - 2022-09-03

Fixed

[2.6.0] - 2022-08-20

Fixed

Added

[2.5.0] - 2022-08-14

Added

Fixed

[2.4.4] - 2022-08-12

Changed

[2.4.3] - 2022-08-08

Fixed

[2.4.2] - 2022-07-30

Fixed

[2.4.1] - 2022-07-29

Fixed

[2.4.0] - 2022-07-29

Fixed

Performance

[2.3.0] - 2022-07-23

Added

Changed

Fixed

[2.2.0] - 2022-07-19

Changed

[2.1.4] - 2022-07-17

Changed

Added

[2.1.3] - 2022-07-09

Fixed

[2.1.2] - 2022-07-04

Changed

[2.1.1] - 2022-06-30

Changed

[2.1.0] - 2022-06-29

Added

Fixed

[2.0.15] - 2022-06-24

Fixed

[2.0.14] - 2022-06-23

21 KiB

Raw Blame History