Changelog

[2.14.0] -

Fixed

  • improve handling of functions wrapped using functools.wraps
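
As background for the entry above, a small sketch of what `functools.wraps` does to a wrapped function. The entry does not say what exactly was improved, so this only illustrates the standard-library behaviour that a library inspecting scorer functions has to account for. The names `logged`, `my_scorer` and `custom_attribute` are hypothetical.

```python
import functools


def logged(func):
    """Hypothetical decorator, used only for illustration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def my_scorer(s1, s2, **kwargs):
    """Toy scorer: 100 for equal inputs, 0 otherwise."""
    return 100 if s1 == s2 else 0


# Attributes set on the original function are copied to the wrapper,
# because functools.wraps updates the wrapper's __dict__.
my_scorer.custom_attribute = "marker"
wrapped = logged(my_scorer)

print(wrapped.__name__)              # name of the original function
print(wrapped.__wrapped__ is my_scorer)
print(wrapped.custom_attribute)
```

The wrapper is a different function object that carries the original's metadata and attributes, so code that identifies functions by their attributes can mistake it for the original.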

[2.13.2] - 2022-11-05

Fixed

  • fix incorrect results in Hamming.normalized_similarity
  • fix incorrect score_cutoff handling in pure python implementation of Postfix.normalized_distance and Prefix.normalized_distance
  • fix Levenshtein.normalized_similarity and Levenshtein.normalized_distance when used in combination with the process module
  • fuzz.partial_ratio was not always symmetric when len(s1) == len(s2)

[2.13.1] - 2022-11-02

Fixed

  • fix bug in normalized_similarity of most scorers, leading to incorrect results when used in combination with the process module
  • fix sse2 support
  • fix bug in JaroWinkler and Jaro when used in the pure python process module
  • forward kwargs in pure Python implementation of process.extract

[2.13.0] - 2022-10-30

Fixed

  • fix bug in Levenshtein.editops leading to crashes when used with score_hint

Changed

  • moved capi from rapidfuzz_capi into rapidfuzz, since installation now always succeeds thanks to the pure Python mode
  • add score_hint argument to process module
  • add score_hint argument to Levenshtein module

[2.12.0] - 2022-10-24

Changed

  • drop support for Python 3.6

Added

  • added Prefix/Suffix similarity
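
A minimal pure-Python sketch of the idea behind a prefix/suffix similarity, assuming the similarity is the length of the common prefix (or suffix) of the two sequences. The helper names are hypothetical and this is not the library's implementation.

```python
def common_prefix_len(s1, s2):
    """Number of leading elements shared by both sequences."""
    n = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        n += 1
    return n


def common_suffix_len(s1, s2):
    """Number of trailing elements shared by both sequences."""
    return common_prefix_len(s1[::-1], s2[::-1])


print(common_prefix_len("prefix", "preview"))
print(common_suffix_len("testing", "resting"))
```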

Fixed

  • fixed packaging with pyinstaller

[2.11.1] - 2022-10-05

Fixed

  • Fix segmentation fault in process.cdist when used with an empty query sequence

[2.11.0] - 2022-10-02

Changed

  • move jarowinkler dependency into rapidfuzz to simplify maintenance

Performance

  • add SIMD implementation for fuzz.ratio/fuzz.QRatio/Levenshtein/Indel/LCSseq/OSA to improve performance for short strings in cdist

[2.10.3] - 2022-09-30

Fixed

  • use scikit-build=0.14.1 on Linux, since scikit-build=0.15.0 fails to find the Python Interpreter
  • work around gcc bug in template type deduction

[2.10.2] - 2022-09-27

Fixed

  • fix support for cmake versions below 3.17

[2.10.1] - 2022-09-25

Changed

  • modernize cmake build to fix most conda-forge builds

[2.10.0] - 2022-09-18

Added

  • add editops to hamming distance

Performance

  • strip common affix in osa distance

Fixed

  • ignore missing pandas in Python3.11 tests

[2.9.0] - 2022-09-16

Added

  • add optimal string alignment (OSA)
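
For readers unfamiliar with the metric, a textbook dynamic-programming sketch of the optimal string alignment distance. It extends the Levenshtein recurrence with a transposition of two adjacent elements, where no substring may be edited more than once. This is an illustration only, not the optimized implementation shipped in the library.

```python
def osa_distance(s1, s2):
    """Optimal string alignment distance via the classic O(N*M) table."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # transposition of two adjacent elements
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))   # one transposition
print(osa_distance("CA", "ABC"))  # OSA cannot reuse the transposed pair
```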

[2.8.0] - 2022-09-11

Fixed

  • fuzz.partial_ratio did not find the optimal alignment in some edge cases (#219)

Performance

  • improve performance of fuzz.partial_ratio

Changed

  • increased minimum C++ version to C++17 (see #255)

[2.7.0] - 2022-09-11

Performance

  • improve performance of Levenshtein.distance/Levenshtein.editops for long sequences.

Added

  • add score_hint parameter to Levenshtein.editops which allows the use of a faster implementation

Changed

  • all functions in the string_metric module now raise a deprecation warning. They are now only wrappers around their replacement functions, which makes them slower when used with the process module

[2.6.1] - 2022-09-03

Fixed

  • fix incorrect results of partial_ratio for long needles (#257)

[2.6.0] - 2022-08-20

Fixed

  • fix hashing for custom classes

Added

  • add support for slicing in Editops.__getitem__/Editops.__delitem__
  • add DamerauLevenshtein module

[2.5.0] - 2022-08-14

Added

  • added support for KeyboardInterrupt in the process module. It might still take a moment until the KeyboardInterrupt is registered, but the remaining text comparisons are no longer run after pressing Ctrl + C

Fixed

  • fix default scorer used by cdist to use C++ implementation if possible

[2.4.4] - 2022-08-12

Changed

  • Added support for Python3.11

[2.4.3] - 2022-08-08

Fixed

  • fix value range of jaro_similarity/jaro_winkler_similarity in the pure Python mode for the string_metric module
  • fix missing atomic symbol on arm 32 bit

[2.4.2] - 2022-07-30

Fixed

  • add missing symbol to pure Python which made the usage impossible

[2.4.1] - 2022-07-29

Fixed

  • fix version number

[2.4.0] - 2022-07-29

Fixed

  • fix banded Levenshtein implementation

Performance

  • improve performance and memory usage of Levenshtein.editops
    • memory usage is reduced from O(NM) to O(N)
    • performance is improved for long sequences

[2.3.0] - 2022-07-23

Added

  • add as_matching_blocks to Editops/Opcodes
  • add support for deletions from Editops
  • add Editops.apply/Opcodes.apply
  • add Editops.remove_subsequence

Changed

  • merge adjacent similar blocks in Opcodes

Fixed

  • fix usage of eval(repr(Editop)), eval(repr(Editops)), eval(repr(Opcode)) and eval(repr(Opcodes))
  • fix opcode conversion for empty source sequence
  • fix validation for empty Opcode list passed into Opcodes.__init__

[2.2.0] - 2022-07-19

Changed

  • added in-tree build backend to install cmake and ninja only when it is not installed yet and only when wheels are available

[2.1.4] - 2022-07-17

Changed

  • changed internal implementation of cdist to remove the build dependency on numpy

Added

  • added wheels for musllinux and manylinux ppc64le, s390x

[2.1.3] - 2022-07-09

Fixed

  • fix missing type stubs

[2.1.2] - 2022-07-04

Changed

  • change src layout to make package import from root directory possible

[2.1.1] - 2022-06-30

Changed

  • allow installation without the C++ extension if it fails to compile
  • allow selection of implementation via the environment variable RAPIDFUZZ_IMPLEMENTATION which can be set to "cpp" or "python"
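
A short sketch of how the environment variable might be used from Python. It assumes the variable is read when the package is first imported, so it is set before the import. The `try`/`except` is only there so the snippet also runs where rapidfuzz is not installed.

```python
import os

# Select the pure Python implementation; "cpp" would request the C++ extension.
# Assumption: the variable must be set before rapidfuzz is imported.
os.environ["RAPIDFUZZ_IMPLEMENTATION"] = "python"

try:
    from rapidfuzz import fuzz
    score = fuzz.ratio("this is a test", "this is a test!")
except ImportError:
    score = None  # rapidfuzz is not available in this environment

print(score)
```

Setting the variable in the shell before starting the interpreter has the same effect and avoids any ordering issues with imports.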

[2.1.0] - 2022-06-29

Added

  • added pure python fallback for all implementations with the following exceptions:
    • no support for sequences of hashables. Only strings supported so far
    • *.editops / *.opcodes functions not implemented yet
    • process.cdist does not support multithreading

Fixed

  • fuzz.partial_ratio_alignment ignored the score_cutoff
  • fix implementation of Hamming.normalized_similarity
  • fix default score_cutoff of Hamming.similarity
  • fix implementation of LCSseq.distance when used in the process module
  • treat hash for -1 and -2 as different

[2.0.15] - 2022-06-24

Fixed

  • fix integer wraparound in partial_ratio/partial_ratio_alignment

[2.0.14] - 2022-06-23

Fixed

  • fix unlimited recursion in LCSseq when used in combination with the process module

Changed

  • add fallback implementations of taskflow, rapidfuzz-cpp and jarowinkler-cpp back to wheel, since some package building systems like piwheels can't clone sources

[2.0.13] - 2022-06-22

Changed

  • use system version of cmake on arm platforms, since the cmake package fails to compile

[2.0.12] - 2022-06-22

Changed

  • add tests to sdist
  • remove cython dependency for sdist

[2.0.11] - 2022-04-23

Changed

  • relax version requirements of dependencies to simplify packaging

[2.0.10] - 2022-04-17

Fixed

  • Do not include installations of jaro_winkler in wheels (regression from 2.0.7)

Changed

  • Allow installation from system installed versions of rapidfuzz-cpp, jarowinkler-cpp and taskflow

Added

  • Added PyPy3.9 wheels on Linux

[2.0.9] - 2022-04-07

Fixed

  • Add missing Cython code in sdist
  • consider float imprecision in score_cutoff (see #210)

[2.0.8] - 2022-04-07

Fixed

  • fix incorrect score_cutoff handling in token_set_ratio and token_ratio

Added

  • add longest common subsequence

[2.0.7] - 2022-03-13

Fixed

  • Do not include installations of jaro_winkler and taskflow in wheels

[2.0.6] - 2022-03-06

Fixed

  • fix incorrect population of sys.modules which led to submodules overshadowing other imports

Changed

  • moved JaroWinkler and Jaro into a separate package

[2.0.5] - 2022-02-25

Fixed

  • fix signed integer overflow inside hashmap implementation

[2.0.4] - 2022-02-21

Fixed

  • fix binary size increase due to debug symbols
  • fix segmentation fault in Levenshtein.editops

[2.0.3] - 2022-02-18

Added

  • Added fuzz.partial_ratio_alignment, which returns the result of fuzz.partial_ratio combined with the alignment this result stems from

Fixed

  • Fix Indel distance returning incorrect result when using score_cutoff=1, when the strings are not equal. This affected other scorers like fuzz.WRatio, which use the Indel distance as well.

[2.0.2] - 2022-02-12

Fixed

  • fix type hints
  • Add back transpiled cython files to the sdist to simplify builds in package builders like FreeBSD port build or conda-forge

[2.0.1] - 2022-02-11

Fixed

  • fix type hints
  • Indel.normalized_similarity mistakenly used the implementation of Indel.normalized_distance

[2.0.0] - 2022-02-09

Added

  • added a C-API which can be used to extend RapidFuzz from different Python modules using any programming language that can use C-APIs (C/C++/Rust)
  • added new scorers in rapidfuzz.distance.*
    • port existing distances to this new api
    • add Indel distance along with the corresponding editops function

Changed

  • when the result of string_metric.levenshtein or string_metric.hamming is above max they now return max + 1 instead of -1
  • Build system moved from setuptools to scikit-build
  • Stop including all modules in __init__.py, since they significantly slowed down import time
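
The changed return value for results that exceed max can be sketched in pure Python as follows. `levenshtein_with_max` and its `max_dist` parameter are hypothetical names for illustration. The real functions can stop early once the limit is exceeded, which this sketch does not attempt.

```python
def levenshtein(s1, s2):
    """Plain two-row Levenshtein distance with uniform weights."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def levenshtein_with_max(s1, s2, max_dist=None):
    """Return the distance, or max_dist + 1 when it exceeds max_dist."""
    dist = levenshtein(s1, s2)
    if max_dist is not None and dist > max_dist:
        return max_dist + 1  # the 2.0.0 convention; older versions returned -1
    return dist


print(levenshtein_with_max("kitten", "sitting"))
print(levenshtein_with_max("kitten", "sitting", max_dist=1))
```

Returning max + 1 keeps the result ordered like a distance, so callers can still compare or sort results without special-casing a negative sentinel.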

Removed

  • remove the rapidfuzz.levenshtein module which was deprecated in v1.0.0 and scheduled for removal in v2.0.0
  • dropped support for Python2.7 and Python3.5

Deprecated

  • deprecate support to specify processor in form of a boolean (will be removed in v3.0.0)
    • new functions will not get support for this in the first place
  • deprecate rapidfuzz.string_metric (will be removed in v3.0.0). Similar scorers are available in rapidfuzz.distance.*

Fixed

  • process.cdist raised an exception when used with a pure Python scorer

Performance

  • improve performance and memory usage of rapidfuzz.string_metric.levenshtein_editops
    • memory usage is reduced by 33%
    • performance is improved by around 10%-20%
  • significantly improve performance of rapidfuzz.string_metric.levenshtein for max <= 31 using a banded implementation

[1.9.1] - 2021-12-13

Fixed

  • fix bug in new editops implementation, causing it to SegFault on some inputs (see qurator-spk/dinglehopper#64)

[1.9.0] - 2021-12-11

Fixed

  • Fix some issues in the type annotations (see #163)

Performance

  • improve performance and memory usage of rapidfuzz.string_metric.levenshtein_editops
    • memory usage is reduced by 10x
    • performance is improved from O(N * M) to O([N / 64] * M)

[1.8.3] - 2021-11-19

Added

  • Added missing wheels for Python3.6 on macOS and Windows (see #159)

[1.8.2] - 2021-10-27

Added

  • Add wheels for Python 3.10 on macOS

[1.8.1] - 2021-10-22

Fixed

  • Fix incorrect editops results (See #148)

[1.8.0] - 2021-10-20

Changed

  • Add wheels for Python3.10 on all platforms except macOS (see #141)
  • Improve performance of string_metric.jaro_similarity and string_metric.jaro_winkler_similarity for strings with a length <= 64

[1.7.1] - 2021-10-02

Fixed

  • fixed incorrect results of fuzz.partial_ratio for long needles (see #138)

[1.7.0] - 2021-09-27

Changed

  • Added typing for process.cdist
  • Added multithreading support to process.cdist using the argument workers
  • Add dtype argument to process.cdist to set the dtype of the result numpy array (see #132)
  • Use a better hash collision strategy in the internal hashmap, which improves the worst case performance

[1.6.2] - 2021-09-15

Changed

  • improved performance of fuzz.ratio
  • only import process.cdist when numpy is available

[1.6.1] - 2021-09-11

Changed

  • Add back wheels for Python2.7

[1.6.0] - 2021-09-10

Changed

  • fuzz.partial_ratio uses a new implementation for short needles (<= 64). This implementation is
    • more accurate than the previous implementation (it is guaranteed to find the optimal alignment)
    • significantly faster
  • Add process.cdist to compare all elements of two lists (see #51)
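
Conceptually, comparing all elements of two lists produces a score matrix with one row per query and one column per choice. The sketch below shows that shape in pure Python. `pairwise_scores` is a hypothetical helper, and `toy_ratio` uses `difflib` from the standard library as a stand-in scorer rather than a RapidFuzz scorer. The real process.cdist returns a numpy array rather than nested lists.

```python
from difflib import SequenceMatcher


def toy_ratio(s1, s2):
    """Stand-in scorer based on difflib: similarity in the range [0, 100]."""
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


def pairwise_scores(queries, choices, scorer=toy_ratio):
    """Score every query against every choice (rows: queries, columns: choices)."""
    return [[scorer(q, c) for c in choices] for q in queries]


queries = ["apple", "banana"]
choices = ["apple", "bananas", "cherry"]
matrix = pairwise_scores(queries, choices)

for row in matrix:
    print([round(score, 1) for score in row])
```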

[1.5.1] - 2021-09-01

Fixed

  • Fix out of bounds access in levenshtein_editops

[1.5.0] - 2021-08-21

Changed

  • all scorers now support similarity/distance calculations between any sequence of hashables. So it is possible to calculate e.g. the WER as:
>>> string_metric.levenshtein(["word1", "word2"], ["word1", "word3"])
1
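
To make the WER use case concrete without depending on the library, here is a pure-Python sketch: the Levenshtein distance is computed over word tokens and divided by the number of reference words. `word_error_rate` is a hypothetical helper, and splitting on whitespace is an assumption about tokenization.

```python
def levenshtein(seq1, seq2):
    """Levenshtein distance between two sequences of hashable items."""
    prev = list(range(len(seq2) + 1))
    for i, a in enumerate(seq1, 1):
        cur = [i]
        for j, b in enumerate(seq2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


print(levenshtein(["word1", "word2"], ["word1", "word3"]))
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```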

Added

  • Added type stub files for all functions
  • added jaro similarity in string_metric.jaro_similarity
  • added jaro winkler similarity in string_metric.jaro_winkler_similarity
  • added Levenshtein editops in string_metric.levenshtein_editops

Fixed

  • Fixed support for set objects in process.extract
  • Fixed inconsistent handling of empty strings

[1.4.1] - 2021-03-30

Performance

  • improved performance of result creation in process.extract

Fixed

  • Cython ABI stability issue (#95)
  • fix missing decref in case of exceptions in process.extract

[1.4.0] - 2021-03-29

Changed

  • added processor support to levenshtein and hamming
  • added distance support to extract/extractOne/extract_iter

Fixed

  • incorrect results of normalized_hamming and normalized_levenshtein when used with utils.default_process as processor

[1.3.3] - 2021-03-20

Fixed

  • Fix a bug in the mbleven implementation of the uniform Levenshtein distance and cover it with fuzz tests

[1.3.2] - 2021-03-20

Fixed

  • some of the newly activated warnings caused build failures in the conda-forge build

[1.3.1] - 2021-03-20

Fixed

  • Fixed issue in LCS calculation for partial_ratio (see #90)
  • Fixed incorrect results for normalized_hamming and normalized_levenshtein when the processor utils.default_process is used
  • Fix many compiler warnings

[1.3.0] - 2021-03-16

Changed

  • add wheels for a lot of new platforms
  • drop support for Python 2.7

Performance

  • use is instead of == to compare functions directly by address

Fixed

  • Fix another ref counting issue
  • Fix some issues in the Levenshtein distance algorithm (see #92)

[1.2.1] - 2021-03-08

Performance

  • further improve bitparallel implementation of uniform Levenshtein distance for strings with a length > 64 (in many cases more than 50% faster)

[1.2.0] - 2021-03-07

Changed

  • add more benchmarks to documentation

Performance

  • add bitparallel implementation to InDel Distance (Levenshtein with the weights 1,1,2) for strings with a length > 64
  • improve bitparallel implementation of uniform Levenshtein distance for strings with a length > 64
  • use the InDel Distance and uniform Levenshtein distance in more cases instead of the generic implementation
  • Directly use the Levenshtein implementation in C++ instead of using it through Python in process.*
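
The InDel distance mentioned above allows only insertions and deletions, which is why it matches Levenshtein with the weights 1,1,2: a substitution costs the same as one deletion plus one insertion. It can also be derived from the length of the longest common subsequence, as this quadratic-time sketch shows. This is not the bitparallel implementation referred to in the entries.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence (two-row dynamic programming)."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(s1, s2):
    """Insertions plus deletions needed to turn s1 into s2."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


print(indel_distance("ab", "ba"))
print(indel_distance("kitten", "sitting"))
```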

[1.1.2] - 2021-03-03

Fixed

  • Fix reference counting in process.extract (see #81)

[1.1.1] - 2021-02-23

Fixed

  • Fix result conversion in process.extract (see #79)

[1.1.0] - 2021-02-21

Changed

  • string_metric.normalized_levenshtein now supports all weights
  • when different weights are used for Insertion and Deletion the strings are not swapped inside the Levenshtein implementation anymore. So different weights for Insertion and Deletion are now supported.
  • replace C++ implementation with a Cython implementation. This has the following advantages:
    • The implementation is less error prone, since a lot of the complex things are done by Cython
    • slightly faster than the current implementation (up to 10% for some parts)
    • about 33% smaller binary size
    • reduced compile time
  • Added **kwargs argument to process.extract/extractOne/extract_iter that is passed to the scorer
  • Add max argument to hamming distance
  • Add support for whole Unicode range to utils.default_process

Performance

  • replaced Wagner Fischer usage in the normal Levenshtein distance with a bitparallel implementation

[1.0.2] - 2021-02-19

Fixed

  • The bitparallel LCS algorithm in fuzz.partial_ratio did not find the longest common substring properly in some cases. The old algorithm is used again until this bug is fixed.

[1.0.1] - 2021-02-17

Changed

  • string_metric.normalized_levenshtein now supports the weights (1, 1, N) with N >= 1

Performance

  • The Levenshtein distance with the weights (1, 1, >2) now uses the same implementation as the weights (1, 1, 2), since a substitution cost above Insertion + Deletion has no effect
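
Why a substitution weight above 2 changes nothing can be checked with a small weighted Levenshtein sketch: once a substitution costs more than one insertion plus one deletion, the minimum never picks it. `weighted_levenshtein` is a hypothetical helper that takes weights in the order (insertion, deletion, substitution). It is a generic table-based version, not the specialized implementation the entry refers to.

```python
def weighted_levenshtein(s1, s2, weights=(1, 1, 1)):
    """Levenshtein distance with (insertion, deletion, substitution) weights."""
    ins, dele, sub = weights
    prev = [j * ins for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, 1):
        cur = [i * dele]
        for j, b in enumerate(s2, 1):
            cur.append(min(
                prev[j] + dele,                        # delete a from s1
                cur[j - 1] + ins,                      # insert b into s1
                prev[j - 1] + (0 if a == b else sub),  # match or substitute
            ))
        prev = cur
    return prev[-1]


for w in [(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 1, 10)]:
    print(w, weighted_levenshtein("kitten", "sitting", w))
```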

Fixed

  • fix uninitialized variable in bitparallel Levenshtein distance with the weight (1, 1, 1)

[1.0.0] - 2021-02-12

Changed

  • all normalized string_metrics can now be used as scorer for process.extract/extractOne
  • Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
  • increased test coverage, which already helped to fix some bugs and will help to prevent regressions in the future
  • improved docstrings of functions

Performance

  • Added bit-parallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
  • Added a specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, which is even faster than the bit-parallel implementation.
  • Improved performance of fuzz.partial_ratio -> Since fuzz.ratio and fuzz.partial_ratio are used in most scorers, this improves the overall performance.
  • Improved performance of process.extract and process.extractOne
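
For context on the bit-parallel approach, here is a sketch of a Hyyrö/Myers style bit-vector Levenshtein distance for the weights (1,1,1). It encodes one column of the dynamic-programming table as bit vectors and updates the whole column with a handful of integer operations. Python integers have arbitrary size, so a single integer covers the whole pattern here. A C++ implementation would instead work on 64-bit words. This is an illustration under those assumptions, not the library's code.

```python
def bitparallel_levenshtein(s1, s2):
    """Uniform Levenshtein distance using bit-vector column updates."""
    if not s1:
        return len(s2)
    m = len(s1)
    mask = (1 << m) - 1  # keep only the m bits that belong to s1
    last = 1 << (m - 1)  # bit of the last row, which carries the score

    # Pattern match vectors: bit i is set where s1[i] equals the element.
    pm = {}
    for i, ch in enumerate(s1):
        pm[ch] = pm.get(ch, 0) | (1 << i)

    vp, vn = mask, 0  # vertical positive / negative deltas
    dist = m
    for ch in s2:
        x = pm.get(ch, 0)
        d0 = ((((x & vp) + vp) ^ vp) | x | vn) & mask
        hp = (vn | ~(d0 | vp)) & mask  # horizontal positive delta
        hn = d0 & vp                   # horizontal negative delta
        if hp & last:
            dist += 1
        if hn & last:
            dist -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return dist


print(bitparallel_levenshtein("kitten", "sitting"))
```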

Deprecated

  • the rapidfuzz.levenshtein module is now deprecated and will be removed in v2.0.0. Its functions are now placed in rapidfuzz.string_metric. distance, normalized_distance, weighted_distance and weighted_normalized_distance are combined into levenshtein and normalized_levenshtein.

Added

  • added normalized version of the hamming distance in string_metric.normalized_hamming
  • added process.extract_iter, a generator that yields the similarity of all elements that have a similarity >= score_cutoff

Fixed

  • multiple bugs in extractOne when used with a scorer that is not from RapidFuzz
  • fixed bug in token_ratio
  • fixed bug in result normalization causing zero division

[0.14.2] - 2020-12-31

Fixed

  • utf8 usage in the copyright header caused problems with python2.7 on some platforms (see #70)

[0.14.1] - 2020-12-13

Fixed

  • when a custom processor like lambda s: s was used with any of the methods inside fuzz.* it always returned a score of 100. This release fixes this and adds a better test coverage to prevent this bug in the future.

[0.14.0] - 2020-12-09

Added

  • added hamming distance metric in the levenshtein module

Performance

  • improved performance of default_process by using lookup table

[0.13.4] - 2020-11-30

Fixed

  • Add missing virtual destructor that caused a segmentation fault on macOS

[0.13.3] - 2020-11-21

Added

  • C++11 Support
  • manylinux wheels

[0.13.2] - 2020-11-21

Fixed

  • Levenshtein was not imported from __init__
  • The reference count of a Python object inside process.extractOne was decremented too early

[0.13.1] - 2020-11-17

Performance

  • process.extractOne exits early when a score of 100 is found. This way the other strings do not have to be preprocessed anymore.

[0.13.0] - 2020-11-16

Fixed

  • string objects passed to scorers had to be strings even before preprocessing them. This was changed, so they only have to be strings after preprocessing similar to process.extract/process.extractOne

Performance

  • process.extractOne is now implemented in C++ making it a lot faster
  • When token_sort_ratio or partial_token_sort_ratio is used in process.extractOne the words in the query are only sorted once to improve the runtime

Changed

  • process.extractOne/process.extract do now return the index of the match, when the choices are a list.

Removed

  • process.extractIndices got removed, since the indices are now already returned by process.extractOne/process.extract

[0.12.5] - 2020-10-26

Fixed

  • fix documentation of process.extractOne (see #48)

[0.12.4] - 2020-10-22

Added

  • Added wheels for
    • CPython 2.7 on Windows 64 bit
    • CPython 2.7 on Windows 32 bit
    • PyPy 2.7 on Windows 32 bit

[0.12.3] - 2020-10-09

Fixed

  • fix bug in partial_ratio (see #43)

[0.12.2] - 2020-10-01

Fixed

  • fix inconsistency with fuzzywuzzy in partial_ratio when using strings of equal length

[0.12.1] - 2020-09-30

Fixed

  • MSVC has a bug and therefore crashed on some of the templates used. This release simplifies the templates so compiling on MSVC works again

[0.12.0] - 2020-09-30

Performance

  • partial_ratio is using the Levenshtein distance now, which is a lot faster. Since many of the other algorithms use partial_ratio, this helps to improve the overall performance

[0.11.3] - 2020-09-22

Fixed

  • fix partial_token_set_ratio returning 100 all the time

[0.11.2] - 2020-09-12

Added

  • added rapidfuzz.__author__, rapidfuzz.__license__ and rapidfuzz.__version__

[0.11.1] - 2020-09-01

Fixed

  • do not use auto junk when searching the optimal alignment for partial_ratio

[0.11.0] - 2020-08-22

Changed

  • added support for Python 2.7 (see #40)
  • add wheels for Python 2.7 (both PyPy and CPython) on macOS and Linux

[0.10.0] - 2020-08-17

Changed

  • added wheels for Python3.9

Fixed

  • tuple scores in process.extractOne are now supported (see #39)