RapidFuzz/CHANGELOG.rst

1197 lines
32 KiB
ReStructuredText

Changelog
---------
[3.10.1] - 2024-10-24
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
- fix compilation on clang-19
- fix incorrect results in simd optimized implementation of Levenshtein and OSA on 32bit targets
Added
~~~~~
* added support for taskflow 3.8.0
[3.10.0] - 2024-09-21
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
- drop support for Python 3.8
- switch build system to `scikit-build-core`
[3.9.7] - 2024-09-02
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix crash in ``cdist`` due to Visual Studio upgrade
[3.9.6] - 2024-08-06
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* upgrade to ``Cython==3.0.11``
* add python 3.13 wheels
[3.9.5] - 2024-07-29
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* include simd binaries in pyinstaller builds
* fix builds with setuptools 72 by upgrading `scikit-build`
[3.9.4] - 2024-07-02
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix bug in ``Levenshtein.editops`` and ``Levenshtein.opcodes`` which could lead
to incorrect results and crashes for some inputs
[3.9.3] - 2024-05-31
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix None handling for queries in ``process.cdist`` for scorers not supporting SIMD
[3.9.2] - 2024-05-28
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix supported versions of taskflow in cmake to be in the range v3.3 - v3.7
[3.9.1] - 2024-05-19
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* disable AVX2 on MacOS since it did lead to illegal instructions being generated
[3.9.0] - 2024-05-02
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* significantly improve type hints for the library
Fixed
~~~~~
* fix cmake version parsing
[3.8.1] - 2024-04-07
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* use the correct version of ``rapidfuzz-cpp`` when building against a system installed version
[3.8.0] - 2024-04-06
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* added ``process.cpdist`` which allows pairwise comparison of two collection of inputs
Fixed
~~~~~
- fix some minor errors in the type hints
- fix potentially incorrect results of JaroWinkler when using high prefix weights
[3.7.0] - 2024-03-21
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* reduce importtime
[3.6.2] - 2024-03-05
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* upgrade to ``Cython==3.0.9``
Fixed
~~~~~
* upgrade ``rapidfuzz-cpp`` which includes a fix for build issues on some compilers
* fix some issues with the sphinx config
[3.6.1] - 2023-12-28
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix overflow error on systems with ``sizeof(size_t) < 8``
[3.6.0] - 2023-12-26
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix pure python fallback implementation of ``fuzz.token_set_ratio``
* properly link with ``-latomic`` if ``std::atomic<uint64_t>`` is not natively supported
Performance
~~~~~~~~~~~
* add banded implementation of LCS / Indel. This improves the runtime from ``O((|s1|/64) * |s2|)`` to ``O((score_cutoff/64) * |s2|)``
Changed
~~~~~~~
* upgrade to ``Cython==3.0.7``
* cdist for many metrics now returns a matrix of ``uint32`` instead of ``int32`` by default
[3.5.2] - 2023-11-02
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* use _mm_malloc/_mm_free on macOS if aligned_alloc is unsupported
[3.5.1] - 2023-10-31
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix compilation failure on macOS
[3.5.0] - 2023-10-31
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* skip pandas ``pd.NA`` similar to ``None``
* add ``score_multiplier`` argument to ``process.cdist`` which allows multiplying the end result scores
with a constant factor.
* drop support for Python 3.7
Performance
~~~~~~~~~~~
* improve performance of simd implementation for ``LCS`` / ``Indel`` / ``Jaro`` / ``JaroWinkler``
* improve performance of Jaro and Jaro Winkler for long sequences
* implement ``process.extract`` with ``limit=1`` using ``process.extractOne`` which can be faster
Fixed
~~~~~
* the preprocessing function was always called through Python due to a broken C-API version check
* fix wraparound issue in simd implementation of Jaro and Jaro Winkler
[3.4.0] - 2023-10-09
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* upgrade to ``Cython==3.0.3``
* add simd implementation for Jaro and Jaro Winkler
[3.3.1] - 2023-09-25
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* add missing tag for python 3.12 support
[3.3.0] - 2023-09-11
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* upgrade to ``Cython==3.0.2``
* implement the remaining missing features from the C++ implementation in the pure Python implementation
Added
~~~~~
* added support for Python 3.12
[3.2.0] - 2023-08-02
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* build x86 with sse2/avx2 runtime detection
[3.1.2] - 2023-07-19
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* upgrade to ``Cython==3.0.0``
[3.1.1] - 2023-06-06
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* upgrade to ``taskflow==3.6``
Fixed
~~~~~
* replace usage of ``isnan`` with ``std::isnan`` which fixes the build on NetBSD
[3.1.0] - 2023-06-02
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* added keyword argument ``pad`` to Hamming distance. This controls whether sequences of different
length should be padded or lead to a ``ValueError``
* improve consistency of exception messages between the C++ and pure Python implementation
* upgrade required Cython version to ``Cython==3.0.0b3``
Fixed
~~~~~
* fix missing GIL restore when an exception is thrown inside ``process.cdist``
* fix incorrect type hints for the ``process`` module
[3.0.0] - 2023-04-16
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* allow the usage of ``Hamming`` for different string lengths. Length differences are handled as
insertions / deletions
* remove support for boolean preprocessor functions in ``rapidfuzz.fuzz`` and ``rapidfuzz.process``.
The processor argument is now always a callable or ``None``.
* update defaults of the processor argument to be ``None`` everywhere. For affected functions this can change results, since strings are no longer preprocessed.
To get back the old behaviour pass ``processor=utils.default_process`` to these functions.
The following functions are affected by this:
* ``process.extract``, ``process.extract_iter``, ``process.extractOne``
* ``fuzz.token_sort_ratio``, ``fuzz.token_set_ratio``, ``fuzz.token_ratio``, ``fuzz.partial_token_sort_ratio``, ``fuzz.partial_token_set_ratio``, ``fuzz.partial_token_ratio``, ``fuzz.WRatio``, ``fuzz.QRatio``
* ``rapidfuzz.process`` no longer calls scorers with ``processor=None``. For this reason user provided scorers no longer require this argument.
* remove option to pass keyword arguments to scorer via ``**kwargs`` in ``rapidfuzz.process``. They can be passed
via a ``scorer_kwargs`` argument now. This ensures this does not break when extending function parameters and
prevents naming clashes.
* remove ``rapidfuzz.string_metric`` module. Replacements for all functions are available in ``rapidfuzz.distance``
Added
~~~~~
* added support for arbitrary hashable sequence in the pure Python fallback implementation of all functions in ``rapidfuzz.distance``
* added support for ``None`` and ``float("nan")`` in ``process.cdist`` as long as the underlying scorer supports it.
This is the case for all scorers returning normalized results.
Fixed
~~~~~
* fix division by zero in simd implementation of normalized metrics leading to incorrect results
[2.15.1] - 2023-04-11
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix incorrect tag dispatching implementation leading to AVX2 instructions in the SSE2 code path
Added
~~~~~
* add wheels for windows arm64
[2.15.0] - 2023-04-01
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* allow the usage of finite generators as choices in ``process.extract``
[2.14.0] - 2023-03-31
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* upgrade required Cython version to ``Cython==3.0.0b2``
Fixed
~~~~~
* fix handling of non symmetric scorers in pure python version of ``process.cdist``
* fix default dtype handling when using ``process.cdist`` with pure python scorers
[2.13.7] - 2022-12-20
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~~~
* fix function signature of ``get_requires_for_build_wheel``
[2.13.6] - 2022-12-11
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* reformat changelog as restructured text to get rig of ``m2r2`` dependency
[2.13.5] - 2022-12-11
^^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* added docs to sdist
Fixed
~~~~~
* fix two cases of undefined behavior in ``process.cdist``
[2.13.4] - 2022-12-08
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* handle ``float("nan")`` similar to ``None`` for query / choice, since this is common for
non-existent data in tools like numpy
Fixed
~~~~~
* fix handling on ``None``\ /\ ``float("nan")`` in ``process.distance``
* use absolute imports inside tests
[2.13.3] - 2022-12-03
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* improve handling of functions wrapped using ``functools.wraps``
* fix broken fallback to Python implementation when the a ``ImportError`` occurs on import.
This can e.g. occur when the binary has a dependency on libatomic, but it is unavailable on
the system
* define ``CMAKE_C_COMPILER_AR``\ /\ ``CMAKE_CXX_COMPILER_AR``\ /\ ``CMAKE_C_COMPILER_RANLIB``\ /\ ``CMAKE_CXX_COMPILER_RANLIB``
if they are not defined yet
[2.13.2] - 2022-11-05
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix incorrect results in ``Hamming.normalized_similarity``
* fix incorrect score_cutoff handling in pure python implementation of
``Postfix.normalized_distance`` and ``Prefix.normalized_distance``
* fix ``Levenshtein.normalized_similarity`` and ``Levenshtein.normalized_distance``
when used in combination with the process module
* ``fuzz.partial_ratio`` was not always symmetric when ``len(s1) == len(s2)``
[2.13.1] - 2022-11-02
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix bug in ``normalized_similarity`` of most scorers,
leading to incorrect results when used in combination with the process module
* fix sse2 support
* fix bug in ``JaroWinkler`` and ``Jaro`` when used in the pure python process module
* forward kwargs in pure Python implementation of ``process.extract``
[2.13.0] - 2022-10-30
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix bug in ``Levenshtein.editops`` leading to crashes when used with ``score_hint``
Changed
~~~~~~~
* moved capi from ``rapidfuzz_capi`` into ``rapidfuzz``\ , since it will always
succeed the installation now that there is a pure Python mode
* add ``score_hint`` argument to process module
* add ``score_hint`` argument to Levenshtein module
[2.12.0] - 2022-10-24
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* drop support for Python 3.6
Added
~~~~~
* added ``Prefix``\ /\ ``Suffix`` similarity
Fixed
~~~~~
* fixed packaging with pyinstaller
[2.11.1] - 2022-10-05
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Fix segmentation fault in ``process.cdist`` when used with an empty query sequence
[2.11.0] - 2022-10-02
^^^^^^^^^^^^^^^^^^^^^
Changes
~~~~~~~
* move jarowinkler dependency into rapidfuzz to simplify maintenance
Performance
~~~~~~~~~~~
* add SIMD implementation for ``fuzz.ratio``\ /\ ``fuzz.QRatio``\ /\ ``Levenshtein``\ /\ ``Indel``\ /\ ``LCSseq``\ /\ ``OSA`` to improve
performance for short strings in cdist
[2.10.3] - 2022-09-30
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* use ``scikit-build=0.14.1`` on Linux, since ``scikit-build=0.15.0`` fails to find the Python Interpreter
* workaround gcc in bug in template type deduction
[2.10.2] - 2022-09-27
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix support for cmake versions below 3.17
[2.10.1] - 2022-09-25
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* modernize cmake build to fix most conda-forge builds
[2.10.0] - 2022-09-18
^^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* add editops to hamming distance
Performance
~~~~~~~~~~~
* strip common affix in osa distance
Fixed
~~~~~
* ignore missing pandas in Python 3.11 tests
[2.9.0] - 2022-09-16
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* add optimal string alignment (OSA)
[2.8.0] - 2022-09-11
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* ``fuzz.partial_ratio`` did not find the optimal alignment in some edge cases (#219)
Performance
~~~~~~~~~~~
* improve performance of ``fuzz.partial_ratio``
Changed
~~~~~~~
* increased minimum C++ version to C++17 (see #255)
[2.7.0] - 2022-09-11
^^^^^^^^^^^^^^^^^^^^
Performance
~~~~~~~~~~~
* improve performance of ``Levenshtein.distance``\ /\ ``Levenshtein.editops`` for
long sequences.
Added
~~~~~
* add ``score_hint`` parameter to ``Levenshtein.editops`` which allows the use of a
faster implementation
Changed
~~~~~~~
* all functions in the ``string_metric`` module do now raise a deprecation warning.
They are now only wrappers for their replacement functions, which makes them slower
when used with the process module
[2.6.1] - 2022-09-03
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix incorrect results of partial_ratio for long needles (#257)
[2.6.0] - 2022-08-20
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix hashing for custom classes
Added
~~~~~
* add support for slicing in ``Editops.__getitem__``\ /\ ``Editops.__delitem__``
* add ``DamerauLevenshtein`` module
[2.5.0] - 2022-08-14
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* added support for KeyboardInterrupt in processor module
It might still take a bit until the KeyboardInterrupt is registered, but
no longer runs all text comparisons after pressing ``Ctrl + C``
Fixed
~~~~~
* fix default scorer used by cdist to use C++ implementation if possible
[2.4.4] - 2022-08-12
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* Added support for Python 3.11
[2.4.3] - 2022-08-08
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix value range of ``jaro_similarity``\ /\ ``jaro_winkler_similarity`` in the pure Python mode
for the string_metric module
* fix missing atomix symbol on arm 32 bit
[2.4.2] - 2022-07-30
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* add missing symbol to pure Python which made the usage impossible
[2.4.1] - 2022-07-29
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix version number
[2.4.0] - 2022-07-29
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix banded Levenshtein implementation
Performance
~~~~~~~~~~~
* improve performance and memory usage of ``Levenshtein.editops``
* memory usage is reduced from O(NM) to O(N)
* performance is improved for long sequences
[2.3.0] - 2022-07-23
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* add ``as_matching_blocks`` to ``Editops``\ /\ ``Opcodes``
* add support for deletions from ``Editops``
* add ``Editops.apply``\ /\ ``Opcodes.apply``
* add ``Editops.remove_subsequence``
Changed
~~~~~~~
* merge adjacent similar blocks in ``Opcodes``
Fixed
~~~~~
* fix usage of ``eval(repr(Editop))``\ , ``eval(repr(Editops))``\ , ``eval(repr(Opcode))`` and ``eval(repr(Opcodes))``
* fix opcode conversion for empty source sequence
* fix validation for empty Opcode list passed into ``Opcodes.__init__``
[2.2.0] - 2022-07-19
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* added in-tree build backend to install cmake and ninja only when it is not installed yet
and only when wheels are available
[2.1.4] - 2022-07-17
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* changed internal implementation of cdist to remove build dependency to numpy
Added
~~~~~
* added wheels for musllinux and manylinux ppc64le, s390x
[2.1.3] - 2022-07-09
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix missing type stubs
[2.1.2] - 2022-07-04
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* change src layout to make package import from root directory possible
[2.1.1] - 2022-06-30
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* allow installation without the C++ extension if it fails to compile
* allow selection of implementation via the environment variable ``RAPIDFUZZ_IMPLEMENTATION``
which can be set to "cpp" or "python"
[2.1.0] - 2022-06-29
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* added pure python fallback for all implementations with the following exceptions:
* no support for sequences of hashables. Only strings supported so far
* ``\*.editops`` / ``\*.opcodes`` functions not implemented yet
* process.cdist does not support multithreading
Fixed
~~~~~
* fuzz.partial_ratio_alignment ignored the score_cutoff
* fix implementation of Hamming.normalized_similarity
* fix default score_cutoff of Hamming.similarity
* fix implementation of LCSseq.distance when used in the process module
* treat hash for -1 and -2 as different
[2.0.15] - 2022-06-24
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix integer wraparound in partial_ratio/partial_ratio_alignment
[2.0.14] - 2022-06-23
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix unlimited recursion in LCSseq when used in combination with the process module
Changed
~~~~~~~
* add fallback implementations of ``taskflow``\ , ``rapidfuzz-cpp`` and ``jarowinkler-cpp``
back to wheel, since some package building systems like piwheels can't clone sources
[2.0.13] - 2022-06-22
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* use system version of cmake on arm platforms, since the cmake package fails to compile
[2.0.12] - 2022-06-22
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* add tests to sdist
* remove cython dependency for sdist
[2.0.11] - 2022-04-23
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* relax version requirements of dependencies to simplify packaging
[2.0.10] - 2022-04-17
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Do not include installations of jaro_winkler in wheels (regression from 2.0.7)
Changed
~~~~~~~
* Allow installation from system installed versions of ``rapidfuzz-cpp``\ , ``jarowinkler-cpp``
and ``taskflow``
Added
~~~~~
* Added PyPy3.9 wheels on Linux
[2.0.9] - 2022-04-07
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Add missing Cython code in sdist
* consider float imprecision in score_cutoff (see #210)
[2.0.8] - 2022-04-07
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix incorrect score_cutoff handling in token_set_ratio and token_ratio
Added
~~~~~
* add longest common subsequence
[2.0.7] - 2022-03-13
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Do not include installations of jaro_winkler and taskflow in wheels
[2.0.6] - 2022-03-06
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix incorrect population of sys.modules which lead to submodules overshadowing
other imports
Changed
~~~~~~~
* moved JaroWinkler and Jaro into a separate package
[2.0.5] - 2022-02-25
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix signed integer overflow inside hashmap implementation
[2.0.4] - 2022-02-21
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix binary size increase due to debug symbols
* fix segmentation fault in ``Levenshtein.editops``
[2.0.3] - 2022-02-18
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* Added fuzz.partial_ratio_alignment, which returns the result of fuzz.partial_ratio
combined with the alignment this result stems from
Fixed
~~~~~
* Fix Indel distance returning incorrect result when using score_cutoff=1, when the strings
are not equal. This affected other scorers like fuzz.WRatio, which use the Indel distance
as well.
[2.0.2] - 2022-02-12
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix type hints
* Add back transpiled cython files to the sdist to simplify builds in package builders
like FreeBSD port build or conda-forge
[2.0.1] - 2022-02-11
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix type hints
* Indel.normalized_similarity mistakenly used the implementation of Indel.normalized_distance
[2.0.0] - 2022-02-09
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* added C-Api which can be used to extend RapidFuzz from different Python modules using any
programming language which allows the usage of C-Apis (C/C++/Rust)
* added new scorers in ``rapidfuzz.distance.*``
* port existing distances to this new api
* add Indel distance along with the corresponding editops function
Changed
~~~~~~~
* when the result of ``string_metric.levenshtein`` or ``string_metric.hamming`` is below max
they do now return ``max + 1`` instead of -1
* Build system moved from setuptools to scikit-build
* Stop including all modules in __init__.py, since they significantly slowed down import time
Removed
~~~~~~~
* remove the ``rapidfuzz.levenshtein`` module which was deprecated in v1.0.0 and scheduled for removal in v2.0.0
* dropped support for Python2.7 and Python3.5
Deprecated
~~~~~~~~~~
* deprecate support to specify processor in form of a boolean (will be removed in v3.0.0)
* new functions will not get support for this in the first place
* deprecate ``rapidfuzz.string_metric`` (will be removed in v3.0.0). Similar scorers are available
in ``rapidfuzz.distance.*``
Fixed
~~~~~
* process.cdist did raise an exception when used with a pure python scorer
Performance
~~~~~~~~~~~
* improve performance and memory usage of ``rapidfuzz.string_metric.levenshtein_editops``
* memory usage is reduced by 33%
* performance is improved by around 10%-20%
* significantly improve performance of ``rapidfuzz.string_metric.levenshtein`` for ``max <= 31``
using a banded implementation
[1.9.1] - 2021-12-13
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix bug in new editops implementation, causing it to SegFault on some inputs (see qurator-spk/dinglehopper#64)
[1.9.0] - 2021-12-11
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Fix some issues in the type annotations (see #163)
Performance
~~~~~~~~~~~
* improve performance and memory usage of ``rapidfuzz.string_metric.levenshtein_editops``
* memory usage is reduced by 10x
* performance is improved from ``O(N * M)`` to ``O([N / 64] * M)``
[1.8.3] - 2021-11-19
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* Added missing wheels for Python3.6 on MacOs and Windows (see #159)
[1.8.2] - 2021-10-27
^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* Add wheels for Python 3.10 on MacOs
[1.8.1] - 2021-10-22
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Fix incorrect editops results (See #148)
[1.8.0] - 2021-10-20
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* Add Wheels for Python3.10 on all platforms except MacOs (see #141)
* Improve performance of ``string_metric.jaro_similarity`` and ``string_metric.jaro_winkler_similarity`` for strings with a length <= 64
[1.7.1] - 2021-10-02
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fixed incorrect results of fuzz.partial_ratio for long needles (see #138)
[1.7.0] - 2021-09-27
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* Added typing for process.cdist
* Added multithreading support to cdist using the argument ``process.cdist``
* Add dtype argument to ``process.cdist`` to set the dtype of the result numpy array (see #132)
* Use a better hash collision strategy in the internal hashmap, which improves the worst case performance
[1.6.2] - 2021-09-15
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* improved performance of fuzz.ratio
* only import process.cdist when numpy is available
[1.6.1] - 2021-09-11
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* Add back wheels for Python2.7
[1.6.0] - 2021-09-10
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* fuzz.partial_ratio uses a new implementation for short needles (<= 64). This implementation is
* more accurate than the current implementation (it is guaranteed to find the optimal alignment)
* it is significantly faster
* Add process.cdist to compare all elements of two lists (see #51)
[1.5.1] - 2021-09-01
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Fix out of bounds access in levenshtein_editops
[1.5.0] - 2021-08-21
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* all scorers do now support similarity/distance calculations between any sequence of hashables. So it is possible to calculate e.g. the WER as:
.. code-block::
>>> string_metric.levenshtein(["word1", "word2"], ["word1", "word3"])
1
Added
~~~~~
* Added type stub files for all functions
* added jaro similarity in ``string_metric.jaro_similarity``
* added jaro winkler similarity in ``string_metric.jaro_winkler_similarity``
* added Levenshtein editops in ``string_metric.levenshtein_editops``
Fixed
~~~~~
* Fixed support for set objects in ``process.extract``
* Fixed inconsistent handling of empty strings
[1.4.1] - 2021-03-30
^^^^^^^^^^^^^^^^^^^^
Performance
~~~~~~~~~~~
* improved performance of result creation in process.extract
Fixed
~~~~~
* Cython ABI stability issue (#95)
* fix missing decref in case of exceptions in process.extract
[1.4.0] - 2021-03-29
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* added processor support to ``levenshtein`` and ``hamming``
* added distance support to extract/extractOne/extract_iter
Fixed
~~~~~
* incorrect results of ``normalized_hamming`` and ``normalized_levenshtein`` when used with ``utils.default_process`` as processor
[1.3.3] - 2021-03-20
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Fix a bug in the mbleven implementation of the uniform Levenshtein distance and cover it with fuzz tests
[1.3.2] - 2021-03-20
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* some of the newly activated warnings caused build failures in the conda-forge build
[1.3.1] - 2021-03-20
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Fixed issue in LCS calculation for partial_ratio (see #90)
* Fixed incorrect results for normalized_hamming and normalized_levenshtein when the processor ``utils.default_process`` is used
* Fix many compiler warnings
[1.3.0] - 2021-03-16
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* add wheels for a lot of new platforms
* drop support for Python 2.7
Performance
~~~~~~~~~~~
* use ``is`` instead of ``==`` to compare functions directly by address
Fixed
~~~~~
* Fix another ref counting issue
* Fix some issues in the Levenshtein distance algorithm (see #92)
[1.2.1] - 2021-03-08
^^^^^^^^^^^^^^^^^^^^
Performance
~~~~~~~~~~~
* further improve bitparallel implementation of uniform Levenshtein distance for strings with a length > 64 (in many cases more than 50% faster)
[1.2.0] - 2021-03-07
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* add more benchmarks to documentation
Performance
~~~~~~~~~~~
* add bitparallel implementation to InDel Distance (Levenshtein with the weights 1,1,2) for strings with a length > 64
* improve bitparallel implementation of uniform Levenshtein distance for strings with a length > 64
* use the InDel Distance and uniform Levenshtein distance in more cases instead of the generic implementation
* Directly use the Levenshtein implementation in C++ instead of using it through Python in process.*
[1.1.2] - 2021-03-03
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Fix reference counting in process.extract (see #81)
[1.1.1] - 2021-02-23
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Fix result conversion in process.extract (see #79)
[1.1.0] - 2021-02-21
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* string_metric.normalized_levenshtein supports now all weights
* when different weights are used for Insertion and Deletion the strings are not swapped inside the Levenshtein implementation anymore. So different weights for Insertion and Deletion are now supported.
* replace C++ implementation with a Cython implementation. This has the following advantages:
* The implementation is less error prone, since a lot of the complex things are done by Cython
* slightly faster than the current implementation (up to 10% for some parts)
* about 33% smaller binary size
* reduced compile time
* Added \*\*kwargs argument to process.extract/extractOne/extract_iter that is passed to the scorer
* Add max argument to hamming distance
* Add support for whole Unicode range to utils.default_process
Performance
~~~~~~~~~~~
* replaced Wagner Fischer usage in the normal Levenshtein distance with a bitparallel implementation
[1.0.2] - 2021-02-19
^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* The bitparallel LCS algorithm in fuzz.partial_ratio did not find the longest common substring properly in some cases.
The old algorithm is used again until this bug is fixed.
[1.0.1] - 2021-02-17
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* string_metric.normalized_levenshtein supports now the weights (1, 1, N) with N >= 1
Performance
~~~~~~~~~~~
* The Levenshtein distance with the weights (1, 1, >2) do now use the same implementation as the weight (1, 1, 2), since
``Substitution > Insertion + Deletion`` has no effect
Fixed
~~~~~
* fix uninitialized variable in bitparallel Levenshtein distance with the weight (1, 1, 1)
[1.0.0] - 2021-02-12
^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* all normalized string_metrics can now be used as scorer for process.extract/extractOne
* Implementation of the C++ Wrapper completely refactored to make it easier to add more scorers, processors and string matching algorithms in the future.
* increased test coverage, that already helped to fix some bugs and help to prevent regressions in the future
* improved docstrings of functions
Performance
~~~~~~~~~~~
* Added bit-parallel implementation of the Levenshtein distance for the weights (1,1,1) and (1,1,2).
* Added specialized implementation of the Levenshtein distance for cases with a small maximum edit distance, that is even faster, than the bit-parallel implementation.
* Improved performance of ``fuzz.partial_ratio``
-> Since ``fuzz.ratio`` and ``fuzz.partial_ratio`` are used in most scorers, this improves the overall performance.
* Improved performance of ``process.extract`` and ``process.extractOne``
Deprecated
~~~~~~~~~~
* the ``rapidfuzz.levenshtein`` module is now deprecated and will be removed in v2.0.0
These functions are now placed in ``rapidfuzz.string_metric``. ``distance``\ , ``normalized_distance``\ , ``weighted_distance`` and ``weighted_normalized_distance`` are combined into ``levenshtein`` and ``normalized_levenshtein``.
Added
~~~~~
* added normalized version of the hamming distance in ``string_metric.normalized_hamming``
* process.extract_iter as a generator, that yields the similarity of all elements, that have a similarity >= score_cutoff
Fixed
~~~~~
* multiple bugs in extractOne when used with a scorer, that's not from RapidFuzz
* fixed bug in ``token_ratio``
* fixed bug in result normalization causing zero division
[0.14.2] - 2020-12-31
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* utf8 usage in the copyright header caused problems with python2.7 on some platforms (see #70)
[0.14.1] - 2020-12-13
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* when a custom processor like ``lambda s: s`` was used with any of the methods inside fuzz.* it always returned a score of 100. This release fixes this and adds a better test coverage to prevent this bug in the future.
[0.14.0] - 2020-12-09
^^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* added hamming distance metric in the levenshtein module
Performance
~~~~~~~~~~~
* improved performance of default_process by using lookup table
[0.13.4] - 2020-11-30
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Add missing virtual destructor that caused a segmentation fault on Mac Os
[0.13.3] - 2020-11-21
^^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* C++11 Support
* manylinux wheels
[0.13.2] - 2020-11-21
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* Levenshtein was not imported from __init__
* The reference count of a Python Object inside process.extractOne was decremented to early
[0.13.1] - 2020-11-17
^^^^^^^^^^^^^^^^^^^^^
Performance
~~~~~~~~~~~
* process.extractOne exits early when a score of 100 is found. This way the other strings do not have to be preprocessed anymore.
[0.13.0] - 2020-11-16
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* string objects passed to scorers had to be strings even before preprocessing them. This was changed, so they only have to be strings after preprocessing similar to process.extract/process.extractOne
Performance
~~~~~~~~~~~
* process.extractOne is now implemented in C++ making it a lot faster
* When token_sort_ratio or partial_token_sort ratio is used inprocess.extractOne the words in the query are only sorted once to improve the runtime
Changed
~~~~~~~
* process.extractOne/process.extract do now return the index of the match, when the choices are a list.
Removed
~~~~~~~
* process.extractIndices got removed, since the indices are now already returned by process.extractOne/process.extract
[0.12.5] - 2020-10-26
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix documentation of process.extractOne (see #48)
[0.12.4] - 2020-10-22
^^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* Added wheels for
* CPython 2.7 on windows 64 bit
* CPython 2.7 on windows 32 bit
* PyPy 2.7 on windows 32 bit
[0.12.3] - 2020-10-09
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix bug in partial_ratio (see #43)
[0.12.2] - 2020-10-01
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix inconsistency with fuzzywuzzy in partial_ratio when using strings of equal length
[0.12.1] - 2020-09-30
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* MSVC has a bug and therefore crashed on some of the templates used. This Release simplifies the templates so compiling on msvc works again
[0.12.0] - 2020-09-30
^^^^^^^^^^^^^^^^^^^^^
Performance
~~~~~~~~~~~
* partial_ratio is using the Levenshtein distance now, which is a lot faster. Since many of the other algorithms use partial_ratio, this helps to improve the overall performance
[0.11.3] - 2020-09-22
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* fix partial_token_set_ratio returning 100 all the time
[0.11.2] - 2020-09-12
^^^^^^^^^^^^^^^^^^^^^
Added
~~~~~
* added rapidfuzz.__author__, rapidfuzz.__license__ and rapidfuzz.__version__
[0.11.1] - 2020-09-01
^^^^^^^^^^^^^^^^^^^^^
Fixed
~~~~~
* do not use auto junk when searching the optimal alignment for partial_ratio
[0.11.0] - 2020-08-22
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* support for python 2.7 added #40
* add wheels for python2.7 (both pypy and cpython) on MacOS and Linux
[0.10.0] - 2020-08-17
^^^^^^^^^^^^^^^^^^^^^
Changed
~~~~~~~
* added wheels for Python3.9
Fixed
~~~~~
* tuple scores in process.extractOne are now supported #39