===========
fuzzysearch
===========

.. image:: https://img.shields.io/pypi/v/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Latest Version

.. image:: https://img.shields.io/coveralls/taleinat/fuzzysearch.svg?branch=master
    :target: https://coveralls.io/r/taleinat/fuzzysearch?branch=master
    :alt: Test Coverage

.. image:: https://img.shields.io/pypi/wheel/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Wheels

.. image:: https://img.shields.io/pypi/pyversions/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python versions

.. image:: https://img.shields.io/pypi/implementation/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python implementations

.. image:: https://img.shields.io/pypi/l/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch/
    :alt: License

Fuzzy search: Find parts of long text or data, allowing for some
changes/typos.

Highly optimized, simple to use, does one thing well.

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

* Two simple functions to use: one for in-memory data and one for files

  * Fastest search algorithm is chosen automatically

* Levenshtein Distance metric with configurable parameters

  * Separately configure the max. allowed distance, substitutions, deletions
    and/or insertions

* Advanced algorithms with optional C and Cython optimizations

* Properly handles Unicode; special optimizations for binary data

* Simple installation:

  * ``pip install fuzzysearch`` just works

  * pure-Python fallbacks for compiled modules

  * only one dependency (``attrs``)

* Extensively tested

* Free software: `MIT license <LICENSE>`_

For more info, see the `documentation <http://fuzzysearch.rtfd.org>`_.


How is this different than FuzzyWuzzy or RapidFuzz?
---------------------------------------------------

The main difference is that fuzzysearch searches for fuzzy matches through
long texts or data. FuzzyWuzzy and RapidFuzz, on the other hand, are intended
for fuzzy comparison of pairs of strings, identifying how closely they match
according to some metric such as the Levenshtein distance.

These are very different use-cases, and the solutions are very different as
well.
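
To make the distinction concrete, here is a minimal, standard-library-only
sketch. ``difflib.SequenceMatcher`` stands in for a pairwise comparison
library, and ``best_fuzzy_occurrence`` is a hypothetical helper written for
this illustration; it uses a textbook dynamic-programming approach and is not
the algorithm fuzzysearch itself uses.

.. code:: python

    from difflib import SequenceMatcher


    def best_fuzzy_occurrence(pattern, text):
        """Return (distance, end) for the part of `text` closest to `pattern`.

        Unlike a plain Levenshtein comparison of two whole strings, a match
        may start anywhere in `text` at no cost.
        """
        # prev[i] is the edit distance between pattern[:i] and the best
        # substring of the text ending at the current position. Before any
        # text is read, that means dropping all i pattern characters.
        prev = list(range(len(pattern) + 1))
        best = (prev[-1], 0)
        for end, char in enumerate(text, start=1):
            curr = [0]  # an empty pattern prefix matches anywhere for free
            for i, pattern_char in enumerate(pattern, start=1):
                curr.append(min(
                    prev[i - 1] + (pattern_char != char),  # match or substitution
                    prev[i] + 1,      # extra character in the text
                    curr[i - 1] + 1,  # pattern character missing from the text
                ))
            if curr[-1] < best[0]:
                best = (curr[-1], end)
            prev = curr
        return best


    pattern = 'PATTERN'
    text = 'a long text with a PATERN hidden somewhere inside it'

    # Pairwise comparison: one similarity score for the two whole strings.
    whole_string_ratio = SequenceMatcher(None, pattern, text).ratio()

    # Search: where in the text is something close to the pattern?
    distance, end = best_fuzzy_occurrence(pattern, text)
    print(whole_string_ratio, distance, text[:end])

The pairwise score is low because most of the long text has nothing to do with
the pattern, while the search still locates the near match inside it.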


How is this different than ElasticSearch and Lucene?
----------------------------------------------------

The main difference is that fuzzysearch does no indexing or other
preparations; it directly searches through the given text or data for a given
sub-string. Therefore, it is much simpler to use compared to systems based on
text indexing.


Installation
------------

``fuzzysearch`` supports Python versions 3.8+, as well as PyPy 3.9 and 3.10.

.. code::

    $ pip install fuzzysearch

Installation succeeds even if the C and Cython extensions fail to build; in
that case, the pure-Python fallbacks are used instead.
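
A fallback of this kind is commonly implemented with a guarded import. The
sketch below shows the general pattern only; the module name
``_hypothetical_compiled_ext`` and the function ``count_differences`` are made
up for this illustration and are not fuzzysearch's actual module layout.

.. code:: python

    def count_differences_pure_python(a, b):
        """Count positions at which two equal-length sequences differ."""
        return sum(x != y for x, y in zip(a, b))


    try:
        # Hypothetical compiled module; the name is invented for this sketch.
        from _hypothetical_compiled_ext import count_differences
        USING_COMPILED_EXTENSION = True
    except ImportError:
        # The extension was not built or cannot be loaded, so use the
        # slower pure-Python function with the same behavior.
        count_differences = count_differences_pure_python
        USING_COMPILED_EXTENSION = False

    print(USING_COMPILED_EXTENSION, count_differences('PATTERN', 'PATXERN'))

Callers only ever use ``count_differences``, so they do not need to know which
implementation was loaded.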


Usage
-----
Just call ``find_near_matches()`` with the sub-sequence you're looking for,
the sequence to search, and the matching parameters:

.. code:: python

    >>> from fuzzysearch import find_near_matches
    >>> # search for 'PATTERN' with a maximum Levenshtein distance of 1
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

To search in a file, use ``find_near_matches_in_file()``:

.. code:: python

    >>> from fuzzysearch import find_near_matches_in_file
    >>> # 'data_file' is assumed to contain b'---PATERN---'
    >>> with open('data_file', 'rb') as f:
    ...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched=b'PATERN')]


Examples
--------

*fuzzysearch* is great for ad-hoc searches of genetic data, such as DNA or
protein sequences, before reaching for more complex tools:

.. code:: python

    >>> sequence = '''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG'''
    >>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

BioPython sequences are also supported:

.. code:: python

    >>> from Bio.Seq import Seq
    >>> # Bio.Alphabet was removed in Biopython 1.78; Seq takes only the data
    >>> sequence = Seq('''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG''')
    >>> subsequence = Seq('TGCACTGTAGGGATAACAAT')
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]


Matching Criteria
-----------------
The search function supports four possible match criteria, which may be
supplied in any combination:

* maximum Levenshtein distance (``max_l_dist``)

* maximum number of substitutions (``max_substitutions``)

* maximum number of deletions (``max_deletions``; "delete" = skip a character
  in the sub-sequence)

* maximum number of insertions (``max_insertions``; "insert" = skip a
  character in the sequence)

Not supplying a criterion means that there is no limit for it. For this reason,
one must always supply ``max_l_dist``, all three of the other criteria, or
both; otherwise the number of allowed changes would be unbounded.
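
The rule above can be written as a small check. This is only an illustration
of the rule as stated here, with ``None`` standing for "no limit";
``has_bounded_criteria`` is a hypothetical helper, not part of fuzzysearch,
and the library's own validation may differ.

.. code:: python

    def has_bounded_criteria(max_substitutions=None, max_insertions=None,
                             max_deletions=None, max_l_dist=None):
        """Return True if the criteria bound the total number of changes."""
        if max_l_dist is not None:
            return True  # the total distance is capped directly
        # Without max_l_dist, every individual edit type needs its own cap.
        return None not in (max_substitutions, max_insertions, max_deletions)


    print(has_bounded_criteria(max_l_dist=1))
    print(has_bounded_criteria(max_deletions=0))
    print(has_bounded_criteria(max_substitutions=0, max_insertions=1,
                               max_deletions=1))

Only the second call leaves substitutions and insertions unlimited, so it is
the only combination that the rule rejects.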

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

    # this will not match since max_deletions is set to zero
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
    []

    # note that a deletion + insertion may be combined to match a substitution;
    # the Levenshtein distance is still 1
    >>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=1, matched="PAT-ERN")]

    # ... but deletion + insertion may also match other, non-substitution differences
    >>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=2, matched="PATERRN")]
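
The ``dist`` values in these examples appear to be plain Levenshtein distances
between the pattern and the matched text. A short, unoptimized sketch can
confirm the numbers; ``levenshtein`` is a hypothetical helper written for this
illustration and is not part of fuzzysearch.

.. code:: python

    def levenshtein(a, b):
        """Return the Levenshtein distance between sequences `a` and `b`."""
        # prev[j] holds the distance between the previous prefix of `a`
        # and b[:j]; for the empty prefix that is simply j.
        prev = list(range(len(b) + 1))
        for i, a_item in enumerate(a, start=1):
            curr = [i]
            for j, b_item in enumerate(b, start=1):
                curr.append(min(
                    prev[j - 1] + (a_item != b_item),  # match or substitution
                    prev[j] + 1,      # a_item has no counterpart in b
                    curr[j - 1] + 1,  # b_item has no counterpart in a
                ))
            prev = curr
        return prev[-1]


    pattern = 'PATTERN'
    for matched in ('PATERN', 'PAT-ERN', 'PATERRN'):
        print(matched, levenshtein(pattern, matched))

The last case shows why a deletion plus an insertion can yield a distance of
2: no single change turns ``PATTERN`` into ``PATERRN``.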