===============================
fuzzysearch
===============================

.. image:: https://img.shields.io/pypi/v/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Latest Version

.. image:: https://img.shields.io/travis/taleinat/fuzzysearch.svg?branch=master
    :target: https://travis-ci.org/taleinat/fuzzysearch/branches
    :alt: Build & Tests Status

.. image:: https://img.shields.io/coveralls/taleinat/fuzzysearch.svg?branch=master
    :target: https://coveralls.io/r/taleinat/fuzzysearch?branch=master
    :alt: Test Coverage

.. image:: https://img.shields.io/pypi/dm/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Downloads

.. image:: https://img.shields.io/pypi/wheel/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Wheels

.. image:: https://img.shields.io/pypi/pyversions/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python versions

.. image:: https://img.shields.io/pypi/implementation/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python implementations

.. image:: https://img.shields.io/pypi/l/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch/
    :alt: License

fuzzysearch is a Python library for fuzzy substring searches. It implements efficient
ad-hoc searching for approximate sub-sequences. Matching is done using a generalized
Levenshtein Distance metric, with configurable parameters.

* Free software: `MIT license <LICENSE>`_
* Documentation: http://fuzzysearch.rtfd.org.

Installation
------------
Just install using pip::

    $ pip install fuzzysearch

Features
--------

* Fuzzy sub-sequence search: Find parts of a sequence which match a given
  sub-sequence.
* Easy to use: A single function to call which returns a list of matches.
* Set a maximum Levenshtein Distance for matches, including individual limits
  for the number of substitutions, insertions and/or deletions allowed for
  near-matches.
* Includes optimized implementations for specific use-cases, e.g. allowing
  only substitutions; see the short sketch right after this list.
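
For example, the substitutions-only use-case can be expressed by allowing some
substitutions while setting `max_insertions` and `max_deletions` to zero.
A minimal sketch (the example string is made up; `find_near_matches()` itself
is introduced in the next section):

.. code:: python

    >>> from fuzzysearch import find_near_matches
    # allow up to one substitution, but no insertions or deletions
    >>> find_near_matches('PATTERN', '---PATXERN---',
    ...                   max_substitutions=1, max_insertions=0, max_deletions=0)
    [Match(start=3, end=10, dist=1)]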

Simple Examples
---------------
Just call `find_near_matches()` with the sequence to search, the sub-sequence
you're looking for, and the matching parameters:

.. code:: python

    >>> from fuzzysearch import find_near_matches
    # search for 'PATTERN' with a maximum Levenshtein Distance of 1
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1)]

.. code:: python

    >>> sequence = '''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG'''
    >>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1)]
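
Each returned `Match` is a simple record exposing `start`, `end` and `dist`
values, so the matching part of the searched sequence can be sliced out
directly. A small sketch continuing the example above:

.. code:: python

    >>> match = find_near_matches(subsequence, sequence, max_l_dist=2)[0]
    >>> sequence[match.start:match.end]
    'TAGCACTGTAGGGATAACAAT'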

Advanced Search Criteria
------------------------
The search function supports four possible match criteria, which may be supplied in any combination:

* maximum Levenshtein distance
* maximum # of substitutions
* maximum # of deletions (elements appearing in the pattern searched for which are skipped over in the matching sub-sequence)
* maximum # of insertions (elements added in the matching sub-sequence which don't appear in the pattern searched for)

Not supplying a criterion means that no limit is placed on it. For this reason, one must always supply either `max_l_dist`, all three of the other criteria, or both.

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1)]
    
    # this will not match, since max_deletions is set to zero
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
    []
    
    # note that a deletion + insertion may be combined to match a substitution
    >>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=1)] # the Levenshtein distance is still 1

    # ... but deletion + insertion may also match other, non-substitution differences
    >>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=2)]
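
Since `find_near_matches()` returns a list of matches, all near-matches found
in the searched sequence are available at once. A minimal sketch of iterating
over them (the text below is made up and is only meant to illustrate the call
pattern):

.. code:: python

    # a made-up text containing two approximate occurrences of the pattern
    text = '---PATERN---PATTERN---'
    for match in find_near_matches('PATTERN', text, max_l_dist=1):
        # print each match and the corresponding slice of the searched text
        print(match, text[match.start:match.end])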