fuzzysearch

Find parts of long text or data, allowing for some changes/typos.

===============================
fuzzysearch
===============================

.. image:: https://badge.fury.io/py/fuzzysearch.png
    :target: http://badge.fury.io/py/fuzzysearch

.. image:: https://travis-ci.org/taleinat/fuzzysearch.png?branch=master
    :target: https://travis-ci.org/taleinat/fuzzysearch

.. image:: https://coveralls.io/repos/taleinat/fuzzysearch/badge.png?branch=master
    :target: https://coveralls.io/r/taleinat/fuzzysearch?branch=master

.. image:: https://pypip.in/d/fuzzysearch/badge.png
    :target: https://crate.io/packages/fuzzysearch?version=latest

fuzzysearch is useful for finding approximate subsequence matches.

* Free software: `MIT license <LICENSE>`_
* Documentation: http://fuzzysearch.rtfd.org

Features
--------

* Fuzzy sub-sequence search: Find parts of a sequence which match a given
  sub-sequence up to a given maximum Levenshtein distance.
* Set individual limits for the number of substitutions, insertions and/or
  deletions allowed for a near-match.
* Includes optimized implementations for specific use-cases, e.g. only allowing
  substitutions in near-matches.
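
To illustrate what the first feature means, here is a minimal pure-Python
sketch of fuzzy sub-sequence search based on dynamic programming. It is not
the library's implementation: the function name ``near_match_ends`` and its
return format are assumptions made only for this example, and it reports
where near-matches end rather than full match objects.

.. code:: python

    def near_match_ends(subsequence, sequence, max_l_dist):
        """Return (end, dist) pairs for near-matches ending at sequence[:end].

        dist is the smallest Levenshtein distance between the sub-sequence
        and any part of the sequence that ends at index ``end``.
        """
        m = len(subsequence)
        # prev[i]: lowest cost of aligning subsequence[:i] with some part of
        # the sequence ending at the previous position.
        prev = list(range(m + 1))
        results = []
        for end, char in enumerate(sequence, start=1):
            # A match may start anywhere, so the empty prefix always costs 0.
            cur = [0]
            for i in range(1, m + 1):
                cost = 0 if subsequence[i - 1] == char else 1
                cur.append(min(
                    prev[i - 1] + cost,  # exact match or substitution
                    prev[i] + 1,         # extra item in the sequence
                    cur[i - 1] + 1,      # item missing from the sequence
                ))
            if cur[m] <= max_l_dist:
                results.append((end, cur[m]))
            prev = cur
        return results

For instance, ``near_match_ends('PATTERN', 'aaaPATERNaaa', 1)`` returns
``[(9, 1)]``: one near-match ending at index 9 with a distance of 1. The
sketch runs in time proportional to the product of the two lengths, which is
one reason specialized implementations for narrower cases can be worthwhile.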

Simple Example
--------------
You can usually just use the ``find_near_matches()`` utility function, which
chooses a suitable fuzzy search implementation according to the given
parameters:

.. code:: python

    >>> from fuzzysearch import find_near_matches
    >>> find_near_matches('PATTERN', 'aaaPATERNaaa', max_l_dist=1)
    [Match(start=3, end=9, dist=1)]
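
In the result above, ``start`` and ``end`` are consistent with ordinary
Python slice indices (``end`` is exclusive). The sketch below shows how the
matched text could be recovered from such a result. It uses a stand-in
``Match`` namedtuple, which is an assumption made so that the example runs
without the library installed; only the field names are taken from the output
shown above.

.. code:: python

    from collections import namedtuple

    # Stand-in for the library's result type (assumed here for illustration).
    Match = namedtuple('Match', ['start', 'end', 'dist'])

    text = 'aaaPATERNaaa'
    matches = [Match(start=3, end=9, dist=1)]  # copied from the example above

    # Slice the searched text with each match's start and end indices.
    matched = [text[m.start:m.end] for m in matches]
    print(matched)  # ['PATERN']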

Advanced Example
----------------
If needed, you can choose a specific search implementation, such as
``find_near_matches_with_ngrams()``:

.. code:: python

    >>> sequence = '''\
    ... GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    ... TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    ... CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    ... GGGATAGG'''
    >>> subsequence = 'TGCACTGTAGGGATAACAAT'  # Levenshtein distance 1 from sequence[3:24]
    >>> max_distance = 2

    >>> from fuzzysearch import find_near_matches_with_ngrams
    >>> find_near_matches_with_ngrams(subsequence, sequence, max_distance)
    [Match(start=3, end=24, dist=1)]
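
The general idea behind n-gram based search can be sketched as follows. If
the sub-sequence is split into ``max_l_dist + 1`` non-overlapping pieces, at
most ``max_l_dist`` of them can be affected by edits, so at least one piece
must appear unchanged in any near-match. This is a simplified illustration
under that assumption, not the library's code: the helper name
``ngram_candidate_windows`` is hypothetical, and it only locates candidate
regions, which would still need to be verified with a full distance
computation.

.. code:: python

    def ngram_candidate_windows(subsequence, sequence, max_l_dist):
        """Return sorted (start, end) windows that may contain a near-match."""
        ngram_len = len(subsequence) // (max_l_dist + 1)
        if ngram_len == 0:
            # Not enough items to form max_l_dist + 1 non-empty pieces.
            raise ValueError('the sub-sequence is too short')

        windows = set()
        for ngram_start in range(0, ngram_len * (max_l_dist + 1), ngram_len):
            ngram = subsequence[ngram_start:ngram_start + ngram_len]
            # Number of sub-sequence items that follow this piece.
            tail_len = len(subsequence) - ngram_start - ngram_len
            pos = sequence.find(ngram)
            while pos != -1:
                # Widen the window by max_l_dist on each side to leave room
                # for insertions before and after the exact piece.
                start = max(0, pos - ngram_start - max_l_dist)
                end = min(len(sequence),
                          pos + ngram_len + tail_len + max_l_dist)
                windows.add((start, end))
                pos = sequence.find(ngram, pos + 1)
        return sorted(windows)

For example, ``ngram_candidate_windows('PATTERN', 'aaaPATERNaaa', 1)``
returns ``[(1, 10), (2, 11)]``, and both windows contain the near-match
``'PATERN'`` at indices 3 to 9. A very short sub-sequence combined with a
large maximum distance leaves no room for non-empty pieces, which is why the
sketch raises ``ValueError`` in that case.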