===========
fuzzysearch
===========

.. image:: https://img.shields.io/pypi/v/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Latest Version

.. image:: https://img.shields.io/coveralls/taleinat/fuzzysearch.svg?branch=master
    :target: https://coveralls.io/r/taleinat/fuzzysearch?branch=master
    :alt: Test Coverage

.. image:: https://img.shields.io/pypi/wheel/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Wheels

.. image:: https://img.shields.io/pypi/pyversions/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python versions

.. image:: https://img.shields.io/pypi/implementation/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python implementations

.. image:: https://img.shields.io/pypi/l/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch/
    :alt: License

Fuzzy search: Find parts of long text or data, allowing for some
changes/typos.

**Easy, fast, and just works!**

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

* Two simple functions to use: one for in-memory data and one for files
* Fastest search algorithm is chosen automatically
* Levenshtein Distance metric with configurable parameters
* Separately configure the max. allowed distance, substitutions, deletions
  and/or insertions
* Advanced algorithms with optional C and Cython optimizations
* Properly handles Unicode; special optimizations for binary data
* Simple installation:

  * ``pip install fuzzysearch`` just works
  * pure-Python fallbacks for compiled modules
  * only one dependency (``attrs``)

* Extensively tested
* Free software: MIT license

For more info, see the documentation.


Installation
------------

``fuzzysearch`` supports Python versions 3.8+, as well as PyPy 3.9 and 3.10.

.. code::

    $ pip install fuzzysearch

Installation succeeds even if building the C and Cython extensions fails;
the pure-Python fallbacks are used instead.


Usage
-----

Just call ``find_near_matches()`` with the sub-sequence you're looking for,
the sequence to search, and the matching parameters:

.. code:: python

    >>> from fuzzysearch import find_near_matches
    # search for 'PATTERN' with a maximum Levenshtein Distance of 1
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

To search in a file, use ``find_near_matches_in_file()`` similarly:

.. code:: python

    >>> from fuzzysearch import find_near_matches_in_file
    >>> with open('data_file', 'rb') as f:
    ...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]


Examples
--------

*fuzzysearch* is great for ad-hoc searches of genetic data, such as DNA or
protein sequences, before reaching for "heavier", domain-specific tools like
BioPython:

.. code:: python

    >>> sequence = '''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG'''
    >>> subsequence = 'TGCACTGTAGGGATAACAAT'  # distance = 1
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

BioPython sequences are also supported:

.. code:: python

    # Note: Bio.Alphabet was removed in Biopython 1.78, so this example
    # requires an older Biopython release.
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> sequence = Seq('''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG''', IUPAC.unambiguous_dna)
    >>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]


Matching Criteria
-----------------

The search function supports four possible match criteria, which may be
supplied in any combination:

* maximum Levenshtein distance (``max_l_dist``)
* maximum number of substitutions (``max_substitutions``)
* maximum number of deletions (``max_deletions``; "delete" = skip a character
  in the sub-sequence)
* maximum number of insertions (``max_insertions``; "insert" = skip a
  character in the sequence)

Not supplying a criterion means that there is no limit for it. For this
reason, one must always supply ``max_l_dist`` and/or all other criteria.

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

    # this will not match since max-deletions is set to zero
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
    []

    # note that a deletion + insertion may be combined to match a substitution
    >>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=1, matched="PAT-ERN")]  # the Levenshtein distance is still 1

    # ... but deletion + insertion may also match other, non-substitution differences
    >>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=2, matched="PATERRN")]


When to Use Other Tools
-----------------------

* Use case: Search through a list of strings for almost-exactly matching
  strings. For example, searching through a list of names for possible slight
  variations of a certain name.

  Suggestion: Consider using fuzzywuzzy.
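
As a rough illustration of the list-of-names use case above, here is a minimal
sketch that uses only the standard library's ``difflib.get_close_matches()``.
It is a stand-in for illustration and does not show fuzzywuzzy's API; the
helper name ``find_similar_names``, the sample names and the ``cutoff`` value
are all hypothetical.

.. code:: python

    from difflib import get_close_matches


    def find_similar_names(name, candidates, cutoff=0.8):
        """Return the candidates that closely resemble ``name``, best first.

        Hypothetical helper for illustration only: it compares whole strings,
        rather than searching inside a longer text as fuzzysearch does.
        """
        # n lifts difflib's default cap of 3 results (it must be positive);
        # cutoff is the minimum similarity ratio (0.0 to 1.0) to keep a name.
        n = max(1, len(candidates))
        return get_close_matches(name, candidates, n=n, cutoff=cutoff)


    names = ['Jonathon', 'Johnathan', 'Nathan', 'Joanna', 'Maria']
    similar = find_similar_names('Jonathan', names)
    print(similar)

With these sample names, only the two spelling variants of "Jonathan" clear
the cutoff. Lowering ``cutoff`` admits looser matches, at the cost of more
false positives.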