Commit Graph

68 Commits

Author SHA1 Message Date
Ines Montani d5c78c7a34 Update docs and fix consistency 2020-08-09 22:31:52 +02:00
Ines Montani 56c17973aa Use "raise ... from" in custom errors for better tracebacks 2020-08-05 23:53:21 +02:00
Ines Montani e257e66ab9 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-29 11:36:45 +02:00
Ines Montani e0ffe36e79 Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
Sofie Van Landeghem 40c995b1be
Option for returning only greedy matches (#5771)
* add "greedy" option for match pattern

* distinction between greedy FIRST or LONGEST

* check for proper values, throw custom warning otherwise

* unxfail one more test

* add comment in docstring

* add test that LONGEST also prefers first match if equal length

* use c arrays for more efficient processing

* rename 'greediness' to 'greedy'
2020-07-29 11:04:43 +02:00
Ines Montani 5f6f4ff594 Remove object subclassing 2020-07-12 14:03:23 +02:00
Ines Montani 412dbb1f38
Remove dead and/or deprecated code (#5710)
* Remove dead and/or deprecated code

* Remove n_threads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-06 13:06:25 +02:00
Ines Montani 52728d8fa3 Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
Ines Montani 810fce3bb1 Merge branch 'develop' into master-tmp 2020-06-03 14:36:59 +02:00
Adriane Boyd e06ca7ea24 Switch to new add API in PhraseMatcher unpickle 2020-05-25 11:22:47 +02:00
Ines Montani 1a15896ba9 unicode -> str consistency [ci skip] 2020-05-24 18:51:10 +02:00
Ines Montani 5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Ines Montani 24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
Ines Montani a9cb2882cb
Rename argument: doc_or_span/obj -> doclike (#5463)
* doc_or_span -> obj

* Revert "doc_or_span -> obj"

This reverts commit 78bb9ff5e0.

* obj -> doclike

* Refer to correct object
2020-05-21 15:17:39 +02:00
adrianeboyd 3f43c73d37
Normalize TokenC.sent_start values for Matcher (#5346)
Normalize TokenC.sent_start values to booleans for the `Matcher`.
2020-04-29 12:57:30 +02:00
Adriane Boyd bc39f97e11 Simplify warnings 2020-04-28 13:37:37 +02:00
Paolo Arduin 1ca32d8f9c
Matcher support for Span as well as Doc (#5113)
* Matcher support for Span, as well as Doc #5056

* Removes an import unused

* Signed contributors agreement

* Code optimization and better test

* Add error message for bad Matcher call argument

* Fix merging
2020-04-15 13:51:33 +02:00
Paolo Arduin 8ce408d2e1
Comparison predicate handling for `!=` (#5282)
* Fix #5281

* Optim test
2020-04-14 19:14:15 +02:00
Ines Montani 46568f40a7 Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
Ines Montani b0cfab317f Merge branch 'develop' into refactor/simplify-warnings 2020-03-04 16:38:55 +01:00
adrianeboyd 697bec764d
Normalize IS_SENT_START to SENT_START for Matcher (#5080) 2020-03-03 12:22:39 +01:00
Ines Montani 648f61d077
Tidy up compiler flags and imports (#5071) 2020-03-02 11:48:10 +01:00
Ines Montani 37691e6d5d Simplify warnings 2020-02-28 12:20:23 +01:00
Ines Montani 33a2682d60
Add better schemas and validation using Pydantic (#4831)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Add better schemas and validation using Pydantic

* Revert lookups.md

* Remove unused import

* Update spacy/schemas.py

Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com>

* Various small fixes

* Fix docstring

Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>
2019-12-25 12:39:49 +01:00
Ines Montani db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
adrianeboyd c208eb6e4d Fix int value handling in Matcher (#4749)
Add `int` values (for `LENGTH`) in _get_attr_values() instead of
treating `int` like `dict`.
2019-12-06 19:22:57 +01:00
Priscilla de Abreu Lopes 39e79fcc86 Bugfix/dep matcher issue 4590 (#4601)
* add contributor agreement for prilopes

* add test for issue #4590

* fix on_match params for DependencyMacther (#4590)
2019-11-07 12:01:06 +01:00
Ines Montani cfffdba7b1 Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522)
* Implement new API for {Phrase}Matcher.add (backwards-compatible)

* Update docs

* Also update DependencyMatcher.add

* Update internals

* Rewrite tests to use new API

* Add basic check for common mistake

Raise error with suggestion if user likely passed in a pattern instead of a list of patterns

* Fix typo [ci skip]
2019-10-25 22:21:08 +02:00
Sofie Van Landeghem 48886afc78 prevent zero-length mem alloc (#4429)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing

* goldparse init: allocate fields only if doc is not empty

* avoid zero length alloc in saving tokenizer cache

* avoid allocating zero length mem in matcher

* asserts to avoid allocating zero length mem

* fix zero-length allocation in matcher

* bump cymem version

* revert cymem version bump
2019-10-22 16:54:33 +02:00
adrianeboyd d359da9687 Replace Entity/MatchStruct with SpanC (#4459)
* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC
2019-10-18 11:01:47 +02:00
adrianeboyd 275c9ad872 Allow int values in token patterns (#4444)
* Add missing int value option to top-level pattern validation in Matcher

* Adjust existing tests accordingly

* Add new test for valid pattern `{"LENGTH": int}`
2019-10-16 13:40:18 +02:00
Sofie Van Landeghem 7d1efac4eb Fix remove pattern from matcher (#4454)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing

* bugfix in remove matcher + extended unit test
2019-10-16 13:34:58 +02:00
adrianeboyd 98a961a60e Fix PhraseMatcher.remove for overlapping patterns (#4437) 2019-10-14 12:19:51 +02:00
Sofie Van Landeghem da6e0de34f fix attrs field in the matcher (#4423)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing

* ensure attrs is NULL when nr_attr == 0 + several fixes to prevent OOB
2019-10-10 15:20:59 +02:00
Sofie Van Landeghem 5efae495f1 Error when removing a matcher rule that doesn't exist (#4420)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing
2019-10-10 14:01:53 +02:00
Matthew Honnibal fa95c030a5
Unify matcher get_ent_id and get_pattern_key (#4415)
This is basically stabbing blindly at the ghost match problem, but it at
least seems like there was a bug previously here --- so this should
hopefully be an improvement, even if it doesn't fix the ghost match
problem.
2019-10-09 15:26:31 +02:00
adrianeboyd 14841d0aa6 Fix PhraseMatcher callback and add tests (#4399)
* Fix callback lookup in PhraseMatcher (string key rather than hash key)
* Add callback tests for Matcher and PhraseMatcher
2019-10-08 12:07:02 +02:00
Ines Montani 573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
Ines Montani fec9433044 Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373) 2019-10-04 12:18:41 +02:00
adrianeboyd ba5595c764 Fix PhraseMatcher to remember attr on pickling (#4336)
* Fix PhraseMatcher to remember attr on pickling

* Check for attr as int or long
2019-09-29 17:12:33 +02:00
Ines Montani aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
adrianeboyd c23edf302b Replace PhraseMatcher with trie-based search (#4309)
* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Store docs internally only as attr lists

* Reduces size for pickle

* Remove duplicate keywords store

Now that docs are stored as lists of attr hashes, there's no need to
have the duplicate _keywords store.
2019-09-27 16:22:34 +02:00
Ines Montani 52904b7270 Raise if on_match is not callable or None 2019-09-24 23:06:24 +02:00
Sean Löfgren 31c683d87d add return_matches and as_tuples back to Matcher.pipe (#4303)
* add contributor agreement [ci skip]

* add return_matches and as_tuples back to Matcher.pipe
2019-09-18 22:00:33 +02:00
Ines Montani cd90752193 Tidy up and auto-format [ci skip] 2019-08-31 13:39:06 +02:00
adrianeboyd 5feb342f5e Add more token attributes to token pattern schema (#4210)
Add token attributes with tests to token pattern schema.
2019-08-29 12:02:26 +02:00
Sofie Van Landeghem c417c380e3 Matcher ID fixes (#4179)
* allow phrasematcher to link one match to multiple original patterns

* small fix for defining ent_id in the matcher (anti-ghost prevention)

* cleanup

* formatting
2019-08-22 17:17:07 +02:00
Sofie Van Landeghem de272f8b82 adding double match for optional operator at the end (#4166) 2019-08-21 22:46:56 +02:00
Sofie Van Landeghem 7539a4f3a8 use states[q] in while retry loop (#4162) 2019-08-21 21:58:04 +02:00
adrianeboyd 2d17b047e2 Check for is_tagged/is_parsed for Matcher attrs (#4163)
Check for relevant components in the pipeline when Matcher is called,
similar to the checks for PhraseMatcher in #4105.

* keep track of attributes seen in patterns

* when Matcher is called on a Doc, check for is_tagged for LEMMA, TAG,
POS and for is_parsed for DEP
2019-08-21 20:52:36 +02:00