spaCy/spacy
Adriane Boyd fa79a0db9f
Add AttributeRuler for token attribute exceptions (#5842)
* Add AttributeRuler for token attribute exceptions

Add the `AttributeRuler` to handle exceptions for token-level
attributes. The `AttributeRuler` uses `Matcher` patterns to identify
target spans and applies the specified attributes to the token at the
provided index in the matched span. A negative index can be used to
index from the end of the matched span. The retokenizer is used to
"merge" the individual tokens and assign them the provided attributes.
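
As a rough illustration of the index semantics described above, the sketch below models tokens as plain dicts and applies attributes to one token inside a matched span. It is not spaCy code: `apply_attrs`, the dict-based tokens, and the match offsets are hypothetical stand-ins chosen only to show how a negative index counts from the end of the match.

```python
from typing import Dict, List


def apply_attrs(
    tokens: List[Dict[str, str]],
    start: int,
    end: int,
    index: int,
    attrs: Dict[str, str],
) -> None:
    """Set attrs on the token at `index` within the match tokens[start:end].

    Hypothetical helper: a toy stand-in for the behaviour described in the
    commit message, not the AttributeRuler implementation.
    """
    span = tokens[start:end]  # shallow copy, the token dicts are shared
    target = span[index]      # a negative index counts from the end of the span
    target.update(attrs)


tokens = [{"text": t} for t in "I saw the two cats".split()]

# Assume a pattern matched "the two cats", i.e. tokens 2 to 4 (end exclusive).
apply_attrs(tokens, 2, 5, -1, {"TAG": "NNS"})  # last token of the match
apply_attrs(tokens, 2, 5, 0, {"TAG": "DT"})    # first token of the match
```

Only the addressed token changes in each call; the middle token of the match is left untouched.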

Helper functions can convert existing tag maps and morph rules into the
corresponding `Matcher` patterns.

There is an additional minor bug fix for `MORPH` attributes in the
retokenizer to correctly normalize the values and to handle `MORPH`
alongside `_` in an attrs dict.

* Fix default name

* Update name in error message

* Extend AttributeRuler functionality

* Add option to initialize with a dict of AttributeRuler patterns

* Instead of silently discarding overlapping matches (the default
behavior for the retokenizer if only the attrs differ), split the
matches into disjoint sets and retokenize each set separately. This
allows, for instance, one pattern to set the POS and another pattern to
set the lemma. (If two matches modify the same attribute, the attrs appear
to be applied in the order the patterns were added, but this may not be
deterministic.)
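
One way to picture the splitting step is the greedy partition below. This is a sketch under assumptions, not the code from this commit: `split_into_disjoint_sets` is a hypothetical name, and spans are modelled as `(start, end)` tuples with an exclusive end.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), end exclusive


def split_into_disjoint_sets(spans: List[Span]) -> List[List[Span]]:
    """Partition spans into groups so that no two spans in a group overlap.

    Hypothetical helper illustrating the idea; the actual implementation
    may differ.
    """
    groups: List[List[Span]] = []
    for span in sorted(spans):
        for group in groups:
            # Spans arrive sorted by start, so within a group the last span
            # has the largest end and is the only one that can overlap.
            if group[-1][1] <= span[0]:
                group.append(span)
                break
        else:
            # Overlaps with every existing group: start a new one.
            groups.append([span])
    return groups


# Two patterns matching the same token end up in separate groups, so each
# group can be processed without discarding the other match.
same_token = split_into_disjoint_sets([(0, 1), (0, 1)])
mixed = split_into_disjoint_sets([(0, 2), (1, 3), (2, 4), (0, 1)])
```

Each resulting group can then be handled on its own, which is what lets one pattern set the POS and another set the lemma on the same token.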

* Improve types

* Sort spans before processing

* Fix index boundaries in Span

* Refactor retokenizer to separate attrs methods

Add top-level `normalize_token_attrs` and `set_token_attrs` methods.

* Update AttributeRuler to use refactored methods

Update `AttributeRuler` to replace use of full retokenizer with only the
relevant methods for normalizing and setting attributes for a single
token.

* Update spacy/pipeline/attributeruler.py

Co-authored-by: Ines Montani <ines@ines.io>

* Make API more similar to EntityRuler

* Add `AttributeRuler.add_patterns` to add patterns from a list of dicts
* Return list of dicts as property `AttributeRuler.patterns`

* Make attrs_unnormed private

* Add test loading patterns from assets

* Revert "Fix index boundaries in Span"

This reverts commit 8f8a5c3386.

* Add Span index boundary checks (#5861)

* Add Span index boundary checks

* Return Span-specific IndexError in all cases

* Simplify and fix if/else
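
A boundary check of this kind can be sketched as follows. The function name and the error text are assumptions made for illustration; the actual check in `Span` and its error message are not reproduced here.

```python
def normalize_span_index(i: int, length: int) -> int:
    """Resolve a possibly negative index against a span of `length` tokens.

    Hypothetical helper: raises IndexError for any out-of-range index,
    whether it was given as a positive or a negative number.
    """
    resolved = i + length if i < 0 else i
    if resolved < 0 or resolved >= length:
        raise IndexError(f"span index {i} out of range for length {length}")
    return resolved


last = normalize_span_index(-1, 3)   # resolves to the final token
first = normalize_span_index(0, 3)
```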

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 17:02:39 +02:00
cli Create corpus iterator and batcher from registry during training (#5865) 2020-08-04 15:09:37 +02:00
displacy Tidy up, autoformat, add types 2020-07-25 15:01:15 +02:00
gold Prevent alignment when texts don't match (#5867) 2020-08-04 16:29:18 +02:00
lang Tidy up [ci skip] 2020-07-25 13:00:49 +02:00
matcher Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-29 11:36:45 +02:00
ml Default empty KB in EL component (#5872) 2020-08-04 14:34:09 +02:00
pipeline Add AttributeRuler for token attribute exceptions (#5842) 2020-08-04 17:02:39 +02:00
tests Add AttributeRuler for token attribute exceptions (#5842) 2020-08-04 17:02:39 +02:00
tokens Add AttributeRuler for token attribute exceptions (#5842) 2020-08-04 17:02:39 +02:00
__init__.pxd
__init__.py Tidy up __init__.py 2020-07-25 12:14:37 +02:00
__main__.py Tidy up 2020-06-22 00:45:40 +02:00
about.py Set version to v3.0.0a5 2020-07-25 14:06:01 +02:00
attrs.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
attrs.pyx Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
compat.py Tidy up, autoformat, add types 2020-07-25 15:01:15 +02:00
default_config.cfg Create corpus iterator and batcher from registry during training (#5865) 2020-08-04 15:09:37 +02:00
errors.py Add AttributeRuler for token attribute exceptions (#5842) 2020-08-04 17:02:39 +02:00
glossary.py unicode -> str consistency 2020-05-24 17:20:58 +02:00
kb.pxd Tidy up and avoid absolute spacy imports in core 2020-05-21 20:05:03 +02:00
kb.pyx Default empty KB in EL component (#5872) 2020-08-04 14:34:09 +02:00
language.py Simplify pipe analysis 2020-08-01 13:40:06 +02:00
lemmatizer.py Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
lexeme.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
lexeme.pyx WIP: move more language data to config 2020-07-22 15:59:37 +02:00
lookups.py Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
morphology.pxd Update Morphology to load exceptions as MORPH_RULES 2020-07-16 21:16:49 +02:00
morphology.pyx Minor refactor for Morphology and MorphAnalysis (#5804) 2020-07-24 09:28:06 +02:00
parts_of_speech.pxd
parts_of_speech.pyx
pipe_analysis.py Simplify pipe analysis 2020-08-01 13:40:06 +02:00
schemas.py Create corpus iterator and batcher from registry during training (#5865) 2020-08-04 15:09:37 +02:00
scorer.py Merge branch 'develop' into feature/scorer-adjustments 2020-07-31 10:48:14 +02:00
strings.pxd
strings.pyx Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
structs.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
symbols.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
symbols.pyx Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
tokenizer.pxd Remove dead and/or deprecated code (#5710) 2020-07-06 13:06:25 +02:00
tokenizer.pyx Update docs and consistency 2020-07-29 15:14:07 +02:00
typedefs.pxd
typedefs.pyx
util.py Create corpus iterator and batcher from registry during training (#5865) 2020-08-04 15:09:37 +02:00
vectors.pyx Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
vocab.pxd Tidy up and move noun_chunks, token_match, url_match 2020-07-22 22:18:46 +02:00
vocab.pyx Merge pull request #5834 from explosion/feature/vectors 2020-07-29 18:49:26 +02:00