spaCy

Adriane Boyd fa79a0db9f Add AttributeRuler for token attribute exceptions (#5842)

* Add AttributeRuler for token attribute exceptions

  Add the `AttributeRuler` to handle exceptions for token-level attributes. The `AttributeRuler` uses `Matcher` patterns to identify target spans and applies the specified attributes to the token at the provided index in the matched span. A negative index can be used to index from the end of the matched span. The retokenizer is used to "merge" the individual tokens and assign them the provided attributes.

  Helper functions can import existing tag maps and morph rules to the corresponding `Matcher` patterns.

  There is an additional minor bug fix for `MORPH` attributes in the retokenizer to correctly normalize the values and to handle `MORPH` alongside `_` in an attrs dict.

* Fix default name
* Update name in error message
* Extend AttributeRuler functionality
  * Add option to initialize with a dict of AttributeRuler patterns
  * Instead of silently discarding overlapping matches (the default behavior for the retokenizer if only the attrs differ), split the matches into disjoint sets and retokenize each set separately. This allows, for instance, one pattern to set the POS and another pattern to set the lemma. (If two matches modify the same attribute, it looks like the attrs are applied in the order they were added, but it may not be deterministic?)
* Improve types
* Sort spans before processing
* Fix index boundaries in Span
* Refactor retokenizer to separate attrs methods

  Add top-level `normalize_token_attrs` and `set_token_attrs` methods.

* Update AttributeRuler to use refactored methods

  Update `AttributeRuler` to replace use of full retokenizer with only the relevant methods for normalizing and setting attributes for a single token.

* Update spacy/pipeline/attributeruler.py

  Co-authored-by: Ines Montani <ines@ines.io>

* Make API more similar to EntityRuler
  * Add `AttributeRuler.add_patterns` to add patterns from a list of dicts
  * Return list of dicts as property `AttributeRuler.patterns`
* Make attrs_unnormed private
* Add test loading patterns from assets
* Revert "Fix index boundaries in Span"

  This reverts commit `8f8a5c3386`.

* Add Span index boundary checks (#5861)
  * Add Span index boundary checks
  * Return Span-specific IndexError in all cases
  * Simplify and fix if/else

Co-authored-by: Ines Montani <ines@ines.io>

2020-08-04 17:02:39 +02:00
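The behaviour this commit message describes (match a span, then set attributes on the token at a given index, with negative indices counting from the end) can be illustrated with a small pure-Python sketch. It does not use spaCy and is not the `AttributeRuler` implementation: the dict-based tokens and the names `find_matches` and `apply_rules` are hypothetical stand-ins chosen for illustration, and the matching is a simplified exact-equality check, not the real `Matcher`.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-ins: a token is a plain dict of attributes, and a
# pattern is a list of dicts, one per token, in the spirit of Matcher patterns.
Token = Dict[str, Any]
Pattern = List[Dict[str, Any]]
Rule = Tuple[Pattern, Dict[str, Any], int]  # (pattern, attrs, index)


def find_matches(tokens: List[Token], pattern: Pattern) -> List[Tuple[int, int]]:
    """Return (start, end) offsets where every pattern dict equals the token's attrs."""
    size = len(pattern)
    matches = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(
            all(token.get(key) == value for key, value in spec.items())
            for token, spec in zip(window, pattern)
        ):
            matches.append((start, start + size))
    return matches


def apply_rules(tokens: List[Token], rules: List[Rule]) -> List[Token]:
    """Apply each rule's attrs to the token at `index` within each matched span."""
    for pattern, attrs, index in rules:
        for start, end in find_matches(tokens, pattern):
            span = tokens[start:end]
            # A negative index counts from the end of the matched span;
            # an out-of-range index raises IndexError.
            span[index].update(attrs)
    return tokens


tokens = [{"ORTH": "I"}, {"ORTH": "do"}, {"ORTH": "n't"}, {"ORTH": "know"}]
rules = [
    # Two-token pattern; index -1 targets the last token of the match.
    ([{"ORTH": "do"}, {"ORTH": "n't"}], {"LEMMA": "not"}, -1),
    # Overlapping single-token pattern that sets a different attribute.
    ([{"ORTH": "n't"}], {"POS": "PART"}, 0),
]
apply_rules(tokens, rules)
print(tokens[2])
```

Because both rules are applied in turn and nothing is discarded, the two overlapping matches each contribute an attribute to the same token, which mirrors the "one pattern sets the POS and another sets the lemma" case mentioned in the commit message.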
..
cli	Create corpus iterator and batcher from registry during training (#5865)	2020-08-04 15:09:37 +02:00
displacy	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
gold	Prevent alignment when texts don't match (#5867)	2020-08-04 16:29:18 +02:00
lang	Tidy up [ci skip]	2020-07-25 13:00:49 +02:00
matcher	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-29 11:36:45 +02:00
ml	Default empty KB in EL component (#5872)	2020-08-04 14:34:09 +02:00
pipeline	Add AttributeRuler for token attribute exceptions (#5842)	2020-08-04 17:02:39 +02:00
tests	Add AttributeRuler for token attribute exceptions (#5842)	2020-08-04 17:02:39 +02:00
tokens	Add AttributeRuler for token attribute exceptions (#5842)	2020-08-04 17:02:39 +02:00
__init__.pxd	…
__init__.py	Tidy up __init__.py	2020-07-25 12:14:37 +02:00
__main__.py	Tidy up	2020-06-22 00:45:40 +02:00
about.py	Set version to v3.0.0a5	2020-07-25 14:06:01 +02:00
attrs.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
attrs.pyx	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
compat.py	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
default_config.cfg	Create corpus iterator and batcher from registry during training (#5865)	2020-08-04 15:09:37 +02:00
errors.py	Add AttributeRuler for token attribute exceptions (#5842)	2020-08-04 17:02:39 +02:00
glossary.py	unicode -> str consistency	2020-05-24 17:20:58 +02:00
kb.pxd	Tidy up and avoid absolute spacy imports in core	2020-05-21 20:05:03 +02:00
kb.pyx	Default empty KB in EL component (#5872)	2020-08-04 14:34:09 +02:00
language.py	Simplify pipe analysis	2020-08-01 13:40:06 +02:00
lemmatizer.py	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
lexeme.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
lexeme.pyx	WIP: move more language data to config	2020-07-22 15:59:37 +02:00
lookups.py	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
morphology.pxd	Update Morphology to load exceptions as MORPH_RULES	2020-07-16 21:16:49 +02:00
morphology.pyx	Minor refactor for Morphology and MorphAnalysis (#5804)	2020-07-24 09:28:06 +02:00
parts_of_speech.pxd	…
parts_of_speech.pyx	…
pipe_analysis.py	Simplify pipe analysis	2020-08-01 13:40:06 +02:00
schemas.py	Create corpus iterator and batcher from registry during training (#5865 )	2020-08-04 15:09:37 +02:00
scorer.py	Merge branch 'develop' into feature/scorer-adjustments	2020-07-31 10:48:14 +02:00
strings.pxd	…
strings.pyx	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
structs.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
symbols.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
symbols.pyx	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
tokenizer.pxd	Remove dead and/or deprecated code (#5710)	2020-07-06 13:06:25 +02:00
tokenizer.pyx	Update docs and consistency	2020-07-29 15:14:07 +02:00
typedefs.pxd	…
typedefs.pyx	…
util.py	Create corpus iterator and batcher from registry during training (#5865)	2020-08-04 15:09:37 +02:00
vectors.pyx	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
vocab.pxd	Tidy up and move noun_chunks, token_match, url_match	2020-07-22 22:18:46 +02:00
vocab.pyx	Merge pull request #5834 from explosion/feature/vectors	2020-07-29 18:49:26 +02:00