spaCy/spacy
Adriane Boyd a322d6d5f2
Add SpanRuler component (#9880)
* Add SpanRuler component

Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.

* Update spacy/pipeline/span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix cast

* Add self.key property

* Use number of patterns as length

* Remove patterns kwarg from init

* Update spacy/tests/pipeline/test_span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add options for spans filter and setting to ents

* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.

* Update and generalize tests

* Add test for setting doc.ents, fix key property type

* Fix typing

* Allow independent doc.spans and doc.ents

* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
  * Use `util.filter_spans` by default as `ents_filter`.
  * Use a custom warning if the filter does not work for `doc.ents`.

* Enable use of SpanC.id in Span

* Support id in SpanRuler as Span.id

* Update types

* `id` can only be provided as string (already by `PatternType`
definition)

* Update all uses of Span.id/ent_id in Doc

* Rename Span id kwarg to span_id

* Update types and docs

* Add ents filter to mimic EntityRuler overwrite_ents

* Refactor `ents_filter` to take `entities, spans` args for more
  filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
  (`spacy.first_longest_spans_filter.v1`) to take any number of
  `Iterable[Span]` objects as args so it can be used for spans filter
  or ents filter

* Implement future entity ruler as span ruler

Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
  customized. (Necessary for the same sorting/filtering as in
  `EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
  `preserve_existing_ents_filter` to support
  `EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
  `entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
  properties.

Additional changes:

* Move all config settings to top-level attributes to avoid duplicating
  settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
  casting.)

* Format

* Fix filter make method name

* Refactor to use same error for removing by label or ID

* Also provide existing spans to spans filter

* Support ids property

* Remove token_patterns and phrase_patterns

* Update docstrings

* Add span ruler docs

* Fix types

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move sorting into filters

* Check for all tokens in seen tokens in entity ruler filters

* Remove registered sort key

* Set Token.ent_id in a backwards-compatible way in Doc.set_ents

* Remove sort options from API docs

* Update docstrings

* Rename entity ruler filters

* Fix and parameterize scoring

* Add id to Span API docs

* Fix typo in API docs

* Include explicit labeled=True for scorer

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 13:12:53 +02:00
..
cli Simplify GPU check 2022-05-25 14:06:45 +02:00
displacy #10672: fixes displacy output for manual unsorted entities (#10673) 2022-04-27 09:51:58 +02:00
lang Auto-format code with black (#10687) 2022-04-22 11:24:53 +02:00
matcher Better errors for has_annotation and Matcher (#10830) 2022-05-25 11:12:29 +02:00
ml Minor NEL type fixes (#10860) 2022-06-01 00:41:28 +02:00
pipeline Add SpanRuler component (#9880) 2022-06-02 13:12:53 +02:00
tests Add SpanRuler component (#9880) 2022-06-02 13:12:53 +02:00
tokens Add SpanRuler component (#9880) 2022-06-02 13:12:53 +02:00
training fix typo + CI slow testing (#10835) 2022-06-02 00:10:16 +02:00
__init__.pxd
__init__.py Tidy up and auto-format 2021-07-18 15:44:56 +10:00
__main__.py
about.py Set version to v3.3.0 (#10614) 2022-04-28 13:07:49 +02:00
attrs.pxd
attrs.pyx Intify IOB (#9738) 2022-01-20 13:19:38 +01:00
compat.py Custom component types in spacy.ty (#9469) 2021-10-21 15:31:06 +02:00
default_config.cfg Add a few docs to the default_config.cfg (#9981) 2022-01-05 09:16:40 +01:00
default_config_pretraining.cfg Add new parameter for saving every n epoch in pretraining (#8912) 2021-08-12 11:14:48 +02:00
errors.py Add SpanRuler component (#9880) 2022-06-02 13:12:53 +02:00
glossary.py Add glossary entry for root (#10821) 2022-05-20 09:56:32 +02:00
kb.pxd Replace cpdef variables with cdef (#7834) 2021-04-26 16:54:02 +02:00
kb.pyx Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
language.py Ignore overrides for pipe names in config argument (#10779) 2022-05-12 11:46:08 +02:00
lexeme.pxd
lexeme.pyi fix type of lexeme.rank (#9979) 2022-01-04 13:15:25 +01:00
lexeme.pyx Bugfix for similarity return types (#10051) 2022-01-20 11:40:46 +01:00
lookups.py Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786) 2022-05-25 09:33:54 +02:00
morphology.pxd Clean up Morphology imports and definitions (#7441) 2021-04-26 16:54:23 +02:00
morphology.pyx Clean up Morphology imports and definitions (#7441) 2021-04-26 16:54:23 +02:00
parts_of_speech.pxd
parts_of_speech.pyx
pipe_analysis.py 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
py.typed Add py.typed 2021-03-16 09:48:31 +01:00
schemas.py Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786) 2022-05-25 09:33:54 +02:00
scorer.py Alignment: use a simplified ragged type for performance (#10319) 2022-04-01 09:02:06 +02:00
strings.pxd Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
strings.pyi Fix StringStore.__getitem__ return type depending on parameter types (#10741) 2022-05-03 17:57:07 +02:00
strings.pyx Update Cython string types (#9143) 2021-09-13 17:02:17 +02:00
structs.pxd
symbols.pxd
symbols.pyx
tokenizer.pxd Add tokenizer option to allow Matcher handling for all rules (#10452) 2022-03-24 13:21:32 +01:00
tokenizer.pyx Add tokenizer option to allow Matcher handling for all rules (#10452) 2022-03-24 13:21:32 +01:00
ty.py Custom component types in spacy.ty (#9469) 2021-10-21 15:31:06 +02:00
typedefs.pxd
typedefs.pyx
util.py Add SpanRuler component (#9880) 2022-06-02 13:12:53 +02:00
vectors.pyx Save vectors as little endian, load with Ops.asarray (#10201) 2022-03-21 14:24:46 +01:00
vocab.pxd Add support for floret vectors (#8909) 2021-10-27 14:08:31 +02:00
vocab.pyi Add vector deduplication (#10551) 2022-03-30 08:54:23 +02:00
vocab.pyx Add vector deduplication (#10551) 2022-03-30 08:54:23 +02:00