spaCy/spacy/tokens
adrianeboyd aec755d3a3 Modify retokenizer to use span root attributes (#4219)
* Modify retokenizer to use span root attributes

* tag/pos/morph are set to root tag/pos/morph

* lemma and norm are reset and fall back to the merged token's orth (not
ideal, but better than the orth of the first token)
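
As a rough illustration of the two points above, here is a minimal pure-Python sketch. It does not use spaCy: the `Tok` class, `span_root()` and `merge_attrs()` are hypothetical names invented for this example, and joining the merged text with single spaces is a simplifying assumption. The actual logic lives in `_retokenize.pyx`.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    # Hypothetical stand-in for a token; not spaCy's Token class.
    orth: str
    tag: str
    pos: str
    morph: str
    head: int  # index of the syntactic head; a sentence root points to itself


def span_root(tokens, start, end):
    """Return the first token in tokens[start:end] whose head is itself
    or lies outside the span."""
    for i in range(start, end):
        head = tokens[i].head
        if head == i or not (start <= head < end):
            return tokens[i]
    return tokens[start]  # fallback for malformed head indices


def merge_attrs(tokens, start, end):
    """Attributes for the merged token: tag/pos/morph come from the span
    root, while lemma and norm are reset to the merged orth."""
    root = span_root(tokens, start, end)
    # Assumption: tokens are separated by single spaces.
    orth = " ".join(t.orth for t in tokens[start:end])
    return {
        "orth": orth,
        "tag": root.tag,
        "pos": root.pos,
        "morph": root.morph,
        "lemma": orth,
        "norm": orth,
    }


tokens = [
    Tok("a", "DT", "DET", "Definite=Ind", 1),
    Tok("lot", "NN", "NOUN", "Number=Sing", 1),
]
merged = merge_attrs(tokens, 0, 2)
```

Here `merged` takes its tag, pos and morph from `lot` (the span root) rather than from `a` (the first token).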

* Also handle individual merge case

* Add test

* Attempt to handle ent_iob and ent_type in merges

* Fix check for whether B-ENT should become I-ENT

* Move IOB consistency check to after attrs

Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.
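
The whole-document check described above can be sketched roughly as follows. This is a plain-Python illustration, not the Cython code from `_retokenize.pyx`: the function name `fix_iob()` and the `(ent_iob, ent_type)` tuple representation are assumptions made for the example.

```python
def fix_iob(tokens):
    """Return a copy of `tokens`, a list of (ent_iob, ent_type) pairs,
    with inconsistent "I" tags changed to "B".

    An "I" tag becomes "B" if it is at the beginning of the document or
    if the previous token's entity type differs from its own.
    """
    fixed = []
    for i, (iob, ent_type) in enumerate(tokens):
        if iob == "I":
            # Tokens outside any entity carry an empty type in this sketch,
            # so an "I" directly after an "O" also becomes "B".
            if i == 0 or fixed[i - 1][1] != ent_type:
                iob = "B"
        fixed.append((iob, ent_type))
    return fixed


doc = [("I", "ORG"), ("I", "ORG"), ("O", ""), ("I", "GPE"), ("I", "PERSON")]
result = fix_iob(doc)
# result == [("B", "ORG"), ("I", "ORG"), ("O", ""), ("B", "GPE"), ("B", "PERSON")]
```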

* Move IOB consistency check for single merge

Move IOB consistency check after the token array is compressed for the
single merge case.

* Update spacy/tokens/_retokenize.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

* Remove single vs. multiple merge distinction

Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.

* Add out-of-bound check in previous entity check
2019-09-08 13:04:49 +02:00
__init__.pxd
__init__.py Tidy up and improve docs and docstrings (#3370) 2019-03-08 11:42:26 +01:00
_retokenize.pyx Modify retokenizer to use span root attributes (#4219) 2019-09-08 13:04:49 +02:00
_serialize.py Reformat 2019-07-11 11:49:36 +02:00
doc.pxd cleanup 2019-07-11 13:09:22 +02:00
doc.pyx Serialize POS attribute when doc.is_tagged (#4092) 2019-08-21 21:59:30 +02:00
span.pxd annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
span.pyx Add span.tensor and token.tensor attributes 2019-08-01 18:30:50 +02:00
token.pxd ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
token.pyx Add span.tensor and token.tensor attributes 2019-08-01 18:30:50 +02:00
underscore.py 💫 Improve introspection of custom extension attributes (#3729) 2019-05-12 00:53:11 +02:00