spaCy/spacy/tokens
Matthew Honnibal abf8b16d71
Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, which handles merging spans and will
soon handle splitting tokens as well.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests and applies them
together at the end of the block. This allows retokenization to be
more efficient and much less error-prone.
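
To make the intended usage concrete, here is a minimal, self-contained
sketch (assuming a blank English pipeline; the token indices and entity
label are purely illustrative):

import spacy

nlp = spacy.blank('en')
doc = nlp('I live in New York City')

# The merge is collected inside the block and applied when the block exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:6], attrs={'ent_type': 'GPE'})

print([t.text for t in doc])  # ['I', 'live', 'in', 'New York City']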

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.
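
For character offsets specifically, one hedged alternative (rather than
appending raw records to .merges, whose record format isn't spelled out
here) is to build a Span from the offsets first; the offsets and label
below are made up for illustration:

import spacy

nlp = spacy.blank('en')
doc = nlp('I live in New York City')

# Hypothetical character-offset matches, e.g. from an external annotator.
char_matches = [(10, 23, 'GPE')]  # covers 'New York City'

with doc.retokenize() as retokenizer:
    for start_char, end_char, label in char_matches:
        # char_span() returns None if the offsets don't align to token boundaries.
        span = doc.char_span(start_char, end_char)
        if span is not None:
            retokenizer.merge(span, attrs={'ent_type': label})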

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated call
styles), opens the retokenizer, and makes the single merge.
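
As a sanity check, here is a small sketch of that equivalence, assuming
the spaCy 2.x character-offset form of doc.merge() with keyword
attributes; both documents should end up with the same merged token:

import spacy

nlp = spacy.blank('en')

# Old-style call: character offsets plus keyword attributes.
doc_a = nlp('I live in New York City')
doc_a.merge(10, 23, ent_type='GPE')

# The same merge expressed through the retokenize context manager.
doc_b = nlp('I live in New York City')
with doc_b.retokenize() as retokenizer:
    retokenizer.merge(doc_b[3:6], attrs={'ent_type': 'GPE'})

assert [t.text for t in doc_a] == [t.text for t in doc_b]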

We can later start emitting deprecation warnings on direct calls to doc.merge(),
to migrate people to the retokenize() context manager.
2018-04-03 14:10:35 +02:00
__init__.pxd
__init__.py Tidy up and document Doc, Token and Span 2017-10-27 15:41:45 +02:00
_retokenize.pyx Add doc.retokenize() context manager (#2172) 2018-04-03 14:10:35 +02:00
doc.pxd Add doc.retokenize() context manager (#2172) 2018-04-03 14:10:35 +02:00
doc.pyx Add doc.retokenize() context manager (#2172) 2018-04-03 14:10:35 +02:00
printers.py Tidy up util and helpers 2017-10-27 14:39:09 +02:00
span.pxd
span.pyx Make .similarity() return 1.0 if all orth attrs match 2018-01-15 16:29:48 +01:00
token.pxd fix sent_start in serialization 2018-01-28 19:50:42 +01:00
token.pyx Fix #2073: Token.set_extension not working 2018-03-27 13:36:20 +02:00
underscore.py Tidy up util and helpers 2017-10-27 14:39:09 +02:00