spaCy/spacy/tokens
Matthew Honnibal abf8b16d71
Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, which handles merging spans and will
soon handle splitting tokens as well.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests and applies them
together at the end of the block. This allows retokenization to be
more efficient and much less error-prone.
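
To make the intended usage concrete, here is a minimal, self-contained
sketch (assuming a blank English pipeline; the token indices and entity
label are purely illustrative):

import spacy

nlp = spacy.blank('en')
doc = nlp('I live in New York City')

# The merge is collected inside the block and applied when the block exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:6], attrs={'ent_type': 'GPE'})

print([t.text for t in doc])  # ['I', 'live', 'in', 'New York City']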

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.
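
For character offsets specifically, one hedged alternative (rather than
appending raw records to .merges, whose record format isn't spelled out
here) is to build a Span from the offsets first; the offsets and label
below are made up for illustration:

import spacy

nlp = spacy.blank('en')
doc = nlp('I live in New York City')

# Hypothetical character-offset matches, e.g. from an external annotator.
char_matches = [(10, 23, 'GPE')]  # covers 'New York City'

with doc.retokenize() as retokenizer:
    for start_char, end_char, label in char_matches:
        # char_span() returns None if the offsets don't align to token boundaries.
        span = doc.char_span(start_char, end_char)
        if span is not None:
            retokenizer.merge(span, attrs={'ent_type': label})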

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated call
styles), opens the retokenizer, and makes the single merge.
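
As a sanity check, here is a small sketch of that equivalence, assuming
the spaCy 2.x character-offset form of doc.merge() with keyword
attributes; both documents should end up with the same merged token:

import spacy

nlp = spacy.blank('en')

# Old-style call: character offsets plus keyword attributes.
doc_a = nlp('I live in New York City')
doc_a.merge(10, 23, ent_type='GPE')

# The same merge expressed through the retokenize context manager.
doc_b = nlp('I live in New York City')
with doc_b.retokenize() as retokenizer:
    retokenizer.merge(doc_b[3:6], attrs={'ent_type': 'GPE'})

assert [t.text for t in doc_a] == [t.text for t in doc_b]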

We can later start emitting deprecation warnings on direct calls to doc.merge(),
to migrate people to the retokenize() context manager.
2018-04-03 14:10:35 +02:00
__init__.pxd
__init__.py Tidy up and document Doc, Token and Span 2017-10-27 15:41:45 +02:00
_retokenize.pyx Add doc.retokenize() context manager (#2172) 2018-04-03 14:10:35 +02:00
doc.pxd Add doc.retokenize() context manager (#2172) 2018-04-03 14:10:35 +02:00
doc.pyx Add doc.retokenize() context manager (#2172) 2018-04-03 14:10:35 +02:00
printers.py Tidy up util and helpers 2017-10-27 14:39:09 +02:00
span.pxd
span.pyx Make .similarity() return 1.0 if all orth attrs match 2018-01-15 16:29:48 +01:00
token.pxd fix sent_start in serialization 2018-01-28 19:50:42 +01:00
token.pyx Fix #2073: Token.set_extension not working 2018-03-27 13:36:20 +02:00
underscore.py Tidy up util and helpers 2017-10-27 14:39:09 +02:00