spaCy

Adriane Boyd c62fd878a3 Allow Doc.char_span to snap to token boundaries (#5849)	2020-08-04 13:36:32 +02:00

* Allow Doc.char_span to snap to token boundaries

  Add a `mode` option to allow `Doc.char_span` to snap to token boundaries.
  The `mode` options:

  * `strict`: character offsets must match token boundaries (default, same as before)
  * `inside`: all tokens completely within the character span
  * `outside`: all tokens at least partially covered by the character span

  Add a new helper function `token_by_char` that returns the token corresponding to a character position in the text. Update `token_by_start` and `token_by_end` to use `token_by_char` for more efficient searching.

* Remove unused import

* Rename mode to alignment_mode

  Rename `mode` to `alignment_mode` with the options `strict`/`contract`/`expand`. Any unrecognized modes are silently converted to `strict`.
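The three alignment modes named in the commit message can be illustrated with a small, dependency-free sketch. This is not spaCy's implementation: `token_offsets` and `snap_char_span` are hypothetical helpers, tokens are approximated by whitespace splitting, and the result is a pair of token indices rather than a `Span`. The fallback of unrecognized modes to `strict` follows the commit message.

```python
import re
from bisect import bisect_left, bisect_right


def token_offsets(text):
    """Hypothetical tokenizer: (start_char, end_char) for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def snap_char_span(tokens, start, end, alignment_mode="strict"):
    """Map a character span to a (first_token, end_token_exclusive) pair, or None.

    tokens must be sorted, non-overlapping (start_char, end_char) pairs.
    """
    # Per the commit message, unrecognized modes are treated as "strict".
    if alignment_mode not in ("strict", "contract", "expand"):
        alignment_mode = "strict"
    starts = [s for s, _ in tokens]
    ends = [e for _, e in tokens]

    if alignment_mode == "strict":
        # Both offsets must coincide exactly with token boundaries.
        i = bisect_left(starts, start)
        j = bisect_left(ends, end)
        if (i < len(tokens) and starts[i] == start
                and j < len(tokens) and ends[j] == end and i <= j):
            return (i, j + 1)
        return None

    if alignment_mode == "contract":
        # Keep only tokens that lie completely inside [start, end).
        i = bisect_left(starts, start)   # first token starting at or after start
        j = bisect_right(ends, end)      # count of tokens ending at or before end
    else:
        # "expand": keep every token at least partially covered by [start, end).
        i = bisect_right(ends, start)    # first token ending after start
        j = bisect_left(starts, end)     # count of tokens starting before end

    return (i, j) if i < j else None


tokens = token_offsets("New York City is big")
# Characters 5..13 cover "ork City", which starts in the middle of "York".
strict_result = snap_char_span(tokens, 5, 13, "strict")      # no exact match
contract_result = snap_char_span(tokens, 5, 13, "contract")  # only "City"
expand_result = snap_char_span(tokens, 5, 13, "expand")      # "York City"
```

Using binary search over the sorted token offsets mirrors the intent of the "more efficient searching" note in the commit message, although the actual `token_by_char` helper may work differently.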
__init__.pxd	* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx	2015-07-13 20:20:58 +02:00
__init__.py	DocPallet -> DocBin	2019-09-18 15:15:37 +02:00
_retokenize.pyx	Disallow merging 0-length spans	2020-05-22 10:14:34 +02:00
_serialize.py	Include Doc.cats in serialization of Doc and DocBin (#4774)	2019-12-06 14:07:39 +01:00
doc.pxd	Normalize TokenC.sent_start values for Matcher (#5346)	2020-04-29 12:57:30 +02:00
doc.pyx	Allow Doc.char_span to snap to token boundaries (#5849)	2020-08-04 13:36:32 +02:00
morphanalysis.pxd	Add header for morphanalysis	2019-03-07 17:24:57 +01:00
morphanalysis.pyx	Remove MorphAnalysis __str__ and __repr__	2020-05-29 14:33:47 +02:00
span.pxd	annotate kb_id through ents in doc	2019-03-22 11:36:44 +01:00
span.pyx	Add Span index boundary checks (#5861)	2020-08-04 13:35:25 +02:00
token.pxd	serialize ENT_ID (#4852)	2020-01-06 14:57:34 +01:00
token.pyx	Fix polarity of Token.is_oov and Lexeme.is_oov (#5634)	2020-06-23 13:29:51 +02:00
underscore.py	load Underscore state when multiprocessing	2020-02-12 11:50:42 +01:00