spaCy/spacy
Matthew Honnibal abf8b16d71
Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.
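
The accumulate-then-apply behaviour can be sketched with a toy model. This is
not spaCy's implementation: ToyDoc and ToyRetokenizer are hypothetical
stand-ins, the document is just a list of strings, and merge() takes plain
offsets rather than a Span. It only illustrates why queuing requests and
applying them together at block exit keeps the indices stable.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in: queues merge requests instead of applying them."""

    def __init__(self):
        self.merges = []  # queued (start, end, attrs) requests

    def merge(self, start, end, attrs=None):
        # Nothing is modified yet; the request is only recorded.
        self.merges.append((start, end, attrs or {}))

    def apply(self, doc):
        # Apply right to left so that earlier offsets remain valid.
        # Assumes the queued spans do not overlap.
        for start, end, attrs in sorted(self.merges, key=lambda m: m[0], reverse=True):
            doc.words[start:end] = [" ".join(doc.words[start:end])]
            doc.ent_types[start:end] = [attrs.get("ent_type", "")]
        self.merges = []


class ToyDoc:
    """Hypothetical stand-in for a Doc: parallel lists of words and labels."""

    def __init__(self, words):
        self.words = list(words)
        self.ent_types = [""] * len(self.words)

    @contextmanager
    def retokenize(self):
        retokenizer = ToyRetokenizer()
        yield retokenizer
        # Reached only when the with-block exits without an exception.
        retokenizer.apply(self)


doc = ToyDoc(["New", "York", "is", "in", "New", "York", "State"])
matches = [(0, 2, "GPE"), (4, 7, "GPE")]

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(start, end, attrs={"ent_type": label})
    # Inside the block the offsets from `matches` still refer to the
    # original tokens, because nothing has been merged yet.
    words_inside_block = len(doc.words)

print(doc.words)
print(doc.ent_types)
```

Because every request is expressed against the original tokenization, the
caller never has to adjust offsets after each merge, which is the error-prone
part of merging one span at a time.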

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start emitting deprecation warnings on direct calls to doc.merge(),
to migrate people to the retokenize context manager.
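
One way such a migration path might look, as a minimal sketch using only the
standard library. The names merge_tokens and legacy_merge are hypothetical and
do not exist in spaCy; the point is only that the old entry point keeps
working, delegates to the new code path, and warns the caller.

```python
import warnings


def merge_tokens(words, start, end):
    """Hypothetical new-style helper: merge words[start:end] into one token."""
    return words[:start] + [" ".join(words[start:end])] + words[end:]


def legacy_merge(words, start, end):
    """Hypothetical old-style entry point: same result, plus a warning."""
    warnings.warn(
        "legacy_merge() is deprecated; use the retokenize-style API instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return merge_tokens(words, start, end)


# DeprecationWarning is often hidden by default filters, so record it explicitly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = legacy_merge(["New", "York", "City"], 0, 2)

print(result)
print([str(w.message) for w in caught])
```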
2018-04-03 14:10:35 +02:00
..
cli Merge pull request #2152 from explosion/feature/tidy-up-dependencies 2018-03-29 14:35:09 +02:00
data Make spacy/data a package 2017-03-18 20:04:22 +01:00
displacy Don't use deprecated Doc.merge call in displaCy 2018-01-27 11:25:05 +01:00
lang Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155) 2018-03-29 12:19:51 +02:00
syntax Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660) 2018-03-28 23:08:24 +02:00
tests [2032] - Changed python set to cpp stl set (#2170) 2018-03-31 13:28:25 +02:00
tokens Add doc.retokenize() context manager (#2172) 2018-04-03 14:10:35 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Remove dummy variable from function calls 2018-01-05 09:37:05 +01:00
__main__.py Don't pass CLI command name as dummy argument 2018-01-04 21:33:47 +01:00
_ml.py Dont assume pretrained_vectors cfg set in build_tagger 2018-03-28 20:12:45 +02:00
about.py Set version to v2.0.10 2018-03-24 18:09:03 +01:00
attrs.pxd Fix LANG symbol 2018-02-17 18:10:50 +01:00
attrs.pyx missing PrepCase attribute 2018-02-18 14:46:12 +00:00
compat.py Fix urllib for Python 3 2018-03-29 00:19:33 +02:00
glossary.py Fix typo in glossary (resolves #1964) 2018-02-10 11:58:41 +01:00
gold.pxd Add support for sent_start to GoldParse 2017-08-25 20:03:14 -05:00
gold.pyx Add offsets_from_biluo_tags helper and tests (see #1626) 2017-11-26 16:38:01 +01:00
language.py Fix syntax error 2018-03-29 21:50:32 +02:00
lemmatizer.py If no rules are set, lemmatize by lookup 2017-12-06 12:12:11 +01:00
lexeme.pxd WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
lexeme.pyx added new lexical feat to lexeme 2018-02-11 18:51:48 +01:00
matcher.pyx Add output options return_matches and as_tuples to Matcher 2018-02-18 14:00:45 +01:00
morphology.pxd fix typo/missing here too 2018-02-18 14:38:27 +00:00
morphology.pyx fix typo/missing here too 2018-02-18 14:38:27 +00:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
pipeline.pxd Fix names of pipeline components 2017-10-26 12:38:23 +02:00
pipeline.pyx Merge pull request #2152 from explosion/feature/tidy-up-dependencies 2018-03-29 14:35:09 +02:00
scorer.py Tidy up rest 2017-10-27 21:07:59 +02:00
strings.pxd Try to fix StringStore clean up (see #1506) 2017-11-11 03:11:27 +03:00
strings.pyx Use safer method to get string without hit 2017-11-14 22:58:46 +03:00
structs.pxd Make TokenC.sent_tart an int, to allow ternary value 2017-10-08 19:58:54 +02:00
symbols.pxd Fix inconsistencies in the symbols table 2018-02-18 13:51:31 +01:00
symbols.pyx Fix inconsistencies in the symbols table 2018-02-18 13:51:31 +01:00
tokenizer.pxd Disable tokenizer cache for special-cases. Fixes #1250 2017-10-24 16:08:05 +02:00
tokenizer.pyx Merge pull request #1611 from fsonntag/master 2017-11-29 23:11:23 +01:00
typedefs.pxd Work on changing StringStore to return hashes. 2017-05-28 12:36:27 +02:00
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts 2018-03-29 00:14:55 +02:00
vectors.pyx [2032] - Changed python set to cpp stl set (#2170) 2018-03-31 13:28:25 +02:00
vocab.pxd Add Vocab.cfg attr, to hold stuff like oov probs 2017-10-30 16:08:50 +01:00
vocab.pyx Fix loading of multiple pre-trained vectors 2018-03-28 16:02:59 +02:00