spaCy

History

Matthew Honnibal abf8b16d71 Add doc.retokenize() context manager (#2172)	2018-04-03 14:10:35 +02:00

This patch takes a step towards #1487 by introducing the doc.retokenize() context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this:

    with doc.retokenize() as retokenizer:
        for start, end, label in matches:
            retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them together at the end of the block. This will allow retokenization to be more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, doc.merge() fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge.

We can later start making deprecation warnings on direct calls to doc.merge(), to migrate people to use of the retokenize context manager.
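The accumulate-then-apply behaviour described in the commit message can be illustrated with a small pure-Python sketch. This is not spaCy's implementation: `ToyDoc` and `ToyRetokenizer` are hypothetical stand-ins, tokens are plain dicts, and merges are requested by integer offsets rather than `Span` objects. It only shows why deferring the merges to the end of the `with` block keeps the indices used inside the block stable.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in: only records merge requests, applies nothing."""

    def __init__(self):
        self.merges = []  # list of (start, end, attrs) tuples

    def merge(self, start, end, attrs=None):
        self.merges.append((start, end, dict(attrs or {})))


class ToyDoc:
    """Hypothetical stand-in for a document: a list of token dicts."""

    def __init__(self, words):
        self.tokens = [{"text": w} for w in words]

    @contextmanager
    def retokenize(self):
        retokenizer = ToyRetokenizer()
        yield retokenizer
        # Apply all queued merges when the block exits, right to left,
        # so that merging a later span never shifts an earlier span's offsets.
        queued = sorted(retokenizer.merges, key=lambda m: m[0], reverse=True)
        for start, end, attrs in queued:
            text = " ".join(t["text"] for t in self.tokens[start:end])
            self.tokens[start:end] = [{"text": text, **attrs}]


doc = ToyDoc(["New", "York", "is", "in", "New", "York", "State"])
with doc.retokenize() as retokenizer:
    # Offsets refer to the original tokenization throughout the block.
    retokenizer.merge(0, 2, attrs={"ent_type": "GPE"})
    retokenizer.merge(4, 7, attrs={"ent_type": "GPE"})

texts = [t["text"] for t in doc.tokens]
```

If each merge were applied immediately instead, the second call's offsets (4 to 7) would already be stale after the first merge shortened the token list, which is the kind of error the batched design is meant to avoid.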
cli	Merge pull request #2152 from explosion/feature/tidy-up-dependencies	2018-03-29 14:35:09 +02:00
data	Make spacy/data a package	2017-03-18 20:04:22 +01:00
displacy	Don't use deprecated Doc.merge call in displaCy	2018-01-27 11:25:05 +01:00
lang	Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155)	2018-03-29 12:19:51 +02:00
syntax	Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660)	2018-03-28 23:08:24 +02:00
tests	[2032] - Changed python set to cpp stl set (#2170)	2018-03-31 13:28:25 +02:00
tokens	Add doc.retokenize() context manager (#2172)	2018-04-03 14:10:35 +02:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Remove dummy variable from function calls	2018-01-05 09:37:05 +01:00
__main__.py	Don't pass CLI command name as dummy argument	2018-01-04 21:33:47 +01:00
_ml.py	Dont assume pretrained_vectors cfg set in build_tagger	2018-03-28 20:12:45 +02:00
about.py	Set version to v2.0.10	2018-03-24 18:09:03 +01:00
attrs.pxd	Fix LANG symbol	2018-02-17 18:10:50 +01:00
attrs.pyx	missing PrepCase attribute	2018-02-18 14:46:12 +00:00
compat.py	Fix urllib for Python 3	2018-03-29 00:19:33 +02:00
glossary.py	Fix typo in glossary (resolves #1964)	2018-02-10 11:58:41 +01:00
gold.pxd	Add support for sent_start to GoldParse	2017-08-25 20:03:14 -05:00
gold.pyx	Add offsets_from_biluo_tags helper and tests (see #1626)	2017-11-26 16:38:01 +01:00
language.py	Fix syntax error	2018-03-29 21:50:32 +02:00
lemmatizer.py	If no rules are set, lemmatize by lookup	2017-12-06 12:12:11 +01:00
lexeme.pxd	WIP on stringstore change. 27 failures	2017-05-28 14:06:40 +02:00
lexeme.pyx	added new lexical feat to lexeme	2018-02-11 18:51:48 +01:00
matcher.pyx	Add output options return_matches and as_tuples to Matcher	2018-02-18 14:00:45 +01:00
morphology.pxd	fix typo/missing here too	2018-02-18 14:38:27 +00:00
morphology.pyx	fix typo/missing here too	2018-02-18 14:38:27 +00:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
pipeline.pxd	Fix names of pipeline components	2017-10-26 12:38:23 +02:00
pipeline.pyx	Merge pull request #2152 from explosion/feature/tidy-up-dependencies	2018-03-29 14:35:09 +02:00
scorer.py	Tidy up rest	2017-10-27 21:07:59 +02:00
strings.pxd	Try to fix StringStore clean up (see #1506)	2017-11-11 03:11:27 +03:00
strings.pyx	Use safer method to get string without hit	2017-11-14 22:58:46 +03:00
structs.pxd	Make TokenC.sent_start an int, to allow ternary value	2017-10-08 19:58:54 +02:00
symbols.pxd	Fix inconsistencies in the symbols table	2018-02-18 13:51:31 +01:00
symbols.pyx	Fix inconsistencies in the symbols table	2018-02-18 13:51:31 +01:00
tokenizer.pxd	Disable tokenizer cache for special-cases. Fixes #1250	2017-10-24 16:08:05 +02:00
tokenizer.pyx	Merge pull request #1611 from fsonntag/master	2017-11-29 23:11:23 +01:00
typedefs.pxd	Work on changing StringStore to return hashes.	2017-05-28 12:36:27 +02:00
typedefs.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
util.py	Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts	2018-03-29 00:14:55 +02:00
vectors.pyx	[2032] - Changed python set to cpp stl set (#2170)	2018-03-31 13:28:25 +02:00
vocab.pxd	Add Vocab.cfg attr, to hold stuff like oov probs	2017-10-30 16:08:50 +01:00
vocab.pyx	Fix loading of multiple pre-trained vectors	2018-03-28 16:02:59 +02:00