spaCy

Commit Graph

Author	SHA1	Message	Date
adrianeboyd	faaa832518	Generalize handling of tokenizer special cases (#4259) * Generalize handling of tokenizer special cases Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete. Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts: * Allows arbitrary numbers of prefixes/affixes around special cases * Allows special cases separated by infixes Existing tests/settings that couldn't be preserved as before: * The emoticon '")' is no longer a supported special case * The emoticon ':)' in "example:)" is a false positive again When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher. * Remove accidentally added test case * Really remove accidentally added test * Reload special cases when necessary Reload special cases when affixes or token_match are modified. Skip reloading during initialization. 
* Update error code number * Fix offset and whitespace in Matcher special cases * Fix offset bugs when merging and splitting tokens * Set final whitespace on final token in inserted special case * Improve cache flushing in tokenizer * Separate cache and specials memory (temporarily) * Flush cache when adding special cases * Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()` are necessary due to this bug: https://github.com/explosion/preshed/issues/21 * Remove reinitialized PreshMaps on cache flush * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Use special Matcher only for cases with affixes * Reinsert specials cache checks during normal tokenization for special cases as much as possible * Additionally include specials cache checks while splitting on infixes * Since the special Matcher needs consistent affix-only tokenization for the special cases themselves, introduce the argument `with_special_cases` in order to do tokenization with or without specials cache checks * After normal tokenization, postprocess with special cases Matcher for special cases containing affixes * Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308. 
* Restore support for pickling * Fix internal keyword add/remove for numpy arrays * Add test for #4248, clean up test * Improve efficiency of special cases handling * Use PhraseMatcher instead of Matcher * Improve efficiency of merging/splitting special cases in document * Process merge/splits in one pass without repeated token shifting * Merge in place if no splits * Update error message number * Remove UD script modifications Only used for timing/testing, should be a separate PR * Remove final traces of UD script modifications * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Add missing loop for match ID set in search loop * Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher. * Replace dict trie with MapStruct trie * Fix how match ID hash is stored/added * Update fix for match ID vocab * Switch from map_get_unless_missing to map_get * Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.) * Restructure imports to export find_matches * Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added. 
* Switch to PhraseMatcher.find_matches * Switch to local cdef functions for span filtering * Switch special case reload threshold to variable Refer to variable instead of hard-coded threshold * Move more of special case retokenize to cdef nogil Move as much of the special case retokenization to nogil as possible. * Rewrap sort as stdsort for OS X * Rewrap stdsort with specific types * Switch to qsort * Fix merge * Improve cmp functions * Fix realloc * Fix realloc again * Initialize span struct while retokenizing * Temporarily skip retokenizing * Revert "Move more of special case retokenize to cdef nogil" This reverts commit `0b7e52c797`. * Revert "Switch to qsort" This reverts commit `a98d71a942`. * Fix specials check while caching * Modify URL test with emoticons The multiple suffix tests result in the emoticon `:>`, which is now retokenized into one token as a special case after the suffixes are split off. * Refactor _apply_special_cases() * Use cdef ints for span info used in multiple spots * Modify _filter_special_spans() to prefer earlier Parallel to #4414, modify _filter_special_spans() so that the earlier span is preferred for overlapping spans of the same length. * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC * Replace MatchStruct with SpanC * Add error in debug-data if no dev docs are available (see #4575) * Update azure-pipelines.yml * Revert "Update azure-pipelines.yml" This reverts commit `ed1060cf59`. 
* Use latest wasabi * Reorganise install_requires * add dframcy to universe.json (#4580) * Update universe.json [ci skip] * Fix multiprocessing for as_tuples=True (#4582) * Fix conllu script (#4579) * force extensions to avoid clash between example scripts * fix arg order and default file encoding * add example config for conllu script * newline * move extension definitions to main function * few more encodings fixes * Add load_from_docbin example [ci skip] TODO: upload the file somewhere * Update README.md * Add warnings about 3.8 (resolves #4593) [ci skip] * Fixed typo: Added space between "recognize" and "various" (#4600) * Fix DocBin.merge() example (#4599) * Replace function registries with catalogue (#4584) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip] * Bugfix/dep matcher issue 4590 (#4601) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMacther (#4590) * Minor updates to language example sentences (#4608) * Add punctuation to Spanish example sentences * Combine multilanguage examples for lang xx * Add punctuation to nb examples * Always realloc to a larger size Avoid potential (unlikely) edge case and cymem error seen in #4604. * Add error in debug-data if no dev docs are available (see #4575) * Update debug-data for GoldCorpus / Example * Ignore None label in misaligned NER data	2019-11-13 21:24:35 +01:00
Ines Montani	6abc1ddb26	Update __main__.py	2019-03-20 09:43:26 +01:00
Ines Montani	37c7c85a86	💫 New JSON helpers, training data internals & CLI rewrite (#2932) * Support nowrap setting in util.prints * Tidy up and fix whitespace * Simplify script and use read_jsonl helper * Add JSON schemas (see #2928) * Deprecate Doc.print_tree Will be replaced with Doc.to_json, which will produce a unified format * Add Doc.to_json() method (see #2928) Converts Doc objects to JSON using the same unified format as the training data. Method also supports serializing selected custom attributes in the doc._. space. * Remove outdated test * Add write_json and write_jsonl helpers * WIP: Update spacy train * Tidy up spacy train * WIP: Use wasabi for formatting * Add GoldParse helpers for JSON format * WIP: add debug-data command * Fix typo * Add missing import * Update wasabi pin * Add missing import * 💫 Refactor CLI (#2943) To be merged into #2932. ## Description - [x] refactor CLI To use [`wasabi`](https://github.com/ines/wasabi) - [x] use [`black`](https://github.com/ambv/black) for auto-formatting - [x] add `flake8` config - [x] move all messy UD-related scripts to `cli.ud` - [x] make converters function that take the opened file and return the converted data (instead of having them handle the IO) ### Types of change enhancement ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Update wasabi pin * Delete old test * Update errors * Fix typo * Tidy up and format remaining code * Fix formatting * Improve formatting of messages * Auto-format remaining code * Add tok2vec stuff to spacy.train * Fix typo * Update wasabi pin * Fix path checks for when train() is called as function * Reformat and tidy up pretrain script * Update argument annotations * Raise error if model language doesn't match lang * Document new train command	2018-11-30 20:16:14 +01:00
Matthew Honnibal	8fdb9bc278	💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931) * Add 'spacy pretrain' command * Fix pretrain command for Python 2 * Fix pretrain command * Fix pretrain command	2018-11-15 22:17:16 +01:00
Matthew Honnibal	1f7229f40f	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `92c26a35d4`.	2018-03-27 19:23:02 +02:00
Søren Lind Kristiansen	7f0ab145e9	Don't pass CLI command name as dummy argument	2018-01-04 21:33:47 +01:00
ines	82e80ff928	Rename model command to init_model and fix formatting	2017-12-07 09:59:23 +01:00
ines	affd3404ab	Remove old model command (now "vocab")	2017-11-01 13:14:03 +01:00
ines	98c35d2585	Fix spacy vocab command	2017-10-30 18:38:41 +01:00
Explosion Bot	0fc1209421	Wire up new vocab command	2017-10-30 16:14:50 +01:00
ines	778212efea	Tidy up init and main	2017-10-27 14:39:51 +02:00
ines	fff1028391	Add validate CLI command	2017-10-12 20:05:06 +02:00
Matthew Honnibal	69c7c642c2	Add spacy evaluate	2017-10-01 14:05:04 -05:00
Matthew Honnibal	cec76801dc	Add profile command to CLI	2017-08-21 23:23:05 +02:00
Gyorgy Orosz	e5344b83a3	Ported model cli from v1	2017-08-19 21:45:23 +02:00
ines	fc3ec733ea	Reduce complexity in CLI Remove now redundant model command and move plac annotations to cli files	2017-05-22 12:28:58 +02:00
Matthew Honnibal	80e19a2399	Simplify CLI implementation for subcommands. Remove model command.	2017-05-22 04:51:08 -05:00
Matthew Honnibal	7811d97339	Refactor CLI	2017-05-22 04:51:08 -05:00
Matthew Honnibal	4c9202249d	Refactor training, to fix memory leak	2017-05-21 09:07:06 -05:00
Matthew Honnibal	8d5e6d9f4f	Rename no_ner arg to no_entities	2017-05-19 13:23:11 -05:00
Matthew Honnibal	39ea38c4b1	Add option to use gpu to spacy train	2017-05-18 04:21:49 -05:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network * Integrate models into pipeline * Add basic serialization (maybe incorrect) * Fix pickle on vocab	2017-05-17 12:04:50 +02:00
ines	8d8dd9ceb2	Don't set default value for model	2017-05-07 23:22:21 +02:00
ines	a7574b7572	Add more options to read in meta data in package command Add meta option to supply path to meta.json. If no meta path is set, check if meta.json exists in input directory and use it. Otherwise, prompt for details on the command line.	2017-04-16 13:06:02 +02:00
ines	561f2a3eb4	Use consistent formatting for docstrings	2017-04-15 11:59:21 +02:00
ines	789ce8a45e	Add convert command	2017-04-07 13:04:17 +02:00
ines	47ddce6eb7	Remove unused variable	2017-04-07 13:01:48 +02:00
ines	7ceaa1614b	Add experimental model init command	2017-03-26 20:51:40 +02:00
Matthew Honnibal	fa107f95f6	Remove unused train_config command	2017-03-26 09:28:59 -05:00
ines	007a2492bd	Remove train_config command for now	2017-03-26 15:40:50 +02:00
ines	b297fab062	Update error message for missing commands	2017-03-26 15:40:02 +02:00
ines	7f95023fc0	Fix formatting	2017-03-26 15:37:37 +02:00
ines	5901c8f7f0	Update spacy train CLI documentation	2017-03-26 15:33:48 +02:00
Matthew Honnibal	9dcb58aaaf	Merge CLI changes	2017-03-26 07:30:45 -05:00
Matthew Honnibal	6b7f7a2060	Connect parser L1 option to train CLI	2017-03-26 07:24:07 -05:00
Matthew Honnibal	dec5571bf3	Update train CLI	2017-03-26 07:16:52 -05:00
ines	0fc56e2544	Update flag and defaults	2017-03-26 11:42:11 +02:00
ines	0035fd9efe	Add spacy train work in progress	2017-03-23 11:08:41 +01:00
ines	d5ebf583a4	Fix formatting	2017-03-23 11:08:30 +01:00
ines	09b24bc5a9	Add docs for package command	2017-03-21 11:19:21 +01:00
ines	448a916d0d	Add --force option to override directory	2017-03-21 02:05:34 +01:00
ines	8eb9a2b355	Fix formatting	2017-03-21 02:05:14 +01:00
ines	b2bcdec0f6	Update docstring	2017-03-20 22:50:55 +01:00
ines	bf240132d7	Add cli.package command to build model packages	2017-03-20 22:50:13 +01:00
ines	a54e3c2efe	Remove empty line	2017-03-20 22:49:36 +01:00
Matthew Honnibal	1a53fcc685	Fix CLI for Python 2	2017-03-18 18:14:03 +01:00
ines	ec3e810662	Add directory cli and set up command line interface	2017-03-18 15:14:48 +01:00

47 Commits