spaCy

Commit Graph

Author	SHA1	Message	Date
Ines Montani	83e0a6f3e3	Modernize plac commands for Python 3 (#4836 )	2020-01-01 13:15:46 +01:00
Ines Montani	a892821c51	More formatting changes	2019-12-25 17:59:52 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828 ) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
adrianeboyd	faaa832518	Generalize handling of tokenizer special cases (#4259 ) * Generalize handling of tokenizer special cases Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete. Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts: * Allows arbitrary numbers of prefixes/affixes around special cases * Allows special cases separated by infixes Existing tests/settings that couldn't be preserved as before: * The emoticon '")' is no longer a supported special case * The emoticon ':)' in "example:)" is a false positive again When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher. * Remove accidentally added test case * Really remove accidentally added test * Reload special cases when necessary Reload special cases when affixes or token_match are modified. Skip reloading during initialization. * Update error code number * Fix offset and whitespace in Matcher special cases * Fix offset bugs when merging and splitting tokens * Set final whitespace on final token in inserted special case * Improve cache flushing in tokenizer * Separate cache and specials memory (temporarily) * Flush cache when adding special cases * Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()` are necessary due to this bug: https://github.com/explosion/preshed/issues/21 * Remove reinitialized PreshMaps on cache flush * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Use special Matcher only for cases with affixes * Reinsert specials cache checks during normal tokenization for special cases as much as possible * Additionally include specials cache checks while splitting on infixes * Since the special Matcher needs consistent affix-only tokenization for the special cases themselves, introduce the argument `with_special_cases` in order to do tokenization with or without specials cache checks * After normal tokenization, postprocess with special cases Matcher for special cases containing affixes * Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308. * Restore support for pickling * Fix internal keyword add/remove for numpy arrays * Add test for #4248, clean up test * Improve efficiency of special cases handling * Use PhraseMatcher instead of Matcher * Improve efficiency of merging/splitting special cases in document * Process merge/splits in one pass without repeated token shifting * Merge in place if no splits * Update error message number * Remove UD script modifications Only used for timing/testing, should be a separate PR * Remove final traces of UD script modifications * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Add missing loop for match ID set in search loop * Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher. * Replace dict trie with MapStruct trie * Fix how match ID hash is stored/added * Update fix for match ID vocab * Switch from map_get_unless_missing to map_get * Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.) * Restructure imports to export find_matches * Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added. * Switch to PhraseMatcher.find_matches * Switch to local cdef functions for span filtering * Switch special case reload threshold to variable Refer to variable instead of hard-coded threshold * Move more of special case retokenize to cdef nogil Move as much of the special case retokenization to nogil as possible. * Rewrap sort as stdsort for OS X * Rewrap stdsort with specific types * Switch to qsort * Fix merge * Improve cmp functions * Fix realloc * Fix realloc again * Initialize span struct while retokenizing * Temporarily skip retokenizing * Revert "Move more of special case retokenize to cdef nogil" This reverts commit `0b7e52c797`. * Revert "Switch to qsort" This reverts commit `a98d71a942`. * Fix specials check while caching * Modify URL test with emoticons The multiple suffix tests result in the emoticon `:>`, which is now retokenized into one token as a special case after the suffixes are split off. * Refactor _apply_special_cases() * Use cdef ints for span info used in multiple spots * Modify _filter_special_spans() to prefer earlier Parallel to #4414, modify _filter_special_spans() so that the earlier span is preferred for overlapping spans of the same length. * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC * Replace MatchStruct with SpanC * Add error in debug-data if no dev docs are available (see #4575) * Update azure-pipelines.yml * Revert "Update azure-pipelines.yml" This reverts commit `ed1060cf59`. * Use latest wasabi * Reorganise install_requires * add dframcy to universe.json (#4580) * Update universe.json [ci skip] * Fix multiprocessing for as_tuples=True (#4582) * Fix conllu script (#4579) * force extensions to avoid clash between example scripts * fix arg order and default file encoding * add example config for conllu script * newline * move extension definitions to main function * few more encodings fixes * Add load_from_docbin example [ci skip] TODO: upload the file somewhere * Update README.md * Add warnings about 3.8 (resolves #4593) [ci skip] * Fixed typo: Added space between "recognize" and "various" (#4600) * Fix DocBin.merge() example (#4599) * Replace function registries with catalogue (#4584) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip] * Bugfix/dep matcher issue 4590 (#4601) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMacther (#4590) * Minor updates to language example sentences (#4608) * Add punctuation to Spanish example sentences * Combine multilanguage examples for lang xx * Add punctuation to nb examples * Always realloc to a larger size Avoid potential (unlikely) edge case and cymem error seen in #4604. * Add error in debug-data if no dev docs are available (see #4575) * Update debug-data for GoldCorpus / Example * Ignore None label in misaligned NER data	2019-11-13 21:24:35 +01:00
Sofie Van Landeghem	e48a09df4e	Example class for training data (#4543 ) * OrigAnnot class instead of gold.orig_annot list of zipped tuples * from_orig to replace from_annot_tuples * rename to RawAnnot * some unit tests for GoldParse creation and internal format * removing orig_annot and switching to lists instead of tuple * rewriting tuples to use RawAnnot (+ debug statements, WIP) * fix pop() changing the data * small fixes * pop-append fixes * return RawAnnot for existing GoldParse to have uniform interface * clean up imports * fix merge_sents * add unit test for 4402 with new structure (not working yet) * introduce DocAnnot * typo fixes * add unit test for merge_sents * rename from_orig to from_raw * fixing unit tests * fix nn parser * read_annots to produce text, doc_annot pairs * _make_golds fix * rename golds_to_gold_annots * small fixes * fix encoding * have golds_to_gold_annots use DocAnnot * missed a spot * merge_sents as function in DocAnnot * allow specifying only part of the token-level annotations * refactor with Example class + underlying dicts * pipeline components to work with Example objects (wip) * input checking * fix yielding * fix calls to update * small fixes * fix scorer unit test with new format * fix kwargs order * fixes for ud and conllu scripts * fix reading data for conllu script * add in proper errors (not fixed numbering yet to avoid merge conflicts) * fixing few more small bugs * fix EL script	2019-11-11 17:35:27 +01:00
Matthew Honnibal	d5509e0989	Support Mish activation (requires Thinc 7.3) (#4536 ) * Add arch for MishWindowEncoder * Support mish in tok2vec and conv window >=2 * Pass new tok2vec settings from parser * Syntax error * Fix tok2vec setting * Fix registration of MishWindowEncoder * Fix receptive field setting * Fix mish arch * Pass more options from parser * Support more tok2vec options in pretrain * Require thinc 7.3 * Add docs [ci skip] * Require thinc 7.3.0.dev0 to run CI * Run black * Fix typo * Update Thinc version Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 15:16:33 +01:00
Sofie Van Landeghem	2d249a9502	KB extensions and better parsing of WikiData (#4375 ) * fix overflow error on windows * more documentation & logging fixes * md fix * 3 different limit parameters to play with execution time * bug fixes directory locations * small fixes * exclude dev test articles from prior probabilities stats * small fixes * filtering wikidata entities, removing numeric and meta items * adding aliases from wikidata also to the KB * fix adding WD aliases * adding also new aliases to previously added entities * fixing comma's * small doc fixes * adding subclassof filtering * append alias functionality in KB * prevent appending the same entity-alias pair * fix for appending WD aliases * remove date filter * remove unnecessary import * small corrections and reformatting * remove WD aliases for now (too slow) * removing numeric entities from training and evaluation * small fixes * shortcut during prediction if there is only one candidate * add counts and fscore logging, remove FP NER from evaluation * fix entity_linker.predict to take docs instead of single sentences * remove enumeration sentences from the WP dataset * entity_linker.update to process full doc instead of single sentence * spelling corrections and dump locations in readme * NLP IO fix * reading KB is unnecessary at the end of the pipeline * small logging fix * remove empty files	2019-10-14 12:28:53 +02:00
Matthew Honnibal	29f9fec267	Improve spacy pretrain (#4393 ) * Support bilstm_depth arg in spacy pretrain * Add option to ignore zero vectors in get_cossim_loss * Use cosine loss in Cloze multitask	2019-10-07 23:34:58 +02:00
Ines Montani	b6670bf0c2	Use consistent spelling	2019-10-02 10:37:39 +02:00
Ines Montani	f65e36925d	Fix absolute imports and avoid importing from cli	2019-08-20 15:08:59 +02:00
Ines Montani	7e8be44218	Auto-format	2019-08-20 15:06:31 +02:00
Ines Montani	f2ea3e3ea2	Merge branch 'master' into feature/nel-wiki	2019-07-09 21:57:47 +02:00
Björn Böing	04982ccc40	Update pretrain to prevent unintended overwriting of weight fil… (#3902 ) * Update pretrain to prevent unintended overwriting of weight files for #3859 * Add '--epoch-start' to pretrain docs * Add mising pretrain arguments to bash example * Update doc tag for v2.1.5	2019-07-09 21:48:30 +02:00
Ines Montani	ae2c208735	Auto-format [ci skip]	2019-06-20 10:36:38 +02:00
Ines Montani	872121955c	Update error code	2019-06-20 10:35:51 +02:00
Björn Böing	ebf5a04d6c	Update pretrain docs and add unsupported loss_func error (#3860 ) * Add error to `get_vectors_loss` for unsupported loss function of `pretrain` * Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs. * Add missing quotation marks	2019-06-20 10:30:44 +02:00
BreakBB	d8573ee715	Update error raising for CLI pretrain to fix #3840 (#3843 ) * Add check for empty input file to CLI pretrain * Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key * Skip empty values for correct pretrain keys and log a counter as warning * Add tests for CLI pretrain core function make_docs. * Add a short hint for the `tokens` key to the CLI pretrain docs * Add success message to CLI pretrain * Update model loading to fix the tests * Skip empty values and do not create docs out of it	2019-06-16 13:22:57 +02:00
Motoki Wu	9c064e6ad9	Add resume logic to spacy pretrain (#3652 ) * Added ability to resume training * Add to readmee * Remove duplicate entry	2019-06-12 13:29:23 +02:00
intrafind	2bba2a3536	Fix for #3811 (#3815 ) Corrected type of seed parameter.	2019-06-03 18:32:47 +02:00
devforfu	21af12eb53	Make "text" key in JSONL format optional when "tokens" key is provided (#3721 ) * Fix issue with forcing text key when it is not required * Extending the docs to reflect the new behavior	2019-05-11 15:41:29 +02:00
Motoki Wu	8e2cef49f3	Add save after `--save-every` batches for `spacy pretrain` (#3510 ) <!--- Provide a general summary of your changes in the title. --> When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches. ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> To test... Save this file to `sample_sents.jsonl` ``` {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} ``` Then run `--save-every 2` when pretraining. ```bash spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2 ``` And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name. At the end the training, you should see these files (`ls here/`): ```bash config.json model2.bin model5.bin model8.bin log.jsonl model2.temp.bin model5.temp.bin model8.temp.bin model0.bin model3.bin model6.bin model9.bin model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin model1.bin model4.bin model7.bin model1.temp.bin model4.temp.bin model7.temp.bin ``` ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> This is a new feature to `spacy pretrain`. 🌵 Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error). ``` Processing matcher.pyx [Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx' Traceback (most recent call last): File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module> run(args.root) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run process(base, filename, db) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp") File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd func(args) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx raise Exception("Cython failed") Exception: Cython failed Traceback (most recent call last): File "setup.py", line 276, in <module> setup_package() File "setup.py", line 209, in setup_package generate_cython(root, "spacy") File "setup.py", line 132, in generate_cython raise RuntimeError("Running cythonize failed") RuntimeError: Running cythonize failed ``` Edit: Fixed! after deleting all `.cpp` files: `find spacy -name ".cpp" \| xargs rm` ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-04-22 14:10:16 +02:00
Ines Montani	c23e234d65	Auto-format	2019-04-01 12:11:27 +02:00
Matthew Honnibal	1612990e88	Implement cosine loss for spacy pretrain. Make default	2019-03-20 11:06:58 +00:00
Matthew Honnibal	62afa64a8d	Expose batch size and length caps on CLI for pretrain (#3417 ) Add and document CLI options for batch size, max doc length, min doc length for `spacy pretrain`. Also improve CLI output. Closes #3216 ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-16 21:38:45 +01:00
Ines Montani	25602c794c	Tidy up and fix small bugs and typos	2019-02-08 14:14:49 +01:00
Matthew Honnibal	0f83b98afa	Remove unused code from spacy pretrain	2018-12-18 19:19:26 +01:00
Matthew Honnibal	7c504b6ddb	Try to implement more losses for pretraining * Try to implement cosine loss This one seems to be correct? Still unsure, but it performs okay * Try to implement the von Mises-Fisher loss This one's definitely not right yet.	2018-12-17 14:48:27 +00:00
Matthew Honnibal	df15279e88	Reduce batch size during pretrain	2018-12-10 15:30:23 +00:00
Matthew Honnibal	83ac227bd3	💫 Better support for semi-supervised learning (#3035 ) The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting Support semi-supervised learning in spacy train One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing. Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning. Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective. Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage: python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-textt ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze Implement rehearsal methods for pipeline components The new --raw-text argument and nlp.rehearse() method also gives us a good place to implement the the idea in the pseudo-rehearsal blog post in the parser. This works as follows: Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model. Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details. Implement rehearsal updates for tagger Implement rehearsal updates for text categoriz	2018-12-10 16:25:33 +01:00
Ines Montani	f37863093a	💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003 ) Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉 See here: https://github.com/explosion/srsly Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place. At the same time, we noticed that having a lot of small dependencies was making maintainence harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel. srsly currently includes forks of the following packages: ujson msgpack msgpack-numpy cloudpickle * WIP: replace json/ujson with srsly * Replace ujson in examples Use regular json instead of srsly to make code easier to read and follow * Update requirements * Fix imports * Fix typos * Replace msgpack with srsly * Fix warning	2018-12-03 01:28:22 +01:00
Matthew Honnibal	4aa1002546	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-11-30 20:58:51 +00:00
Matthew Honnibal	6bd1cc57ee	Increase length limit for pretrain	2018-11-30 20:58:18 +00:00
Ines Montani	37c7c85a86	💫 New JSON helpers, training data internals & CLI rewrite (#2932 ) * Support nowrap setting in util.prints * Tidy up and fix whitespace * Simplify script and use read_jsonl helper * Add JSON schemas (see #2928) * Deprecate Doc.print_tree Will be replaced with Doc.to_json, which will produce a unified format * Add Doc.to_json() method (see #2928) Converts Doc objects to JSON using the same unified format as the training data. Method also supports serializing selected custom attributes in the doc._. space. * Remove outdated test * Add write_json and write_jsonl helpers * WIP: Update spacy train * Tidy up spacy train * WIP: Use wasabi for formatting * Add GoldParse helpers for JSON format * WIP: add debug-data command * Fix typo * Add missing import * Update wasabi pin * Add missing import * 💫 Refactor CLI (#2943) To be merged into #2932. ## Description - [x] refactor CLI To use [`wasabi`](https://github.com/ines/wasabi) - [x] use [`black`](https://github.com/ambv/black) for auto-formatting - [x] add `flake8` config - [x] move all messy UD-related scripts to `cli.ud` - [x] make converters function that take the opened file and return the converted data (instead of having them handle the IO) ### Types of change enhancement ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Update wasabi pin * Delete old test * Update errors * Fix typo * Tidy up and format remaining code * Fix formatting * Improve formatting of messages * Auto-format remaining code * Add tok2vec stuff to spacy.train * Fix typo * Update wasabi pin * Fix path checks for when train() is called as function * Reformat and tidy up pretrain script * Update argument annotations * Raise error if model language doesn't match lang * Document new train command	2018-11-30 20:16:14 +01:00
Matthew Honnibal	008e1ee1dd	Update pretrain command	2018-11-29 12:36:43 +00:00
Matthew Honnibal	61e435610e	💫 Feature/improve pretraining (#2971 ) * Improve spacy pretrain script * Implement BERT-style 'masked language model' objective. Much better results. * Improve logging. * Add length cap for documents, to avoid memory errors. * Require thinc 7.0.0.dev1 * Require thinc 7.0.0.dev1 * Add argument for using pretrained vectors * Fix defaults * Fix syntax error * Improve spacy pretrain script * Implement BERT-style 'masked language model' objective. Much better results. * Improve logging. * Add length cap for documents, to avoid memory errors. * Require thinc 7.0.0.dev1 * Require thinc 7.0.0.dev1 * Add argument for using pretrained vectors * Fix defaults * Fix syntax error * Tweak pretraining script * Fix data limits in spacy.gold * Fix pretrain script	2018-11-28 18:04:58 +01:00
Matthew Honnibal	2ddd428834	Fix pretrain script	2018-11-15 23:34:35 +00:00
Matthew Honnibal	f8afaa0c1c	Fix pretrain	2018-11-15 22:46:53 +00:00
Matthew Honnibal	6af6950e46	Fix pretrain	2018-11-15 22:45:36 +00:00
Matthew Honnibal	3e7b214e57	Make pretrain script work with stream from stdin	2018-11-15 22:44:07 +00:00
Matthew Honnibal	8fdb9bc278	💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931 ) * Add 'spacy pretrain' command * Fix pretrain command for Python 2 * Fix pretrain command * Fix pretrain command	2018-11-15 22:17:16 +01:00

40 Commits