spaCy

Commit Graph

Author	SHA1	Message	Date
Ines Montani	a84025d70b	Remove --no-deps from default pip args on download Add warning if user is executing spaCy without having it installed and add --no-deps to prevent the package from being redownloaded	2019-09-16 23:32:41 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226 ) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model confiug parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. * Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Ines Montani	b544dcb3c5	Document debug-data [ci skip]	2019-09-12 15:26:20 +02:00
Ines Montani	05a2df6616	Remove not implemented file validation [ci skip]	2019-09-12 15:26:02 +02:00
Ines Montani	655b434553	Merge branch 'master' into develop	2019-09-12 11:39:18 +02:00
Ines Montani	af25323653	Tidy up and auto-format	2019-09-11 14:00:36 +02:00
Matthew Honnibal	af93997993	Fix conllu converter	2019-09-11 13:28:07 +02:00
Ines Montani	e82a8d0d7a	Merge branch 'master' into develop	2019-09-11 11:52:38 +02:00
Ines Montani	6279d74c65	Tidy up and auto-format	2019-09-11 11:38:22 +02:00
Matthew Honnibal	7b858ba606	Update from master	2019-09-10 20:14:08 +02:00
Sofie Van Landeghem	482c7cd1b9	pulling tqdm imports in functions to avoid bug (tmp fix) (#4263 )	2019-09-09 16:32:11 +02:00
Matthew Honnibal	1a65c5b7af	Update develop from master	2019-09-08 18:21:41 +02:00
Ines Montani	cd90752193	Tidy up and auto-format [ci skip]	2019-08-31 13:39:06 +02:00
adrianeboyd	82159b5c19	Updates/bugfixes for NER/IOB converters (#4186 ) * Updates/bugfixes for NER/IOB converters * Converter formats `ner` and `iob` use autodetect to choose a converter if possible * `iob2json` is reverted to handle sentence-per-line data like `word1\|pos1\|ent1 word2\|pos2\|ent2` * Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped * `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns * Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents * Add option for segmenting sentences (new flag `-s`) * Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model) * Can group sentences into documents with `n_sents` as long as sentence segmentation is available * Only applies automatic segmentation when there are no existing delimiters in the data * Provide info about settings applied during conversion with warnings and suggestions if settings conflict or might not be not optimal. * Add tests for common formats * Add '(default)' back to docs for -c auto * Add document count back to output * Revert changes to converter output message * Use explicit tabs in convert CLI test data * Adjust/add messages for n_sents=1 default * Add sample NER data to training examples * Update README * Add links in docs to example NER data * Define msg within converters	2019-08-29 12:04:01 +02:00
Adriane Boyd	f3906950d3	Add separate noise vs orth level to train CLI	2019-08-29 09:10:35 +02:00
Matthew Honnibal	bc5ce49859	Fix 'noise_level' in train cmd	2019-08-28 17:55:38 +02:00
Matthew Honnibal	bb911e5f4e	Fix #3830 : 'subtok' label being added even if learn_tokens=False (#4188 ) * Prevent subtok label if not learning tokens The parser introduces the subtok label to mark tokens that should be merged during post-processing. Previously this happened even if we did not have the --learn-tokens flag set. This patch passes the config through to the parser, to prevent the problem. * Make merge_subtokens a parser post-process if learn_subtokens * Fix train script * Add test for 3830: subtok problem * Fix handlign of non-subtok in parser training	2019-08-23 17:54:00 +02:00
Ines Montani	f65e36925d	Fix absolute imports and avoid importing from cli	2019-08-20 15:08:59 +02:00
Ines Montani	7e8be44218	Auto-format	2019-08-20 15:06:31 +02:00
Ines Montani	009280fbc5	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
Ines Montani	89f2b87266	Open file as utf-8 (closes #4138 )	2019-08-18 13:55:34 +02:00
Ines Montani	f35a8221d8	Move generation of parses out of with blocks	2019-08-18 13:54:26 +02:00
Ines Montani	e5c7e19e82	Fix typo and auto-format [ci skip]	2019-08-16 10:53:38 +02:00
adrianeboyd	a58cb023d7	WIP: Extending debug-data (#4114 ) * Extending debug-data with dependency checks, etc. * Modify debug-data to load with GoldCorpus to iterate over .json/.jsonl files within directories * Add GoldCorpus iterator train_docs_without_preprocessing to load original train docs without shuffling and projectivizing * Report number of misaligned tokens * Add more dependency checks and messages * Update spacy/cli/debug_data.py Co-Authored-By: Ines Montani <ines@ines.io> * Fixed conflict * Move counts to _compile_gold() * Move all dependency nonproj/sent/head/cycle counting to _compile_gold() * Unclobber previous merges * Update variable names * Update more variable names, fix misspelling * Don't clobber loading error messages * Only warn about misaligned tokens if present	2019-08-16 10:52:46 +02:00
Ines Montani	6bec24cdd0	Require downloaded model in pkg_resources (#4090 )	2019-08-07 13:18:11 +02:00
Ines Montani	8718ca8b1f	Fix init_model if there's no vocab (closes #4048 ) (#4049 )	2019-08-01 17:26:09 +02:00
Ines Montani	a3723f439c	Fix formatting [ci skip]	2019-07-27 16:35:42 +02:00
Ines Montani	e000b5ed82	Also support "requirements" in model.json	2019-07-27 13:34:57 +02:00
Ines Montani	f2ea3e3ea2	Merge branch 'master' into feature/nel-wiki	2019-07-09 21:57:47 +02:00
Björn Böing	04982ccc40	Update pretrain to prevent unintended overwriting of weight fil… (#3902 ) * Update pretrain to prevent unintended overwriting of weight files for #3859 * Add '--epoch-start' to pretrain docs * Add mising pretrain arguments to bash example * Update doc tag for v2.1.5	2019-07-09 21:48:30 +02:00
Ines Montani	ae2c208735	Auto-format [ci skip]	2019-06-20 10:36:38 +02:00
Ines Montani	872121955c	Update error code	2019-06-20 10:35:51 +02:00
Björn Böing	ebf5a04d6c	Update pretrain docs and add unsupported loss_func error (#3860 ) * Add error to `get_vectors_loss` for unsupported loss function of `pretrain` * Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs. * Add missing quotation marks	2019-06-20 10:30:44 +02:00
BreakBB	d8573ee715	Update error raising for CLI pretrain to fix #3840 (#3843 ) * Add check for empty input file to CLI pretrain * Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key * Skip empty values for correct pretrain keys and log a counter as warning * Add tests for CLI pretrain core function make_docs. * Add a short hint for the `tokens` key to the CLI pretrain docs * Add success message to CLI pretrain * Update model loading to fix the tests * Skip empty values and do not create docs out of it	2019-06-16 13:22:57 +02:00
Motoki Wu	9c064e6ad9	Add resume logic to spacy pretrain (#3652 ) * Added ability to resume training * Add to readmee * Remove duplicate entry	2019-06-12 13:29:23 +02:00
intrafind	2bba2a3536	Fix for #3811 (#3815 ) Corrected type of seed parameter.	2019-06-03 18:32:47 +02:00
Ines Montani	aea1c93a05	Replace cytoolz.partition_all with util.minibatch	2019-05-11 21:12:09 +02:00
Ines Montani	0bf6441863	Fix .iob converter (closes #3620 )	2019-05-11 19:15:26 +02:00
Ines Montani	6b3a79ac96	Call rmtree and copytree with strings (closes #3713 )	2019-05-11 15:48:35 +02:00
devforfu	21af12eb53	Make "text" key in JSONL format optional when "tokens" key is provided (#3721 ) * Fix issue with forcing text key when it is not required * Extending the docs to reflect the new behavior	2019-05-11 15:41:29 +02:00
F0rge1cE	dd1e6b0bc6	Fix offset bug in loading pre-trained word2vec. (#3689 ) * Fix offset bug in loading pre-trained word2vec. * add contributor agreement	2019-05-06 23:00:38 +02:00
Ines Montani	e0f487f904	Rename early_stopping_iter to n_early_stopping	2019-04-22 14:31:25 +02:00
Ines Montani	9767427669	Auto-format	2019-04-22 14:31:11 +02:00
Ines Montani	7917ce2f73	Make flag shortcut consistent and document	2019-04-22 14:23:44 +02:00
Motoki Wu	8e2cef49f3	Add save after `--save-every` batches for `spacy pretrain` (#3510 ) <!--- Provide a general summary of your changes in the title. --> When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches. ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> To test... Save this file to `sample_sents.jsonl` ``` {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} ``` Then run `--save-every 2` when pretraining. ```bash spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2 ``` And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name. At the end the training, you should see these files (`ls here/`): ```bash config.json model2.bin model5.bin model8.bin log.jsonl model2.temp.bin model5.temp.bin model8.temp.bin model0.bin model3.bin model6.bin model9.bin model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin model1.bin model4.bin model7.bin model1.temp.bin model4.temp.bin model7.temp.bin ``` ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> This is a new feature to `spacy pretrain`. 🌵 Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error). ``` Processing matcher.pyx [Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx' Traceback (most recent call last): File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module> run(args.root) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run process(base, filename, db) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp") File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd func(args) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx raise Exception("Cython failed") Exception: Cython failed Traceback (most recent call last): File "setup.py", line 276, in <module> setup_package() File "setup.py", line 209, in setup_package generate_cython(root, "spacy") File "setup.py", line 132, in generate_cython raise RuntimeError("Running cythonize failed") RuntimeError: Running cythonize failed ``` Edit: Fixed! after deleting all `.cpp` files: `find spacy -name ".cpp" \| xargs rm` ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-04-22 14:10:16 +02:00
Krzysztof Kowalczyk	cc1516ec26	Improved training and evaluation (#3538 ) * Add early stopping * Add return_score option to evaluate * Fix missing str to path conversion * Fix import + old python compatibility * Fix bad beam_width setting during cpu evaluation in spacy train with gpu option turned on	2019-04-15 12:04:36 +02:00
Shikhar Chauhan	bbf6f9f764	Change default output format from `jsonl` to `json` for cli convert (#3583 ) (closes #3523 ) * Changing default ouput format from jsonl to json for cli convert * Adding Contributor Agreement	2019-04-12 11:31:23 +02:00
Ines Montani	c23e234d65	Auto-format	2019-04-01 12:11:27 +02:00
Matthew Honnibal	1c8ff59185	Merge pull request #3441 from explosion/fix/cli-ud-scripts 💫 Move UD scripts to bin	2019-03-20 12:19:15 +01:00
Matthew Honnibal	1612990e88	Implement cosine loss for spacy pretrain. Make default	2019-03-20 11:06:58 +00:00

1 2 3 4 5 ...

462 Commits