spaCy

Commit Graph

Author	SHA1	Message	Date
Sofie Van Landeghem	ab59f3124e	fix NEL overfitting test for GPU (#5236 )	2020-04-02 10:32:52 +02:00
Sofie Van Landeghem	311133e579	Train textcat with config (#5143 ) * bring back default build_text_classifier method * remove _set_dims_ hack in favor of proper dim inference * add tok2vec initialize to unit test * small fixes * add unit test for various textcat config settings * logistic output layer does not have nO * fix window_size setting * proper fix * fix W initialization * Update textcat training example * Use ml_datasets * Convert training data to `Example` format * Use `n_texts` to set proportionate dev size * fix _init renaming on latest thinc * avoid setting a non-existing dim * update to thinc==8.0.0a2 * add BOW and CNN defaults for easy testing * various experiments with train_textcat script, fix softmax activation in textcat bow * allow textcat train script to work on other datasets as well * have dataset as a parameter * train textcat from config, with example config * add config for training textcat * formatting * fix exclusive_classes * fixing BOW for GPU * bump thinc to 8.0.0a3 (not published yet so CI will fail) * add in link_vectors_to_models which got deleted Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-03-29 19:40:36 +02:00
adrianeboyd	ce0e538068	Check whether doc is instantiated in Example.get_gold_parses() (#5167 ) * Check whether doc is instantiated When creating docs to pair with gold parses, modify test to check whether a doc is unset rather than whether it contains tokens. * Restore test of evaluate on an empty doc * Set a minimal gold.orig for the scorer Without a minimal gold.orig the scorer can't evaluate empty docs. This is the v3 equivalent of #4925.	2020-03-29 13:57:00 +02:00
Sofie Van Landeghem	d6d95674c1	bugfix in span similarity (#5155 ) * bugfix in span similarity * also rewrite doc.pyx for clarity * formatting	2020-03-29 13:56:07 +02:00
Sofie Van Landeghem	1f9852abc3	Fix parser @ GPU (#5210 ) * ensure self.bias is numpy array in parser model * 2 more little bug fixes for parser on GPU * removing testing GPU statement * remove commented code	2020-03-28 23:09:35 +01:00
Sofie Van Landeghem	9b412516e7	Fixing pickling of the parser (#5218 ) * fix __reduce__ for pickling parser * setting the move object as 'state' during pickling * unskip test_issue4725 - works again	2020-03-27 19:35:26 +01:00
Ines Montani	92b9b631ef	xfail -> skip	2020-03-27 10:51:32 +01:00
Ines Montani	ee4bb0e3b6	Fix import	2020-03-26 21:44:18 +01:00
Ines Montani	4fe2299586	xfail hanging test	2020-03-26 20:58:13 +01:00
Ines Montani	f12a46472c	Remove unicode declarations	2020-03-26 15:18:32 +01:00
Ines Montani	7453df79d1	Fix argument	2020-03-26 14:09:02 +01:00
Ines Montani	e7341db5dc	Add sent_start to pattern schema	2020-03-26 14:05:40 +01:00
Ines Montani	70ee4ef4fd	Fix small errors	2020-03-26 13:47:31 +01:00
Ines Montani	46568f40a7	Merge branch 'master' into tmp/sync	2020-03-26 13:38:14 +01:00
adrianeboyd	8d3563f1c4	Minor bugfixes for train CLI (#5186 ) * Omit per_type scores from model-best calculations The addition of per_type scores to the included metrics (#4911) causes errors when they're compared while determining the best model, so omit them for this `max()` comparison. * Add default speed data for interrupted train CLI Add better speed meta defaults so that an interrupted iteration still produces a best model. Co-authored-by: Ines Montani <ines@ines.io>	2020-03-26 10:46:50 +01:00
adrianeboyd	a04f802099	Fix GoldParse init when token count differs (#5191 ) Fix the `GoldParse` initialization when the number of tokens has changed (due to merging subtokens with the parser).	2020-03-26 10:46:23 +01:00
adrianeboyd	d88a377bed	Remove Vectors.from_glove (#5209 )	2020-03-26 10:45:47 +01:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
adrianeboyd	86c43e55fa	Improve Lithuanian tokenization (#5205 ) * Improve Lithuanian tokenization Modify Lithuanian tokenization to improve performance for UD_Lithuanian-ALKSNIS. * Update Lithuanian tokenizer tests	2020-03-25 11:28:12 +01:00
adrianeboyd	1a944e5976	Improve Italian tokenization (#5204 ) Improve Italian tokenization for UD_Italian-ISDT.	2020-03-25 11:28:02 +01:00
adrianeboyd	923a453449	Modifications/updates to Portuguese tokenization (#5203 ) Modifications to Portuguese tokenization for UD_Portuguese-Bosque. Instead of splitting contractions as exceptions, they are kept as merged tokens.	2020-03-25 11:27:53 +01:00
adrianeboyd	4117a5c705	Improve French tokenization (#5202 ) Improve French tokenization for UD_French-Sequoia.	2020-03-25 11:27:42 +01:00
Ines Montani	a3d09ffe61	Merge pull request #5201 from adrianeboyd/feature/ud-tokenization-nb-v2 Improved tokenization for UD_Norwegian-Bokmaal	2020-03-25 11:27:31 +01:00
Sofie Van Landeghem	218e1706ac	Bugfix linking vectors (#5196 ) * restore call to _load_vectors * bump to thinc 8.0.0a3 * bump to 3.0.0.dev4	2020-03-25 10:20:11 +01:00
Adriane Boyd	09d442f5ad	Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da	2020-03-25 09:41:52 +01:00
Adriane Boyd	cba2d1d972	Disable failing abbreviation test UD_Danish-DDT has (as far as I can tell) hallucinated periods after abbreviations, so the changes are an artifact of the corpus and not due to anything meaningful about Danish tokenization.	2020-03-25 09:39:26 +01:00
Adriane Boyd	79737adb90	Improved tokenization for UD_Norwegian-Bokmaal	2020-03-25 08:54:02 +01:00
Ines Montani	5f2afa0479	Merge pull request #5185 from adrianeboyd/bugfix/de-punctuation-style Improve German tokenizer settings style	2020-03-24 16:38:32 +01:00
Adriane Boyd	2897a73559	Improve German tokenizer settings style	2020-03-23 19:23:47 +01:00
Baciccin	3b53617a69	Add Ligurian language	2020-03-19 21:37:01 -07:00
Ines Montani	558032017e	Merge pull request #5157 from svlandeg/bugfix/language remove unnecessary itertools call	2020-03-16 15:04:25 +01:00
Ines Montani	c68f20b398	Merge pull request #5146 from adrianeboyd/bugfix/assert-docs-equal-sents Fix sents comparison in test util	2020-03-16 14:59:32 +01:00
svlandeg	fba219f737	remove unnecessary itertools call	2020-03-16 08:31:36 +01:00
svlandeg	59000ee21d	fix serialization of empty doc + unit test	2020-03-13 16:07:56 +01:00
Adriane Boyd	423849f94a	Fix sents comparison in test util Due to changes to `Span` (#5005), spans from different documents are now never equal. Check `Token.is_sent_start` values instead.	2020-03-13 09:25:23 +01:00
Matthew Honnibal	26a90f011b	Set version to v2.2.4	2020-03-12 11:30:41 +01:00
svlandeg	c4d030dbf6	remove accidental commit	2020-03-09 18:10:54 +01:00
svlandeg	1724a4f75b	additional information if doc is empty	2020-03-09 18:08:18 +01:00
Ines Montani	1d6aec805d	Fix formatting and update docs for v2.2.4	2020-03-09 11:17:20 +01:00
Mark Abraham	0345135167	Tokenizer to_disk and from_disk now ensure paths (#5116 ) * Tokenizer to_disk and from_disk now ensure strings are converted to paths Fixes #5115 * Sign contributor agreement	2020-03-08 13:25:56 +01:00
Sofie Van Landeghem	5847be6022	Tok2Vec: extract-embed-encode (#5102 ) * avoid changing original config * fix elif structure, batch with just int crashes otherwise * tok2vec example with doc2feats, encode and embed architectures * further clean up MultiHashEmbed * further generalize Tok2Vec to work with extract-embed-encode parts * avoid initializing the charembed layer with Docs (for now ?) * small fixes for bilstm config (still does not run) * rename to core layer * move new configs * walk model to set nI instead of using core ref * fix senter overfitting test to be more similar to the training data (avoid flakey behaviour)	2020-03-08 13:23:18 +01:00
adrianeboyd	993758c58f	Remove unnecessary iterator in Language.pipe (#5101 ) Remove iterator over `raw_texts` with `iterator.tee()` in `Language.pipe` that is never consumed and consumes memory unnecessarily.	2020-03-08 13:22:25 +01:00
Sofie Van Landeghem	1a2b8fc264	set vector of merged entity (#5085 ) * merge_entities sets the vector in the vocab for the merged token * add unit test * import unicode_literals * move code to _merge function * only set vector if vocab has non-zero vectors	2020-03-06 14:45:28 +01:00
adrianeboyd	c95ce96c44	Update sentence recognizer (#5109 ) * Update sentence recognizer * rename `sentrec` to `senter` * use `spacy.HashEmbedCNN.v1` by default * update to follow `Tagger` modifications * remove component methods that can be inherited from `Tagger` * add simple initialization and overfitting pipeline tests * Update serialization test for senter	2020-03-06 14:45:02 +01:00
Sofie Van Landeghem	6ac9fc0619	Unit test for NEL functionality (#5114 ) * empty begin_training for sentencizer * overfitting unit test for entity linker * fixed NEL IO by storing the entity_vector_length in the cfg	2020-03-06 14:42:23 +01:00
Ines Montani	b0cfab317f	Merge branch 'develop' into refactor/simplify-warnings	2020-03-04 16:38:55 +01:00
Muhammad Irfan	224a7f8e94	examples	2020-03-04 15:49:06 +05:00
Muhammad Irfan	03376c9d9b	Basque language added and tested.	2020-03-04 11:58:56 +05:00
adrianeboyd	9be90dbca3	Improve token head verification (#5079 ) * Improve token head verification Improve the verification for valid token heads when heads are set: * in `Token.head`: heads come from the same document * in `Doc.from_array()`: head indices are within the bounds of the document * Improve error message	2020-03-03 21:44:51 +01:00
adrianeboyd	8c20dae6f7	Fix model-final/model-best meta from train CLI (#5093 ) * Fix model-final/model-best meta * include speed and accuracy from final iteration * combine with speeds from base model if necessary * Include token_acc metric for all components	2020-03-03 21:43:25 +01:00
Sofie Van Landeghem	a0998868ff	prevent updating cfg if the Model was already defined (#5078 )	2020-03-03 13:58:56 +01:00
Sofie Van Landeghem	d307e9ca58	take care of global vectors in multiprocessing (#5081 ) * restore load_nlp.VECTORS in the child process * add unit test * fix test * remove unnecessary import * add utf8 encoding * import unicode_literals	2020-03-03 13:58:22 +01:00
adrianeboyd	d078b47c81	Break out of infinite loop as intended (#5077 )	2020-03-03 12:29:05 +01:00
adrianeboyd	697bec764d	Normalize IS_SENT_START to SENT_START for Matcher (#5080 )	2020-03-03 12:22:39 +01:00
adrianeboyd	2281c4708c	Restore empty tokenizer properties (#5026 ) * Restore empty tokenizer properties * Check for types in tokenizer.from_bytes() * Add test for setting empty tokenizer rules	2020-03-02 11:55:02 +01:00
Sofie Van Landeghem	c6b12ab02a	Bugfix/get doc (#5049 ) * new (broken) unit test * fixing get_doc method	2020-03-02 11:49:28 +01:00
Ines Montani	648f61d077	Tidy up compiler flags and imports (#5071 )	2020-03-02 11:48:10 +01:00
Ines Montani	7efaa76168	Update errors.py	2020-02-28 12:23:31 +01:00
Ines Montani	37691e6d5d	Simplify warnings	2020-02-28 12:20:23 +01:00
Ines Montani	5da3ad682a	Tidy up and auto-format	2020-02-28 11:57:41 +01:00
adrianeboyd	65d7bab10f	Initialize all values in a2b/b2a in new align (#5063 )	2020-02-27 18:43:00 +01:00
Sofie Van Landeghem	06f0a8daa0	Default settings to configurations (#4995 ) * fix grad_clip naming * cleaning up pretrained_vectors out of cfg * further refactoring Model init's * move Model building out of pipes * further refactor to require a model config when creating a pipe * small fixes * making cfg in nn_parser more consistent * fixing nr_class for parser * fixing nn_parser's nO * fix printing of loss * architectures in own file per type, consistent naming * convenience methods default_tagger_config and default_tok2vec_config * let create_pipe access default config if available for that component * default_parser_config * move defaults to separate folder * allow reading nlp from package or dir with argument 'name' * architecture spacy.VocabVectors.v1 to read static vectors from file * cleanup * default configs for nel, textcat, morphologizer, tensorizer * fix imports * fixing unit tests * fixes and clean up * fixing defaults, nO, fix unit tests * restore parser IO * fix IO * 'fix' serialization test * add .cfg to manifest fix example configs with additional arguments * replace Morphologizer with Tagger * add IO bit when testing overfitting of tagger (currently failing) * fix IO - don't initialize when reading from disk * expand overfitting tests to also check IO goes OK * remove dropout from HashEmbed to fix Tagger performance * add defaults for sentrec * update thinc * always pass a Model instance to a Pipe * fix piped_added statement * remove obsolete W029 * remove obsolete errors * restore byte checking tests (work again) * clean up test * further test cleanup * convert from config to Model in create_pipe * bring back error when component is not initialized * cleanup * remove calls for nlp2.begin_training * use thinc.api in imports * allow setting charembed's nM and nC * fix for hardcoded nM/nC + unit test * formatting fixes * trigger build	2020-02-27 18:42:27 +01:00
Adriane Boyd	9f740a9891	Add a few more Danish tokenizer exceptions	2020-02-26 14:59:03 +01:00
Ines Montani	1c212215cd	Merge pull request #5064 from adrianeboyd/feature/german-tokenization Improve German tokenization	2020-02-26 13:41:44 +01:00
Adriane Boyd	d1f703d78d	Improve German tokenization Improve German tokenization with respect to Tiger.	2020-02-26 13:06:52 +01:00
Ines Montani	ed9358420e	Merge branch 'master' into pr/5060	2020-02-26 12:51:29 +01:00
adrianeboyd	ff184b7a9c	Add tag_map argument to CLI debug-data and train (#4750 ) (#5038 ) Add an argument for a path to a JSON-formatted tag map, which is used to update and extend the default language tag map.	2020-02-26 12:10:38 +01:00
svlandeg	18ff97589d	update spacy to 2.2.4.dev0	2020-02-26 10:50:05 +01:00
svlandeg	fc6e34c3a1	fix bugs from porting master to develop	2020-02-26 08:44:22 +01:00
Ines Montani	c1a5ece65f	Tidy up setup and update requirements tests	2020-02-25 15:46:39 +01:00
Ines Montani	5d21d3e8b9	Merge branch 'develop' into pr/5008	2020-02-25 15:24:47 +01:00
Ines Montani	d50152b917	Merge pull request #5019 from questoph/master Optimizing tokenization for Luxembourgish (dealing with apostrophe infixes)	2020-02-25 14:48:50 +01:00
Ines Montani	4440a072d2	Merge pull request #5006 from svlandeg/bugfix/multiproc-underscore load Underscore state when multiprocessing	2020-02-25 14:46:02 +01:00
svlandeg	d821c95eb0	debugging prints	2020-02-23 17:38:33 +01:00
svlandeg	58568bd0cd	fix	2020-02-23 16:45:37 +01:00
svlandeg	0f55e51704	assert we found the root_dir	2020-02-23 16:33:58 +01:00
svlandeg	783da088ea	avoid try except	2020-02-23 16:21:21 +01:00
svlandeg	b49a3afd0c	use clean_underscore fixture	2020-02-23 15:49:20 +01:00
Tom Keefe	ddf63b97a8	make idx available via to_array (#5030 )	2020-02-22 14:13:06 +01:00
Sofie Van Landeghem	44f4142ce4	add two abbreviations and some additional unit tests (#5040 )	2020-02-22 14:12:32 +01:00
Sofie Van Landeghem	479bd8d09f	add lemma option to displacy 'dep' visualiser (#5041 ) * add lemma option to displacy 'dep' visualiser * more compact list comprehension * add option to doc * fix test and add lemmas to util.get_doc * fix capital * remove lemma from get_doc * cleanup	2020-02-22 14:11:51 +01:00
adrianeboyd	2164e71ea8	Improved Romanian tokenization for UD RRT (#5036 ) Modifications to Romanian tokenization to improve tokenization for UD_Romanian-RRT.	2020-02-19 16:15:59 +01:00
svlandeg	9f1447bf71	where areth thou, file ?	2020-02-19 17:09:29 +02:00
svlandeg	9834527f2c	hack to switch between CLI folder setup and local setup	2020-02-19 16:22:48 +02:00
svlandeg	5c2f645470	root dir one level up	2020-02-19 16:15:56 +02:00
svlandeg	b20351792a	assert prints for more clarity	2020-02-19 15:51:53 +02:00
Ines Montani	a3335d36b8	Merge branch 'develop' into refactor/remove-symlinks	2020-02-18 17:22:20 +01:00
Ines Montani	09cbeaef27	Remove symlinks, data dir and related stuff	2020-02-18 17:20:17 +01:00
Ines Montani	e3f40a6a0f	Tidy up and auto-format	2020-02-18 15:38:18 +01:00
Ines Montani	1278161f47	Tidy up and fix issues	2020-02-18 15:17:03 +01:00
Ines Montani	de11ea753a	Merge branch 'master' into develop	2020-02-18 14:47:23 +01:00
Ines Montani	80e95d02b1	Allow spacy attr in token pattern	2020-02-18 14:32:53 +01:00
Jan Jessewitsch	c7e4fe9c5c	Fix/Improve german stop words (#5024 ) * Fix german stop words Two stop words ("einige" and "einigen") are sticking together. Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use. * Create Jan-711.md	2020-02-17 18:59:22 +01:00
Kabir Khan	f6ed07b85c	Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931 ) * Fix ent_ids and labels properties when id attribute used in patterns * use set for labels * sort end_ids for comparison in entity_ruler tests * fixing entity_ruler ent_ids test * add to set * Run make_doc optimistically if using phrase matcher patterns. * remove unused coveragerc I was testing with * format * Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially. * Removing old add_patterns function * Fixing spacing * Make sure token_patterns loaded as well, before generator was being emptied in from_disk	2020-02-16 18:17:47 +01:00
Sofie Van Landeghem	72c964bcf4	define pretrained_dims which is used by build_text_classifier (#5004 )	2020-02-16 17:21:17 +01:00
adrianeboyd	3b22eb651b	Sync Span __eq__ and __hash__ (#5005 ) * Sync Span __eq__ and __hash__ Use the same tuple for `__eq__` and `__hash__`, including all attributes except `vector` and `vector_norm`. * Update entity comparison in tests Update `assert_docs_equal()` test util to compare `Span` properties for ents rather than `Span` objects.	2020-02-16 17:20:36 +01:00
adrianeboyd	0c47a53b5e	Use int only in key2row for better performance (#4990 ) Cast all keys and rows to `int` in `vectors.key2row` for more efficient access and serialization.	2020-02-16 17:19:41 +01:00
adrianeboyd	5b102963bf	Require HEAD for is_parsed in Doc.from_array() (#5011 ) Modify flag settings so that `DEP` is not sufficient to set `is_parsed` and only run `set_children_from_heads()` if `HEAD` is provided. Then the combination `[SENT_START, DEP]` will set deps and not clobber sent starts with a lot of one-word sentences.	2020-02-16 17:17:09 +01:00
Sofie Van Landeghem	2572460175	add tok2vec parameters to train script to facilitate init_tok2vec (#5021 )	2020-02-16 17:16:41 +01:00
Sofie Van Landeghem	a27c77ce62	add message when cli train script throws exception (#5009 ) * add message when cli train script throws exception * fix formatting	2020-02-15 15:50:17 +01:00


6821 Commits