Commit Graph

6838 Commits

Author SHA1 Message Date
svlandeg 1775f54a26 small little fixes 2020-06-03 22:17:02 +02:00
svlandeg 07886a3de3 rename init_tok2vec to resume 2020-06-03 22:00:25 +02:00
svlandeg 4ed6278663 small fixes to pretrain config, init_tok2vec TODO 2020-06-03 19:32:40 +02:00
svlandeg ffe0451d09 pretrain from config 2020-06-03 14:45:00 +02:00
svlandeg eac12cbb77 make dropout in embed layers configurable 2020-06-03 11:50:16 +02:00
svlandeg e91485dfc4 add discard_oversize parameter, move optimizer to training subsection 2020-06-03 10:04:16 +02:00
svlandeg 03c58b488c prevent infinite loop, custom warning 2020-06-03 10:00:21 +02:00
svlandeg 6504b7f161 Merge remote-tracking branch 'upstream/develop' into feature/pretrain-config 2020-06-03 08:30:16 +02:00
svlandeg c5ac382f0a fix name clash 2020-06-02 22:24:57 +02:00
svlandeg 2bf5111ecf additional test with discard_oversize=False 2020-06-02 22:09:37 +02:00
svlandeg aa6271b16c extending algorithm to deal better with edge cases 2020-06-02 22:05:08 +02:00
svlandeg f2e162fc60 it's only oversized if the tolerance level is also exceeded 2020-06-02 19:59:04 +02:00
svlandeg ef834b4cd7 fix comments 2020-06-02 19:50:44 +02:00
svlandeg 6208d322d3 slightly more challenging unit test 2020-06-02 19:47:30 +02:00
svlandeg 6651fafd5c using overflow buffer for examples within the tolerance margin 2020-06-02 19:43:39 +02:00
svlandeg 85b0597ed5 add test for minibatch util 2020-06-02 18:26:21 +02:00
svlandeg 5b350a6c99 bugfix of the bugfix 2020-06-02 17:49:33 +02:00
svlandeg fdfd822936 rewrite minibatch_by_words function 2020-06-02 15:22:54 +02:00
svlandeg ec52e7f886 add oversize examples before StopIteration returns 2020-06-02 13:21:55 +02:00
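The minibatch_by_words commits above (overflow buffer, tolerance margin, discard_oversize, oversize handling before StopIteration) describe batching examples by word count. A hypothetical sketch of that idea, with the signature, len()-based word counts, and exact behaviour as assumptions rather than spaCy's actual implementation:

```python
def minibatch_by_words(examples, size, tolerance=0.2, discard_oversize=False):
    """Hypothetical sketch: yield batches of roughly `size` words, letting a
    batch exceed the target by at most `size * tolerance`."""
    max_size = size + size * tolerance
    batch, batch_words = [], 0        # examples within the target size
    overflow, overflow_words = [], 0  # examples within the tolerance margin
    for example in examples:
        n_words = len(example)  # assumption: word count is available via len()
        if n_words > max_size:
            # An example is only oversized if the tolerance level is also
            # exceeded; oversized examples are dropped or yielded on their own.
            if not discard_oversize:
                yield [example]
        elif not overflow and batch_words + n_words <= size:
            batch.append(example)
            batch_words += n_words
        elif batch_words + overflow_words + n_words <= max_size:
            overflow.append(example)
            overflow_words += n_words
        else:
            # The batch plus its overflow buffer is full: emit it and start over.
            yield batch + overflow
            batch, batch_words = [example], n_words
            overflow, overflow_words = [], 0
    if batch or overflow:
        yield batch + overflow


# Toy usage with lists standing in for Example objects:
for b in minibatch_by_words([["w"] * n for n in (5, 7, 3, 12, 4)], size=10):
    print(sum(len(d) for d in b))  # -> 12, 3, 12, 4
```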
svlandeg e0f9f448f1 remove Tensorizer 2020-06-01 23:38:48 +02:00
Ines Montani b5ae2edcba Merge pull request #5516 from explosion/feature/improve-model-version-deps 2020-05-31 12:54:01 +02:00
Ines Montani dc186afdc5 Add warning 2020-05-30 15:34:54 +02:00
Ines Montani b7aff6020c Make functions more general purpose and update docstrings and tests 2020-05-30 15:18:53 +02:00
Ines Montani a7e370bcbf Don't override spaCy version 2020-05-30 15:03:18 +02:00
Ines Montani e47e5a4b10 Use more sophisticated version parsing logic 2020-05-30 15:01:58 +02:00
Ines Montani 4fd087572a WIP: improve model version deps 2020-05-28 12:51:37 +02:00
Matthw Honnibal 58750b06f8 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-05-27 22:18:36 +02:00
Ines Montani 1a15896ba9 unicode -> str consistency [ci skip] 2020-05-24 18:51:10 +02:00
Ines Montani 5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Ines Montani 387c7aba15 Update test 2020-05-24 14:55:16 +02:00
Ines Montani f9786d765e Simplify is_package check 2020-05-24 14:48:56 +02:00
Matthw Honnibal 2d9de8684d Support use_pytorch_for_gpu_memory config 2020-05-22 23:10:40 +02:00
Ines Montani 4465cad6c5 Rename spacy.analysis to spacy.pipe_analysis 2020-05-22 17:42:06 +02:00
Ines Montani 25d6ed3fb8 Merge pull request #5489 from explosion/feature/connected-components 2020-05-22 17:40:11 +02:00
Ines Montani 841c05b47b Merge pull request #5490 from explosion/fix/remove-jsonschema 2020-05-22 17:39:54 +02:00
Ines Montani 569a65b60e Auto-format 2020-05-22 16:55:42 +02:00
Ines Montani d844528c5f Add test for is_compatible_model 2020-05-22 16:55:15 +02:00
Ines Montani 12b7be1d98 Remove jsonschema from dependencies 2020-05-22 16:49:26 +02:00
Matthew Honnibal f7f6df7275 Move to spacy.analysis 2020-05-22 16:43:18 +02:00
Matthew Honnibal 78d79d94ce Guess set_annotations=True in nlp.update
During `nlp.update`, components can be passed a boolean `set_annotations`
flag to indicate whether they should assign annotations to the `Doc`. This
needs to be set to True if downstream components expect to use the
annotations during training, e.g. if we wanted to use tagger features in
the parser.

Components can specify their assignments and requirements, so we can
figure out which components have these inter-dependencies. After
figuring this out, we can guess whether to pass set_annotations=True.

We could also pass set_annotations=True always, or even make that the
only behaviour. The downside is that it would require the `Doc` objects
to be created afresh to avoid problematic modifications. One approach
would be to make a fresh copy of the `Doc` objects within `nlp.update()`,
so that we can write to the objects without any problems. If we did that,
we could drop this logic and also drop the `set_annotations` mechanism. I
would be fine with that approach, although it runs the risk of introducing
some performance overhead, and we'd have to take care to copy all
extension attributes etc.
2020-05-22 15:55:45 +02:00
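A rough sketch of the dependency-based guess described in the commit above: use the components' declared assigns/requires to decide whether an earlier component should write its annotations to the `Doc` during `nlp.update`. The helper name, the attribute access, and the toy components are assumptions for illustration, not spaCy's actual API.

```python
from types import SimpleNamespace


def guess_set_annotations(pipeline):
    """Hypothetical helper: for each (name, component) pair, guess whether to
    pass set_annotations=True, i.e. whether any later component requires an
    attribute this component assigns."""
    decisions = {}
    for i, (name, component) in enumerate(pipeline):
        assigns = set(getattr(component, "assigns", []))
        downstream_requires = set()
        for _, later in pipeline[i + 1:]:
            downstream_requires.update(getattr(later, "requires", []))
        # Only ask the component to write to the Doc if something downstream
        # actually needs one of the attributes it assigns.
        decisions[name] = bool(assigns & downstream_requires)
    return decisions


# Toy components standing in for real pipes, e.g. tagger features used by the parser:
tagger = SimpleNamespace(assigns=["token.tag"], requires=[])
parser = SimpleNamespace(assigns=["token.dep"], requires=["token.tag"])
print(guess_set_annotations([("tagger", tagger), ("parser", parser)]))
# -> {'tagger': True, 'parser': False}
```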
Ines Montani 6e6db6afb6 Better model compatibility and validation 2020-05-22 15:42:46 +02:00
Matthw Honnibal 25b51f4fc8 Set version to v3.0.0.dev9 2020-05-21 20:47:52 +02:00
Matthw Honnibal bc94fdabd0 Fix begin_training 2020-05-21 20:46:21 +02:00
Matthw Honnibal d507ac28d8 Fix shape inference 2020-05-21 20:46:10 +02:00
Matthw Honnibal df87c32a40 Pass smaller doc sample into model initialize 2020-05-21 20:17:24 +02:00
Matthw Honnibal 3b5cfec1fc Tweak memory management in train_from_config 2020-05-21 19:32:04 +02:00
Matthw Honnibal f075655deb Fix shape inference in begin_training 2020-05-21 19:26:29 +02:00
Matthew Honnibal e6c4c1a507 Merge pull request #5468 from adrianeboyd/feature/cli-conllu-misc-ner
Improve handling of NER in CoNLL-U MISC
2020-05-21 16:39:46 +02:00
Adriane Boyd 4b229bfc22 Improve handling of NER in CoNLL-U MISC 2020-05-20 18:48:51 +02:00
Matthew Honnibal 609c0ba557 Fix accidentally quadratic runtime in Example.split_sents (#5464)
* Tidy up train-from-config a bit

* Fix accidentally quadratic perf in TokenAnnotation.brackets

When we're reading in the gold data, we had a nested loop where
we looped over the brackets for each token, looking for brackets
that start on that word. This is accidentally quadratic, because
we have one bracket per word (for the POS tags). So we had
an O(N**2) behaviour here that ended up being pretty slow.

To solve this I'm indexing the brackets by their starting word on the
TokenAnnotation object, and adding a property that provides the previous
(unindexed) view.

* Fixes
2020-05-20 18:48:18 +02:00
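A minimal, standalone sketch of the fix described above: index the brackets by their starting token once, so each per-token lookup is O(1) instead of a scan over all brackets. Names and the (start, end, label) shape are assumptions for illustration, not spaCy's internals.

```python
from collections import defaultdict


def index_brackets_by_start(brackets):
    """Group (start, end, label) brackets by the index of their starting token."""
    by_start = defaultdict(list)
    for start, end, label in brackets:
        by_start[start].append((start, end, label))
    return by_start


# Accidentally quadratic: scan every bracket for every token.
def brackets_starting_at_slow(brackets, n_tokens):
    return [[b for b in brackets if b[0] == i] for i in range(n_tokens)]


# Linear: build the index once, then do O(1) lookups per token.
def brackets_starting_at_fast(brackets, n_tokens):
    by_start = index_brackets_by_start(brackets)
    return [by_start.get(i, []) for i in range(n_tokens)]


brackets = [(0, 1, "NP"), (0, 3, "S"), (2, 3, "VP")]
assert brackets_starting_at_slow(brackets, 4) == brackets_starting_at_fast(brackets, 4)
```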