Commit Graph

11424 Commits

Author SHA1 Message Date
Ines Montani 6e6db6afb6 Better model compatibility and validation 2020-05-22 15:42:46 +02:00
Matthw Honnibal 25b51f4fc8 Set version to v3.0.0.dev9 2020-05-21 20:47:52 +02:00
Matthw Honnibal bc94fdabd0 Fix begin_training 2020-05-21 20:46:21 +02:00
Matthw Honnibal d507ac28d8 Fix shape inference 2020-05-21 20:46:10 +02:00
Matthw Honnibal df87c32a40 Pass smaller doc sample into model initialize 2020-05-21 20:17:24 +02:00
Matthw Honnibal 3b5cfec1fc Tweak memory management in train_from_config 2020-05-21 19:32:04 +02:00
Matthw Honnibal f075655deb Fix shape inference in begin_training 2020-05-21 19:26:29 +02:00
Matthw Honnibal 1729165e90 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-05-21 19:11:08 +02:00
Matthew Honnibal e6c4c1a507
Merge pull request #5468 from adrianeboyd/feature/cli-conllu-misc-ner
Improve handling of NER in CoNLL-U MISC
2020-05-21 16:39:46 +02:00
Adriane Boyd 4b229bfc22 Improve handling of NER in CoNLL-U MISC 2020-05-20 18:48:51 +02:00
Matthew Honnibal 609c0ba557
Fix accidentally quadratic runtime in Example.split_sents (#5464)
* Tidy up train-from-config a bit

* Fix accidentally quadratic perf in TokenAnnotation.brackets

When we're reading in the gold data, we had a nested loop where
we looped over the brackets for each token, looking for brackets
that start on that word. This is accidentally quadratic, because
we have one bracket per word (for the POS tags). So we had
an O(N**2) behaviour here that ended up being pretty slow.

To solve this I'm indexing the brackets by their starting word
on the TokenAnnotations object, and having a property to provide
the previous view.

* Fixes
2020-05-20 18:48:18 +02:00
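Note on 609c0ba557: the indexing idea described in that commit message can be illustrated with a minimal sketch. This is not the spaCy implementation. The function names and the `(start, end, label)` tuple layout are assumptions made for illustration only. The first function mirrors the nested-loop pattern the message describes, the second builds an index keyed by starting word once and then does a lookup per token.

```python
from collections import defaultdict


def brackets_per_token_quadratic(n_tokens, brackets):
    """Nested-loop pattern: scan every bracket for every token (O(N**2))."""
    result = []
    for i in range(n_tokens):
        # Hypothetical bracket layout: (start, end, label)
        result.append([(end, label) for start, end, label in brackets if start == i])
    return result


def brackets_by_start(brackets):
    """Build the index once: starting token index -> list of (end, label)."""
    index = defaultdict(list)
    for start, end, label in brackets:
        index[start].append((end, label))
    return index


def brackets_per_token_indexed(n_tokens, brackets):
    """Same result as the quadratic version, with one pass plus lookups."""
    index = brackets_by_start(brackets)
    # .get() avoids creating empty entries for tokens without brackets
    return [index.get(i, []) for i in range(n_tokens)]


brackets = [(0, 0, "DT"), (1, 1, "NN"), (0, 1, "NP")]
same = brackets_per_token_quadratic(2, brackets) == brackets_per_token_indexed(2, brackets)
```

Both functions return the brackets in their original order per token, so the indexed version can stand in for the nested loop without changing what callers see.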
Matthw Honnibal 60e8da4813 Tidy up train-from-config a bit 2020-05-20 12:56:27 +02:00
Matthw Honnibal fda7355508 Fix train-from-config 2020-05-20 12:30:21 +02:00
Matthw Honnibal 24efd54a42 Merge from develop 2020-05-20 12:27:31 +02:00
Sofie Van Landeghem 7f5715a081
Various fixes to NEL functionality, Example class etc (#5460)
* setting KB in the EL constructor, similar to how the model is passed on

* removing wikipedia example files - moved to projects

* throw an error when nlp.update is called with 2 positional arguments

* rewriting the config logic in create pipe to accommodate other objects (e.g. KB) in the config

* update config files with new parameters

* avoid training pipeline components that don't have a model (like sentencizer)

* various small fixes + UX improvements

* small fixes

* set thinc to 8.0.0a9 everywhere

* remove outdated comment
2020-05-20 11:41:12 +02:00
Matthew Honnibal 664a3603b0 Set version to v3.0.0.dev8 2020-05-19 17:15:39 +02:00
Matthew Honnibal a2830c3ef5 Use thinc 8.0.0a9 2020-05-19 16:23:11 +02:00
Sofie Van Landeghem f00de445dd
default models defined in component decorator (#5452)
* move defaults to pipeline and use in component decorator

* black formatting

* relative import
2020-05-19 16:20:03 +02:00
Sofie Van Landeghem 0d94737857
Feature toggle_pipes (#5378)
* make disable_pipes deprecated in favour of the new toggle_pipes

* rewrite disable_pipes statements

* update documentation

* remove bin/wiki_entity_linking folder

* one more fix

* remove deprecated link to documentation

* few more doc fixes

* add note about name change to the docs

* restore original disable_pipes

* small fixes

* fix typo

* fix error number to W096

* rename to select_pipes

* also make changes to the documentation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-18 22:27:10 +02:00
Matthew Honnibal 333b1a308b
Adapt parser and NER for transformers (#5449)
* Draft layer for BILUO actions

* Fixes to biluo layer

* WIP on BILUO layer

* Add tests for BILUO layer

* Format

* Fix transitions

* Update test

* Link in the simple_ner

* Update BILUO tagger

* Update __init__

* Import simple_ner

* Update test

* Import

* Add files

* Add config

* Fix label passing for BILUO and tagger

* Fix label handling for simple_ner component

* Update simple NER test

* Update config

* Hack train script

* Update BILUO layer

* Fix SimpleNER component

* Update train_from_config

* Add biluo_to_iob helper

* Add IOB layer

* Add IOBTagger model

* Update biluo layer

* Update SimpleNER tagger

* Update BILUO

* Read random seed in train-from-config

* Update use of normal_init

* Fix normalization of gradient in SimpleNER

* Update IOBTagger

* Remove print

* Tweak masking in BILUO

* Add dropout in SimpleNER

* Update thinc

* Tidy up simple_ner

* Fix biluo model

* Unhack train-from-config

* Update setup.cfg and requirements

* Add tb_framework.py for parser model

* Try to avoid memory leak in BILUO

* Move ParserModel into spacy.ml, avoid need for subclass.

* Use updated parser model

* Remove incorrect call to model.initialize in PrecomputableAffine

* Update parser model

* Avoid divide by zero in tagger

* Add extra dropout layer in tagger

* Refine minibatch_by_words function to avoid oom

* Fix parser model after refactor

* Try to avoid div-by-zero in SimpleNER

* Fix infinite loop in minibatch_by_words

* Use SequenceCategoricalCrossentropy in Tagger

* Fix parser model when hidden layer

* Remove extra dropout from tagger

* Add extra nan check in tagger

* Fix thinc version

* Update tests and imports

* Fix test

* Update test

* Update tests

* Fix tests

* Fix test

Co-authored-by: Ines Montani <ines@ines.io>
2020-05-18 22:23:33 +02:00
Ines Montani 3100c97e69
Merge pull request #5441 from svlandeg/fix/updating 2020-05-18 10:53:41 +02:00
Ines Montani e8ff4c1e6a
Pin flake8 version 2020-05-18 10:50:21 +02:00
svlandeg 6fb6a8518c bump to 3.0.0.dev7 and thinc to 8.0.0a8 2020-05-15 13:25:54 +02:00
svlandeg 047f3d7d94 remove ops argument for Adam 2020-05-15 13:25:00 +02:00
svlandeg 79d4f196e5 pin flake8 to 3.5.0 2020-05-15 11:53:01 +02:00
svlandeg e0fda2bd81 throw warning when model_cfg is None 2020-05-15 11:02:10 +02:00
svlandeg 102c8c7e2f fix fan_in renaming 2020-05-12 13:56:10 +02:00
svlandeg 9fe1e23512 update to thinc 8.0.0a6 2020-05-12 13:51:25 +02:00
Matthew Honnibal eb117e2fce Add load_config_from_str helper 2020-05-02 14:09:21 +02:00
Ines Montani 962bf12a20
Merge pull request #5312 from odaxiom/fix/website-documentation-spacy-lookup 2020-04-29 12:54:31 +02:00
Sofie Van Landeghem 1bf2082ac4
update is_new_osx function (#5376) 2020-04-29 12:51:49 +02:00
Matthew Honnibal b2ef6100af
Only run backprop once when shared tok2vec weights (#5331)
Previously, pipelines with shared tok2vec weights would call the
tok2vec backprop callback multiple times, once for each pipeline
component. This caused errors for PyTorch, and was inefficient.

Instead, accumulate the gradient for all but one component, and just
call the callback once.
2020-04-21 19:30:41 +02:00
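Note on b2ef6100af: a rough sketch of the idea in that commit message, under stated assumptions. `SharedUpstream` is a hypothetical stand-in for a shared tok2vec layer, and plain Python lists replace arrays. It is not spaCy or thinc code. The message describes accumulating the gradient for all but one component and then calling the callback once. This simplified version sums the gradients from every component and calls the callback a single time, which has the same effect on the number of backprop calls.

```python
class SharedUpstream:
    """Hypothetical shared layer that records how often backprop runs."""

    def __init__(self):
        self.backprop_calls = 0
        self.received = None

    def begin_update(self, inputs):
        outputs = [2.0 * x for x in inputs]

        def backprop(d_outputs):
            # Each call here is expensive in a real model, so we count them.
            self.backprop_calls += 1
            self.received = list(d_outputs)
            return [2.0 * d for d in d_outputs]

        return outputs, backprop


def update_components(upstream, inputs, component_grads):
    """Sum the gradients from all components, then backprop exactly once."""
    outputs, backprop = upstream.begin_update(inputs)
    total = [0.0] * len(outputs)
    for grad in component_grads:
        for i, g in enumerate(grad):
            total[i] += g
    return backprop(total)


upstream = SharedUpstream()
d_inputs = update_components(upstream, [1.0, 2.0], [[0.5, 0.5], [1.0, -0.5]])
```

After this call `upstream.backprop_calls` is 1 regardless of how many component gradients were passed in.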
Matthew Honnibal 6918d99b6c
Improve GPU usage for train-with-config (#5330)
* Adjust for no ops in Optimizer

* Fix gpu in train-from-config

* Update train-from-config script

* Fix parser

* Fix GPU efficiency of padding backprop
2020-04-20 22:06:28 +02:00
Sébastien Harinck 688a328668 docs(website): fix issue on example in spacy-lookup 2020-04-15 16:47:29 +02:00
Sofie Van Landeghem 42364dcd9f
Remove "pala" tokenizer exception for Spanish (#5265) 2020-04-09 10:21:20 +02:00
Sofie Van Landeghem b2e93be867
Optimizer defaults (#5244)
* set optimizer defaults to mimic thinc 7 + bump to dev6

* larger error range for senter overfitting test
2020-04-03 13:02:46 +02:00
adrianeboyd b71a11ff6d
Update morphologizer (#5108)
* Add pos and morph scoring to Scorer

Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph
accuracy in `spacy evaluate`.

* Update morphologizer for v3

* switch to tagger-based morphologizer
* use `spacy.HashCharEmbedCNN` for morphologizer defaults
* add `Doc.is_morphed` flag

* Add morphologizer to train CLI

* Add basic morphologizer pipeline tests

* Add simple morphologizer training example

* Remove subword_features from CharEmbed models

Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and
`spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features`
is always `False`.

* Rename setting in morphologizer example

Use `with_pos_tags` instead of `without_pos_tags`.

* Fix kwargs for spacy.HashCharEmbedBiLSTM.v1

* Remove defaults for spacy.HashCharEmbedBiLSTM.v1

Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`.

* Set random seed for textcat overfitting test
2020-04-02 14:46:32 +02:00
Sofie Van Landeghem ab59f3124e
fix NEL overfitting test for GPU (#5236) 2020-04-02 10:32:52 +02:00
Sofie Van Landeghem 311133e579
Train textcat with config (#5143)
* bring back default build_text_classifier method

* remove _set_dims_ hack in favor of proper dim inference

* add tok2vec initialize to unit test

* small fixes

* add unit test for various textcat config settings

* logistic output layer does not have nO

* fix window_size setting

* proper fix

* fix W initialization

* Update textcat training example

* Use ml_datasets
* Convert training data to `Example` format
* Use `n_texts` to set proportionate dev size

* fix _init renaming on latest thinc

* avoid setting a non-existing dim

* update to thinc==8.0.0a2

* add BOW and CNN defaults for easy testing

* various experiments with train_textcat script, fix softmax activation in textcat bow

* allow textcat train script to work on other datasets as well

* have dataset as a parameter

* train textcat from config, with example config

* add config for training textcat

* formatting

* fix exclusive_classes

* fixing BOW for GPU

* bump thinc to 8.0.0a3 (not published yet so CI will fail)

* add in link_vectors_to_models which got deleted

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-03-29 19:40:36 +02:00
adrianeboyd ce0e538068
Check whether doc is instantiated in Example.get_gold_parses() (#5167)
* Check whether doc is instantiated

When creating docs to pair with gold parses, modify test to check
whether a doc is unset rather than whether it contains tokens.

* Restore test of evaluate on an empty doc

* Set a minimal gold.orig for the scorer

Without a minimal gold.orig the scorer can't evaluate empty docs. This
is the v3 equivalent of #4925.
2020-03-29 13:57:00 +02:00
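Note on ce0e538068: the distinction that commit message draws, between a doc that is unset and a doc that contains no tokens, can be shown with a small sketch. `FakeDoc` and both helper functions are hypothetical and exist only for this illustration. The sketch assumes the doc object defines `__len__`, so an empty doc is falsy in a truthiness check.

```python
class FakeDoc:
    """Hypothetical stand-in for a doc: a sequence of words with a length."""

    def __init__(self, words):
        self.words = list(words)

    def __len__(self):
        return len(self.words)


def needs_doc_by_truthiness(doc):
    # True for None, but also for an existing doc with zero tokens.
    return not doc


def needs_doc_by_identity(doc):
    # True only when no doc has been set at all.
    return doc is None


empty = FakeDoc([])
treated_as_missing = needs_doc_by_truthiness(empty)
treated_as_present = not needs_doc_by_identity(empty)
```

With the identity check, an empty doc is kept and can still be paired with a gold parse, which is the case the restored test on an empty doc exercises.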
Sofie Van Landeghem d6d95674c1
bugfix in span similarity (#5155)
* bugfix in span similarity

* also rewrite doc.pyx for clarity

* formatting
2020-03-29 13:56:07 +02:00
Sofie Van Landeghem 1f9852abc3
Fix parser @ GPU (#5210)
* ensure self.bias is numpy array in parser model

* 2 more little bug fixes for parser on GPU

* removing testing GPU statement

* remove commented code
2020-03-28 23:09:35 +01:00
Sofie Van Landeghem 9b412516e7
Fixing pickling of the parser (#5218)
* fix __reduce__ for pickling parser

* setting the move object as 'state' during pickling

* unskip test_issue4725 - works again
2020-03-27 19:35:26 +01:00
Ines Montani a0858ae761
Merge pull request #5213 from explosion/tmp/sync
Try master -> develop sync again (part 2)
2020-03-27 11:39:46 +01:00
Ines Montani 92b9b631ef xfail -> skip 2020-03-27 10:51:32 +01:00
Ines Montani ee4bb0e3b6 Fix import 2020-03-26 21:44:18 +01:00
Ines Montani 4fe2299586 xfail hanging test 2020-03-26 20:58:13 +01:00
Ines Montani f12a46472c Remove unicode declarations 2020-03-26 15:18:32 +01:00
Ines Montani 7453df79d1 Fix argument 2020-03-26 14:09:02 +01:00
Ines Montani e7341db5dc Add sent_start to pattern schema 2020-03-26 14:05:40 +01:00