Commit Graph

11400 Commits

Author SHA1 Message Date
svlandeg 79d4f196e5 pin flake8 to 3.5.0 2020-05-15 11:53:01 +02:00
svlandeg e0fda2bd81 throw warning when model_cfg is None 2020-05-15 11:02:10 +02:00
svlandeg 102c8c7e2f fix fan_in renaming 2020-05-12 13:56:10 +02:00
svlandeg 9fe1e23512 update to thinc 8.0.0a6 2020-05-12 13:51:25 +02:00
Matthew Honnibal eb117e2fce Add load_config_from_str helper 2020-05-02 14:09:21 +02:00
Ines Montani 962bf12a20
Merge pull request #5312 from odaxiom/fix/website-documentation-spacy-lookup 2020-04-29 12:54:31 +02:00
Sofie Van Landeghem 1bf2082ac4
update is_new_osx function (#5376) 2020-04-29 12:51:49 +02:00
Matthew Honnibal b2ef6100af
Only run backprop once when tok2vec weights are shared (#5331)
Previously, pipelines with shared tok2vec weights would call the
tok2vec backprop callback multiple times, once for each pipeline
component. This caused errors for PyTorch, and was inefficient.

Instead, accumulate the gradient for all but one component, and just
call the callback once.
2020-04-21 19:30:41 +02:00
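
A rough sketch of the accumulate-then-call-once pattern described above (illustrative only; `components`, `get_tok2vec_gradient` and `tok2vec_backprop` are hypothetical stand-ins, not spaCy's actual internals):

```python
import numpy

def backprop_shared_tok2vec(components, tok2vec_backprop):
    """Sum the tok2vec gradient contributed by each pipeline component,
    then run the shared backprop callback exactly once."""
    accumulated = None
    for component in components:
        d_tokvecs = component.get_tok2vec_gradient()  # hypothetical accessor
        if accumulated is None:
            accumulated = numpy.array(d_tokvecs, copy=True)
        else:
            accumulated += d_tokvecs
    if accumulated is not None:
        # One call into the shared layer (e.g. a PyTorch-backed tok2vec)
        # instead of one call per component.
        tok2vec_backprop(accumulated)
```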
Matthew Honnibal 6918d99b6c
Improve GPU usage for train-with-config (#5330)
* Adjust for no ops in Optimizer

* Fix gpu in train-from-config

* Update train-from-config script

* Fix parser

* Fix GPU efficiency of padding backprop
2020-04-20 22:06:28 +02:00
Sébastien Harinck 688a328668 docs(website): fix issue in spacy-lookup example 2020-04-15 16:47:29 +02:00
Sofie Van Landeghem 42364dcd9f
Remove "pala" tokenizer exception for Spanish (#5265) 2020-04-09 10:21:20 +02:00
Sofie Van Landeghem b2e93be867
Optimizer defaults (#5244)
* set optimizer defaults to mimic thinc 7 + bump to dev6

* larger error range for senter overfitting test
2020-04-03 13:02:46 +02:00
adrianeboyd b71a11ff6d
Update morphologizer (#5108)
* Add pos and morph scoring to Scorer

Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph
accuracy in `spacy evaluate`.

* Update morphologizer for v3

* switch to tagger-based morphologizer
* use `spacy.HashCharEmbedCNN` for morphologizer defaults
* add `Doc.is_morphed` flag

* Add morphologizer to train CLI

* Add basic morphologizer pipeline tests

* Add simple morphologizer training example

* Remove subword_features from CharEmbed models

Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and
`spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features`
is always `False`.

* Rename setting in morphologizer example

Use `with_pos_tags` instead of `without_pos_tags`.

* Fix kwargs for spacy.HashCharEmbedBiLSTM.v1

* Remove defaults for spacy.HashCharEmbedBiLSTM.v1

Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`.

* Set random seed for textcat overfitting test
2020-04-02 14:46:32 +02:00
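
The pos/morph scoring additions boil down to per-type accuracy bookkeeping; a self-contained sketch of that idea (this is not spaCy's `Scorer`, and the tags below are made-up examples):

```python
from collections import defaultdict

def per_type_accuracy(gold_tags, pred_tags):
    """Overall accuracy plus accuracy broken down by gold tag."""
    correct = defaultdict(int)
    total = defaultdict(int)
    n_correct = 0
    for gold, pred in zip(gold_tags, pred_tags):
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
            n_correct += 1
    overall = n_correct / len(gold_tags) if gold_tags else 0.0
    per_type = {tag: correct[tag] / total[tag] for tag in total}
    return overall, per_type

# Example: POS accuracy and accuracy per POS tag
overall, per_type = per_type_accuracy(
    ["NOUN", "VERB", "NOUN", "ADJ"],
    ["NOUN", "VERB", "ADJ", "ADJ"],
)
print(overall, per_type)  # 0.75 {'NOUN': 0.5, 'VERB': 1.0, 'ADJ': 1.0}
```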
Sofie Van Landeghem ab59f3124e
fix NEL overfitting test for GPU (#5236) 2020-04-02 10:32:52 +02:00
Sofie Van Landeghem 311133e579
Train textcat with config (#5143)
* bring back default build_text_classifier method

* remove _set_dims_ hack in favor of proper dim inference

* add tok2vec initialize to unit test

* small fixes

* add unit test for various textcat config settings

* logistic output layer does not have nO

* fix window_size setting

* proper fix

* fix W initialization

* Update textcat training example

* Use ml_datasets
* Convert training data to `Example` format
* Use `n_texts` to set proportionate dev size

* fix _init renaming on latest thinc

* avoid setting a non-existing dim

* update to thinc==8.0.0a2

* add BOW and CNN defaults for easy testing

* various experiments with train_textcat script, fix softmax activation in textcat bow

* allow textcat train script to work on other datasets as well

* have dataset as a parameter

* train textcat from config, with example config

* add config for training textcat

* formatting

* fix exclusive_classes

* fixing BOW for GPU

* bump thinc to 8.0.0a3 (not published yet so CI will fail)

* add in link_vectors_to_models which got deleted

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-03-29 19:40:36 +02:00
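
The `exclusive_classes` and softmax fixes above hinge on picking the right output activation for the textcat head; a toy numpy sketch of the distinction (not the actual model code):

```python
import numpy

def softmax(scores):
    # Mutually exclusive classes: probabilities across classes sum to 1.
    exp = numpy.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(scores):
    # Multi-label case: each class gets an independent probability.
    return 1.0 / (1.0 + numpy.exp(-scores))

scores = numpy.array([[2.0, 0.5, -1.0]])
print(softmax(scores))  # rows sum to 1.0
print(sigmoid(scores))  # each entry independent in (0, 1)
```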
adrianeboyd ce0e538068
Check whether doc is instantiated in Example.get_gold_parses() (#5167)
* Check whether doc is instantiated

When creating docs to pair with gold parses, modify test to check
whether a doc is unset rather than whether it contains tokens.

* Restore test of evaluate on an empty doc

* Set a minimal gold.orig for the scorer

Without a minimal gold.orig the scorer can't evaluate empty docs. This
is the v3 equivalent of #4925.
2020-03-29 13:57:00 +02:00
Sofie Van Landeghem d6d95674c1
bugfix in span similarity (#5155)
* bugfix in span similarity

* also rewrite doc.pyx for clarity

* formatting
2020-03-29 13:56:07 +02:00
Sofie Van Landeghem 1f9852abc3
Fix parser @ GPU (#5210)
* ensure self.bias is numpy array in parser model

* 2 more little bug fixes for parser on GPU

* removing testing GPU statement

* remove commented code
2020-03-28 23:09:35 +01:00
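
"Ensure self.bias is numpy array" is essentially copying a possible GPU (CuPy) array back to the host; a hedged sketch of that general pattern (not the parser's actual code):

```python
import numpy

def ensure_numpy(array):
    # CuPy arrays expose .get() to copy device memory back to the host;
    # plain numpy arrays pass through unchanged.
    if hasattr(array, "get") and not isinstance(array, numpy.ndarray):
        return array.get()
    return numpy.asarray(array)
```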
Sofie Van Landeghem 9b412516e7
Fixing pickling of the parser (#5218)
* fix __reduce__ for pickling parser

* setting the move object as 'state' during pickling

* unskip test_issue4725 - works again
2020-03-27 19:35:26 +01:00
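
The `__reduce__` fix follows Python's standard pickling protocol: return a callable, the arguments needed to rebuild the object, and extra state to restore afterwards. A generic sketch of that pattern (the `Parser` and `moves` below are stand-ins, not spaCy's parser):

```python
import pickle

class Parser:
    def __init__(self, vocab):
        self.vocab = vocab
        self.moves = ["SHIFT", "REDUCE"]  # stand-in for the transition system

    def __reduce__(self):
        # (callable, args, state): args rebuild the object,
        # state is restored afterwards via __setstate__.
        return (Parser, (self.vocab,), {"moves": self.moves})

    def __setstate__(self, state):
        self.moves = state["moves"]

parser = Parser(vocab={"the": 1})
restored = pickle.loads(pickle.dumps(parser))
assert restored.moves == parser.moves
```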
Ines Montani a0858ae761
Merge pull request #5213 from explosion/tmp/sync
Try master -> develop sync again (part 2)
2020-03-27 11:39:46 +01:00
Ines Montani 92b9b631ef xfail -> skip 2020-03-27 10:51:32 +01:00
Ines Montani ee4bb0e3b6 Fix import 2020-03-26 21:44:18 +01:00
Ines Montani 4fe2299586 xfail hanging test 2020-03-26 20:58:13 +01:00
Ines Montani f12a46472c Remove unicode declarations 2020-03-26 15:18:32 +01:00
Ines Montani 7453df79d1 Fix argument 2020-03-26 14:09:02 +01:00
Ines Montani e7341db5dc Add sent_start to pattern schema 2020-03-26 14:05:40 +01:00
Ines Montani 70ee4ef4fd Fix small errors 2020-03-26 13:47:31 +01:00
Ines Montani 46568f40a7 Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
Tiljander e53232533b
Describing priority rules for overlapping matches (#5197)
* Describing priority rules for overlapping matches

* Create Tiljander.md

* Describing priority rules for overlapping matches

* Update website/docs/api/entityruler.md

Co-Authored-By: Ines Montani <ines@ines.io>

Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 13:13:22 +01:00
adrianeboyd 8d3563f1c4
Minor bugfixes for train CLI (#5186)
* Omit per_type scores from model-best calculations

The addition of per_type scores to the included metrics (#4911) causes
errors when they're compared while determining the best model, so omit
them for this `max()` comparison.

* Add default speed data for interrupted train CLI

Add better speed meta defaults so that an interrupted iteration still
produces a best model.

Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 10:46:50 +01:00
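
The per_type omission can be pictured as comparing only scalar metrics when picking the best checkpoint; a small sketch under that assumption (the metric names loosely follow spaCy's scorer output but are illustrative, not the train CLI code):

```python
def best_checkpoint(checkpoints, weights=("ents_f", "tags_acc", "uas", "las")):
    """Pick the checkpoint with the highest mean of scalar scores,
    ignoring nested per-type breakdowns that can't be compared directly."""
    def mean_score(scores):
        return sum(scores[name] for name in weights if name in scores) / len(weights)
    return max(checkpoints, key=mean_score)

checkpoints = [
    {"ents_f": 0.71, "tags_acc": 0.90, "uas": 0.80, "las": 0.78,
     "ents_per_type": {"ORG": {"f": 0.6}}},
    {"ents_f": 0.74, "tags_acc": 0.91, "uas": 0.81, "las": 0.79,
     "ents_per_type": {"ORG": {"f": 0.7}}},
]
print(best_checkpoint(checkpoints)["ents_f"])  # 0.74
```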
adrianeboyd a04f802099
Fix GoldParse init when token count differs (#5191)
Fix the `GoldParse` initialization when the number of tokens has changed
(due to merging subtokens with the parser).
2020-03-26 10:46:23 +01:00
adrianeboyd d88a377bed
Remove Vectors.from_glove (#5209) 2020-03-26 10:45:47 +01:00
Ines Montani 828acffc12 Tidy up and auto-format 2020-03-25 12:28:12 +01:00
adrianeboyd b71dd44dbc
Improved Romanian tokenization for UD RRT (#5206)
Modifications to Romanian tokenization to improve tokenization for
UD_Romanian-RRT.
2020-03-25 11:28:19 +01:00
adrianeboyd 86c43e55fa
Improve Lithuanian tokenization (#5205)
* Improve Lithuanian tokenization

Modify Lithuanian tokenization to improve performance for
UD_Lithuanian-ALKSNIS.

* Update Lithuanian tokenizer tests
2020-03-25 11:28:12 +01:00
adrianeboyd 1a944e5976
Improve Italian tokenization (#5204)
Improve Italian tokenization for UD_Italian-ISDT.
2020-03-25 11:28:02 +01:00
adrianeboyd 923a453449
Modifications/updates to Portuguese tokenization (#5203)
Modifications to Portuguese tokenization for UD_Portuguese-Bosque.
Instead of splitting contractions as exceptions, they are kept as merged
tokens.
2020-03-25 11:27:53 +01:00
adrianeboyd 4117a5c705
Improve French tokenization (#5202)
Improve French tokenization for UD_French-Sequoia.
2020-03-25 11:27:42 +01:00
Ines Montani a3d09ffe61
Merge pull request #5201 from adrianeboyd/feature/ud-tokenization-nb-v2
Improved tokenization for UD_Norwegian-Bokmaal
2020-03-25 11:27:31 +01:00
Ines Montani 0e8dfdf77e
Merge pull request #5065 from adrianeboyd/feature/ud-tokenization-da
Add a few more Danish tokenizer exceptions
2020-03-25 11:27:19 +01:00
Sofie Van Landeghem 218e1706ac
Bugfix linking vectors (#5196)
* restore call to _load_vectors

* bump to thinc 8.0.0a3

* bump to 3.0.0.dev4
2020-03-25 10:20:11 +01:00
Adriane Boyd 09d442f5ad Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da 2020-03-25 09:41:52 +01:00
Adriane Boyd cba2d1d972 Disable failing abbreviation test
UD_Danish-DDT has (as far as I can tell) hallucinated periods after
abbreviations, so the changes are an artifact of the corpus and not due
to anything meaningful about Danish tokenization.
2020-03-25 09:39:26 +01:00
Adriane Boyd 79737adb90 Improved tokenization for UD_Norwegian-Bokmaal 2020-03-25 08:54:02 +01:00
Ines Montani 5f2afa0479
Merge pull request #5185 from adrianeboyd/bugfix/de-punctuation-style
Improve German tokenizer settings style
2020-03-24 16:38:32 +01:00
Ines Montani 3fc2309c48
Merge pull request #5174 from Baciccin/master
Add Ligurian language
2020-03-24 16:33:59 +01:00
Ines Montani f434d6aaa9
Merge pull request #5190 from guerda/patch-1
Remove max_length parameter in PhraseMatcher example
2020-03-24 16:32:12 +01:00
Philip Gillißen 128acb9ee1
Update guerda.md 2020-03-24 10:42:30 +01:00
Philip Gillißen 5d067bcc5e
Add SCA for guerda 2020-03-24 10:42:10 +01:00
Philip Gillißen f8b4407a29
Remove max_length parameter
The parameter max_length is deprecated in PhraseMatcher, as stated here: https://spacy.io/api/phrasematcher#init
2020-03-24 10:22:12 +01:00
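
For reference, a minimal `PhraseMatcher` usage sketch without the deprecated `max_length` argument (assuming the list-of-docs `add` signature documented for recent spaCy versions):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)  # no max_length argument
matcher.add("ML", [nlp.make_doc("machine learning")])

doc = nlp("I study machine learning at night.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```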