Commit Graph

9156 Commits

Author SHA1 Message Date
Matthew Honnibal 92f4b9c8ea set max batch size to 1000 2018-12-17 23:15:39 +00:00
Matthew Honnibal 3c4a2edf4a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-17 23:08:40 +00:00
Matthew Honnibal 95fc0176d1 Pass tagger options in begin_training 2018-12-17 23:08:31 +00:00
Matthew Honnibal 7c504b6ddb Try to implement more losses for pretraining
* Try to implement cosine loss
This one seems to be correct? Still unsure, but it performs okay

* Try to implement the von Mises-Fisher loss
This one's definitely not right yet.
2018-12-17 14:48:27 +00:00
Ines Montani e3405f8af3 Don't call begin_training if updating new model (see #3059) [ci skip] 2018-12-17 13:45:49 +01:00
Matthew Honnibal ab4b61fb6e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-16 20:11:43 +01:00
Matthew Honnibal 9ef30b0cde Accept 'text' in matcher as an alternative to ORTH 2018-12-16 20:10:43 +01:00
Sofie c6ad557cea French regular expressions instead of extensive exceptions list (on develop) (#3046) (resolves #2679)
* merge changes of PR 3023 into develop branch instead of master

* further deletions from exception list according to PR 3023
2018-12-16 18:04:55 +01:00
Ines Montani 7bbdffd36e Remove pre-set lemma for "cause" (resolves #2165) 2018-12-14 12:51:18 +01:00
Matthew Honnibal ab9494b2a3 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-12 21:08:50 +00:00
Matthew Honnibal fb56028476 Remove b1 and b2 decay 2018-12-12 12:37:07 +01:00
Matthew Honnibal df15279e88 Reduce batch size during pretrain 2018-12-10 15:30:23 +00:00
Matthew Honnibal 83ac227bd3
💫 Better support for semi-supervised learning (#3035)
The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
Support semi-supervised learning in spacy train

One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing.

    Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning.

    Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective.

    Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage:

python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-textt ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze

Implement rehearsal methods for pipeline components

The new --raw-text argument and nlp.rehearse() method also gives us a good place to implement the the idea in the pseudo-rehearsal blog post in the parser. This works as follows:

    Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model.

    Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details.

    Implement rehearsal updates for tagger

    Implement rehearsal updates for text categoriz
2018-12-10 16:25:33 +01:00
Matthew Honnibal 449b889454 Fix KeyError in Vectors.most_similar. Fixes #2648 2018-12-10 16:19:18 +01:00
Matthew Honnibal 90aec6d2f6 Fix vectors for reserved words. Closes #2871 2018-12-10 16:09:49 +01:00
Matthew Honnibal 16fd8dce1d Add get_string_id helper to spacy.strings 2018-12-10 16:09:26 +01:00
Matthew Honnibal cc1ea03004 Add test for issue #2871 -- vectors for reserved words 2018-12-10 16:09:10 +01:00
Matthew Honnibal 375f0dc529
💫 Make TextCategorizer default to a simpler, GPU-friendly model (#3038)
Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly.

This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN.

The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. This has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again.

Resolves #2934, #2756, #1798, #1748.
2018-12-10 14:37:39 +01:00
Matthew Honnibal b1c8731b4d Make spacy train respect LOG_FRIENDLY 2018-12-10 09:46:53 +01:00
Matthew Honnibal 6936ca1664 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-10 09:44:07 +01:00
Matthew Honnibal 4405b5c875 Fix resizing edge-case for NER 2018-12-10 06:25:17 +00:00
Matthew Honnibal 0994dc50d8 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-10 05:35:01 +00:00
Matthew Honnibal 24f2e9bc07 Tweak training params 2018-12-09 17:08:58 +00:00
Matthew Honnibal 16c5861d29 Fix NER space constraints
Allow entities to end on spaces, to avoid stumping the oracle when we're
inside an entity, and there's a space just before a correct entity.
2018-12-09 08:06:45 +01:00
Matthew Honnibal 1b1a1af193 Fix printing in spacy train 2018-12-09 06:03:49 +01:00
Matthew Honnibal d2ac618af1 Set cbb_maxout_pieces=3 2018-12-08 23:27:29 +01:00
Matthew Honnibal cb16b78b0d Set dropout rate to 0.2 2018-12-08 19:59:11 +01:00
Matthew Honnibal 2c2db0c492 💫 Allow Span to take text label (#3031)
Fixes #3027.

* Allow Span.__init__ to take unicode values for the `label` argument.
* Allow `Span.label_` to be writeable.

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-12-08 13:08:41 +01:00
Matthew Honnibal 11a29af751 Set cupy.random seed in fix_random_seed helper 2018-12-08 12:37:38 +01:00
Ines Montani ffdd5e964f
Small CLI improvements (#3030)
* Add todo

* Auto-format

* Update wasabi pin

* Format training results with wasabi

* Remove loading animation from model saving

Currently behaves weirdly

* Inline messages

* Remove unnecessary path2str

Already taken care of by printer

* Inline messages in CLI

* Remove unused function

* Move loading indicator into loading function

* Check for invalid whitespace entities
2018-12-08 11:49:43 +01:00
Matthew Honnibal 8aa7882762
Make NORM a token attribute (#3029)
See #3028. The solution in this patch is pretty debateable.

What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break.

The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm?

Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.
2018-12-08 10:49:10 +01:00
Matthew Honnibal a338c6f8f6 Fix JSON segmentation bug that affected French
Fix a bug in the JSON streaming code that GoldCorpus uses. Escaped
slashes were being handled incorrectly. This bug caused low scores for
French in the early v2.1.0 alphas, because most of the data was not
being read in.

Fittingly, the document that triggered the bug was a Wikipedia article about
Perl. Parsing perl remains difficult!
2018-12-08 10:41:24 +01:00
Matthew Honnibal 6f36b6bc4e Pin pex version 2018-12-07 23:42:48 +01:00
Matthew Honnibal b2bfd1e1c8 Move dropout and batch sizes out of global scope in train cmd 2018-12-07 20:54:35 +01:00
Matthew Honnibal f8c4ee34fe Update wasabi pin 2018-12-07 01:43:07 +01:00
Matthew Honnibal 40e0da9cc1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-07 00:12:22 +00:00
Matthew Honnibal 1e6725e9b7 Try to prevent spaces from being tagged as entities 2018-12-07 00:12:12 +00:00
Matthew Honnibal 427c0693c8 Fix missing comma in init-model command 2018-12-06 22:48:31 +01:00
Matthew Honnibal d896fbca62 Fix batch size in parser.pipe 2018-12-06 21:45:56 +01:00
Matthew Honnibal bb3304a4f1 Fix pickle tests 2018-12-06 20:46:36 +01:00
Matthew Honnibal e619f45287 Fix pickle tests 2018-12-06 20:43:47 +01:00
Matthew Honnibal 0a60726215 Remove cytoolz usage in CLI 2018-12-06 20:37:00 +01:00
Matthew Honnibal c0af627f32 Fix dill usage in vocab 2018-12-06 18:53:16 +01:00
Matthew Honnibal 9520489225 Fix removabl of dill (for srsly) 2018-12-06 18:46:09 +01:00
Matthew Honnibal 711f108532 Fix cytoolz import cytoolz 2018-12-06 16:04:12 +01:00
Matthew Honnibal cabaadd793
Fix build error from bad import
Thinc v7.0.0.dev6 moved FeatureExtracter around and didn't add a compatibility import.
2018-12-06 15:12:39 +01:00
Matthew Honnibal 8f6555df4e Update requirements 2018-12-04 00:07:28 +01:00
Matthew Honnibal 378ca4b46d Fix OSX build problem 2018-12-04 00:06:42 +01:00
Matthew Honnibal ea00dbaaa4 Remove usage of itertools.islice 2018-12-03 02:43:03 +01:00
Matthew Honnibal 3df26d820f Sort requirements 2018-12-03 02:41:05 +01:00