Commit Graph

9064 Commits

Author SHA1 Message Date
Matthew Honnibal 93be3ad038 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-11-29 12:37:06 +00:00
Matthew Honnibal 008e1ee1dd Update pretrain command 2018-11-29 12:36:43 +00:00
Ines Montani 8d3bfb3c04 Remove outdated options and fix formatting 2018-11-28 23:33:34 +01:00
Nathaniel J. Smith 73255091f8 Fix conftest getoption 2018-11-28 19:07:24 +01:00
Matthew Honnibal 87da5bcf5b Set version to v2.1.0a3 2018-11-28 18:22:09 +01:00
Matthew Honnibal 647d1a1efc Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-11-28 18:21:45 +01:00
Matthew Honnibal 61e435610e 💫 Feature/improve pretraining (#2971)
* Improve spacy pretrain script

* Implement BERT-style 'masked language model' objective. Much better
results.

* Improve logging.

* Add length cap for documents, to avoid memory errors.

* Require thinc 7.0.0.dev1

* Add argument for using pretrained vectors

* Fix defaults

* Fix syntax error

* Tweak pretraining script

* Fix data limits in spacy.gold

* Fix pretrain script
2018-11-28 18:04:58 +01:00
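The "BERT-style 'masked language model' objective" mentioned in the commit above can be illustrated with a minimal sketch. This follows the masking recipe from the BERT paper (select a fraction of tokens; of those, replace 80% with a mask symbol, 10% with a random token, and leave 10% unchanged) and is not taken from the spaCy source. The function name `mask_tokens`, the `[MASK]` symbol, and the parameter names are assumptions made for illustration only.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for this sketch


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Corrupt a token sequence for a masked-language-model objective.

    Returns (corrupted, targets). targets[i] holds the original token at
    positions the model must predict, and None everywhere else.
    """
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # most selected tokens are hidden
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # some become a random token
        # otherwise the token is left unchanged but is still predicted
    return corrupted, targets


words = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = mask_tokens(words, vocab=words, rng=random.Random(0), mask_prob=0.5)
```

The loss would then be computed only at positions where `targets` is not `None`.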
Matthew Honnibal 0fdb25b958 Fix msgpack error 2018-11-27 19:35:55 +01:00
Matthew Honnibal ef0820827a Update hyper-parameters after NER random search (#2972)
These experiments were completed a few weeks ago, but I didn't make the PR, pending model release.

    Token vector width: 128->96
    Hidden width: 128->64
    Embed size: 5000->2000
    Dropout: 0.2->0.1
    Updated optimizer defaults (unclear how important?)

This should improve speed, model size and load time, while keeping
similar or slightly better accuracy.

The tl;dr is we prefer to prevent over-fitting by reducing model size,
rather than using more dropout.
2018-11-27 18:49:52 +01:00
Matthew Honnibal c9f6acc564 Set version to 2.1.0a3.dev0 2018-11-27 05:15:27 +01:00
Ines Montani b6e991440c 💫 Tidy up and auto-format tests (#2967)
* Auto-format tests with black

* Add flake8 config

* Tidy up and remove unused imports

* Fix redefinitions of test functions

* Replace orths_and_spaces with words and spaces

* Fix compatibility with pytest 4.0

* xfail test for now

Test was previously overwritten by following test due to naming conflict, so failure wasn't reported

* Unfail passing test

* Only use fixture via arguments

Fixes pytest 4.0 compatibility
2018-11-27 01:09:36 +01:00
Matthew Honnibal 2c37e0ccf6 💫 Use Blis for matrix multiplications (#2966)
Our epic matrix multiplication odyssey is drawing to a close...

I've now finally got the Blis linear algebra routines in a self-contained Python package, with wheels for Windows, Linux and OSX. The only missing platform at the moment is Windows Python 2.7. The result is at https://github.com/explosion/cython-blis

Thinc v7.0.0 will make the change to Blis. I've put a Thinc v7.0.0.dev0 up on PyPI so that we can test these changes with the CI, and even get them out to spacy-nightly, before Thinc v7.0.0 is released. This PR also updates the other dependencies to be in line with the current versions master is using. I've also resolved the msgpack deprecation problems, and gotten spaCy and Thinc up to date with the latest Cython.

The point of switching to Blis is to have control of how our matrix multiplications are executed across platforms. When we were using numpy for this, a different library would be used on pip and conda, OSX would use Accelerate, etc. This would open up different bugs and performance problems, especially when multi-threading was introduced.

With the change to Blis, we now strictly single-thread the matrix multiplications. This will make it much easier to use multiprocessing to parallelise the runtime, since we won't have nested parallelism problems to deal with.

* Use blis

* Use -2 arg to Cython

* Update dependencies

* Fix requirements

* Update setup dependencies

* Fix requirement typo

* Fix msgpack errors

* Remove Python27 test from Appveyor, until Blis works there

* Auto-format setup.py

* Fix murmurhash version
2018-11-27 00:44:04 +01:00
Ines Montani 3832c8a2c1 💫 Use README.md instead of README.rst (#2968)
* Auto-format setup.py

* Use README.md instead of README.rst
2018-11-26 22:04:35 +01:00
Ines Montani 41c6002fd8 Tidy up [ci skip] 2018-11-26 18:56:04 +01:00
Ines Montani c62d06ea5c Port over #2949 2018-11-26 18:54:27 +01:00
Ines Montani ec5ee9e616 Auto-format 2018-11-26 18:54:20 +01:00
Ines Montani 350c8d25b0 Add EntityRecognizer.label property 2018-11-18 00:06:26 +01:00
Ines Montani 017bc2ef2f Expose TextCategorizer via __all__ 2018-11-18 00:06:13 +01:00
Ines Montani b4581435f6 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-11-16 13:08:22 +01:00
Ines Montani e2f75eb492 Fix message formatting 2018-11-16 13:08:20 +01:00
Matthew Honnibal c89fd19f66 Hack broken pipe error for Python2 2018-11-16 02:22:05 +01:00
Matthew Honnibal 2874b8efd8 Fix tok2vec loading in spacy train 2018-11-15 23:34:54 +00:00
Matthew Honnibal 2ddd428834 Fix pretrain script 2018-11-15 23:34:35 +00:00
Matthew Honnibal 09a0227656 Temporarily add a script to load reddit 2018-11-15 23:18:35 +00:00
Matthew Honnibal f8afaa0c1c Fix pretrain 2018-11-15 22:46:53 +00:00
Matthew Honnibal 6af6950e46 Fix pretrain 2018-11-15 22:45:36 +00:00
Matthew Honnibal 3e7b214e57 Make pretrain script work with stream from stdin 2018-11-15 22:44:07 +00:00
Matthew Honnibal 8fdb9bc278 💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931)
* Add 'spacy pretrain' command

* Fix pretrain command for Python 2

* Fix pretrain command

* Fix pretrain command
2018-11-15 22:17:16 +01:00
Ines Montani e89708c3eb 💫 Allow matching non-ORTH attributes in PhraseMatcher (#2925)
* Allow matching non-orth attributes in PhraseMatcher (see #1971)

Usage: PhraseMatcher(nlp.vocab, attr='POS')

* Allow attr argument to be int

* Fix formatting

* Fix typo
2018-11-15 03:00:58 +01:00
Matthew Honnibal 7ed9124a45 Fix Python2 error on example 2018-11-14 19:35:17 +01:00
Ines Montani 0d5b142c78 Fix typos and whitespace 2018-11-14 19:12:34 +01:00
Ines Montani bd1b0e396a Add deprecation warning for PhraseMatcher max_length 2018-11-14 19:10:46 +01:00
Ines Montani 64257bf3a7 Fix formatting 2018-11-14 19:10:21 +01:00
Ines Montani b3cadd5b81 Delete _matcher2_notes.py 2018-11-14 16:19:12 +01:00
Matthew Honnibal 5fc98ade04 Set version to 2.1.0a2 2018-11-08 09:56:56 +01:00
Matthew Honnibal 09aa616182 Make pretraining script work without GPU 2018-11-04 17:09:52 +01:00
Matthew Honnibal bc8cda818c Improve pretrain textcat example 2018-11-04 00:17:09 +00:00
Matthew Honnibal 3e7a96f99d Improve pretrain textcat example 2018-11-03 17:44:12 +00:00
Matthew Honnibal c87c50af62 Rename new example 2018-11-03 13:09:46 +00:00
Matthew Honnibal 8e8ccc0f92 Work on pretraining script 2018-11-03 12:53:25 +00:00
Matthew Honnibal ad44982f01 Fix dropout in tensorizer, update comment 2018-11-03 12:46:58 +00:00
Matthew Honnibal 0127f10ba3 Improve train tensorizer script 2018-11-03 10:54:20 +00:00
Matthew Honnibal ba365ae1c9 Normalize gradient by number of words in tensorizer 2018-11-03 10:53:22 +00:00
Matthew Honnibal dac3f1b280 Improve Tensorizer 2018-11-03 10:52:50 +00:00
Matthew Honnibal baf7feae68 Add tensorizer training example 2018-11-02 23:30:06 +00:00
Matthew Honnibal 2527ba68e5 Fix tensorizer 2018-11-02 23:29:54 +00:00
Suraj Rajan 0bf14082a4 Added more constructs for dependency tree matcher (#2836) 2018-10-29 23:21:39 +01:00
Matthew Honnibal 817e1fc5e5 Fix out-of-bounds access in NER training
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so they must be hunted down.
Hunting this one down took a long time... I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 01:12:50 +02:00
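The failure mode described in the commit above (a helper that returns -1 as a "no such token" sentinel, with the result then used directly as an array index) can be sketched in plain Python. This is an illustration only, not the Cython parser code: `first_buffer_index` and this `safe_get` are hypothetical stand-ins for the `state.B(1)` and `state.safe_get()` helpers named in the message. Note that Python silently wraps a -1 index to the last element, whereas a raw C array access would read outside the array.

```python
def first_buffer_index(buffer_start, n_tokens):
    """Return the index of the first buffer token, or -1 if the buffer is empty."""
    return buffer_start if buffer_start < n_tokens else -1


def safe_get(tokens, index, empty=None):
    """Return tokens[index] only for a valid index, otherwise an 'empty' value."""
    if 0 <= index < len(tokens):
        return tokens[index]
    return empty


tokens = ["Apple", "is", "hiring"]
idx = first_buffer_index(3, len(tokens))  # buffer exhausted, so the sentinel is returned

unsafe = tokens[idx]          # sentinel used as an index: wrong token, no error raised
safe = safe_get(tokens, idx)  # sentinel checked first: the empty value is returned
```

Routing every sentinel-valued index through a bounds-checked accessor keeps the "missing token" case explicit instead of depending on how the underlying array treats negative indices.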
Ines Montani ea20b72c08 💫 Make like_num work for prefixed numbers (#2808)
* Only split + prefix if not numbers

* Make like_num work for prefixed numbers

* Add test for like_num
2018-10-01 10:49:14 +02:00
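As a rough illustration of what "make like_num work for prefixed numbers" could mean, the sketch below strips a single leading sign-like character before running the usual numeric checks. It is a simplified stand-in rather than the library's code: the set of prefixes, the short number-word list, and the function body are assumptions for this example.

```python
# Small illustrative word list; a real lexical attribute would use a longer one.
_NUM_WORDS = {"zero", "one", "two", "three", "four", "five",
              "six", "seven", "eight", "nine", "ten"}


def like_num(text):
    """Return True if the text looks like a number, allowing one leading prefix."""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]  # drop the prefix so "+10" is checked as "10"
    text = text.replace(",", "").replace(".", "")  # ignore separators
    if text.isdigit():
        return True
    if text.count("/") == 1:  # simple fractions such as "1/2"
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _NUM_WORDS
```

With this sketch, `like_num("+10")` and `like_num("~1,000")` are true, while a bare `"+"` or a prefixed word such as `"+dog"` is not treated as a number.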
Matthew Honnibal b39810d692 Fix copy_reg compatibility on _serialize module 2018-09-28 15:23:14 +02:00