9053 Commits

Author SHA1 Message Date
Matthew Honnibal 2c37e0ccf6 💫 Use Blis for matrix multiplications (#2966)
Our epic matrix multiplication odyssey is drawing to a close...

I've now finally got the Blis linear algebra routines in a self-contained Python package, with wheels for Windows, Linux and OSX. The only missing platform at the moment is Windows Python 2.7. The result is at https://github.com/explosion/cython-blis

Thinc v7.0.0 will make the change to Blis. I've put a Thinc v7.0.0.dev0 up on PyPI so that we can test these changes with the CI, and even get them out to spacy-nightly, before Thinc v7.0.0 is released. This PR also updates the other dependencies to be in line with the current versions master is using. I've also resolved the msgpack deprecation problems, and gotten spaCy and Thinc up to date with the latest Cython.

The point of switching to Blis is to have control of how our matrix multiplications are executed across platforms. When we were using numpy for this, a different library would be used on pip and conda, OSX would use Accelerate, etc. This would open up different bugs and performance problems, especially when multi-threading was introduced.

With the change to Blis, we now strictly single-thread the matrix multiplications. This will make it much easier to use multiprocessing to parallelise the runtime, since we won't have nested parallelism problems to deal with.

* Use blis

* Use -2 arg to Cython

* Update dependencies

* Fix requirements

* Update setup dependencies

* Fix requirement typo

* Fix msgpack errors

* Remove Python27 test from Appveyor, until Blis works there

* Auto-format setup.py

* Fix murmurhash version
2018-11-27 00:44:04 +01:00
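The reasoning in 2c37e0ccf6 (single-threaded matrix multiplication, with parallelism added at the process level) can be sketched roughly as follows. This is only an illustration, not code from spaCy, Thinc or Blis: `matmul` is a pure-Python stand-in for a single-threaded Blis call, and `score_batch`, `run_parallel` and the weight matrix are hypothetical names invented for the example.

```python
from multiprocessing import Pool


def matmul(a, b):
    """Single-threaded matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def score_batch(batch):
    """Hypothetical per-batch work: multiply the inputs by a fixed weight matrix."""
    weights = [[1.0, 0.0], [0.0, 2.0]]  # illustrative values only
    return matmul(batch, weights)


def run_parallel(batches, n_workers=2):
    """Fan batches out to worker processes; each worker stays single-threaded."""
    with Pool(n_workers) as pool:
        return pool.map(score_batch, batches)
```

Because each worker performs its multiplication on one thread, the only parallelism is the pool itself, so there is no nested thread pool competing for cores. `run_parallel` should be called from under an `if __name__ == "__main__":` guard so that it also works with the spawn start method.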
Ines Montani 3832c8a2c1 💫 Use README.md instead of README.rst (#2968)
* Auto-format setup.py

* Use README.md instead of README.rst
2018-11-26 22:04:35 +01:00
Ines Montani 41c6002fd8 Tidy up [ci skip] 2018-11-26 18:56:04 +01:00
Ines Montani c62d06ea5c Port over #2949 2018-11-26 18:54:27 +01:00
Ines Montani ec5ee9e616 Auto-format 2018-11-26 18:54:20 +01:00
Ines Montani 350c8d25b0 Add EntityRecognizer.label property 2018-11-18 00:06:26 +01:00
Ines Montani 017bc2ef2f Expose TextCategorizer via __all__ 2018-11-18 00:06:13 +01:00
Ines Montani b4581435f6 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-11-16 13:08:22 +01:00
Ines Montani e2f75eb492 Fix message formatting 2018-11-16 13:08:20 +01:00
Matthew Honnibal c89fd19f66 Hack broken pipe error for Python2 2018-11-16 02:22:05 +01:00
Matthew Honnibal 2874b8efd8 Fix tok2vec loading in spacy train 2018-11-15 23:34:54 +00:00
Matthew Honnibal 2ddd428834 Fix pretrain script 2018-11-15 23:34:35 +00:00
Matthew Honnibal 09a0227656 Temporarily add a script to load reddit 2018-11-15 23:18:35 +00:00
Matthew Honnibal f8afaa0c1c Fix pretrain 2018-11-15 22:46:53 +00:00
Matthew Honnibal 6af6950e46 Fix pretrain 2018-11-15 22:45:36 +00:00
Matthew Honnibal 3e7b214e57 Make pretrain script work with stream from stdin 2018-11-15 22:44:07 +00:00
Matthew Honnibal 8fdb9bc278 💫 Add experimental ULMFit/BERT/Elmo-like pretraining (#2931)
* Add 'spacy pretrain' command

* Fix pretrain command for Python 2

* Fix pretrain command

* Fix pretrain command
2018-11-15 22:17:16 +01:00
Ines Montani e89708c3eb 💫 Allow matching non-ORTH attributes in PhraseMatcher (#2925)
* Allow matching non-orth attributes in PhraseMatcher (see #1971)

Usage: PhraseMatcher(nlp.vocab, attr='POS')

* Allow attr argument to be int

* Fix formatting

* Fix typo
2018-11-15 03:00:58 +01:00
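The idea behind the `attr` argument in e89708c3eb can be illustrated with a toy matcher that compares token sequences on a chosen attribute instead of the verbatim text. This is a standard-library sketch, not spaCy's implementation: `find_phrases`, `tok` and the dictionary-based tokens are hypothetical, and only the `attr='POS'` usage comes from the commit message.

```python
def find_phrases(tokens, patterns, attr="ORTH"):
    """Return (start, end) spans whose `attr` values equal those of a pattern."""
    keys = {tuple(t[attr] for t in pattern) for pattern in patterns}
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            if tuple(t[attr] for t in tokens[start:end]) in keys:
                matches.append((start, end))
    return matches


def tok(orth, pos):
    """Hypothetical token record with a verbatim text and a part-of-speech tag."""
    return {"ORTH": orth, "POS": pos}


doc = [tok("I", "PRON"), tok("like", "VERB"), tok("green", "ADJ"), tok("apples", "NOUN")]
pattern = [tok("red", "ADJ"), tok("cars", "NOUN")]

orth_matches = find_phrases(doc, [pattern], attr="ORTH")  # no verbatim match
pos_matches = find_phrases(doc, [pattern], attr="POS")    # ADJ NOUN matches "green apples"
```

Matching on `POS` turns the example phrase "red cars" into the template ADJ NOUN, so it also finds "green apples", which a verbatim (`ORTH`) comparison would miss.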
Matthew Honnibal 7ed9124a45 Fix Python2 error on example 2018-11-14 19:35:17 +01:00
Ines Montani 0d5b142c78 Fix typos and whitespace 2018-11-14 19:12:34 +01:00
Ines Montani bd1b0e396a Add deprecation warning for PhraseMatcher max_length 2018-11-14 19:10:46 +01:00
Ines Montani 64257bf3a7 Fix formatting 2018-11-14 19:10:21 +01:00
Ines Montani b3cadd5b81 Delete _matcher2_notes.py 2018-11-14 16:19:12 +01:00
Matthew Honnibal 5fc98ade04 Set version to 2.1.0a2 2018-11-08 09:56:56 +01:00
Matthew Honnibal 09aa616182 Make pretraining script work without GPU 2018-11-04 17:09:52 +01:00
Matthew Honnibal bc8cda818c Improve pretrain textcat example 2018-11-04 00:17:09 +00:00
Matthew Honnibal 3e7a96f99d Improve pretrain textcat example 2018-11-03 17:44:12 +00:00
Matthew Honnibal c87c50af62 Rename new example 2018-11-03 13:09:46 +00:00
Matthew Honnibal 8e8ccc0f92 Work on pretraining script 2018-11-03 12:53:25 +00:00
Matthew Honnibal ad44982f01 Fix dropout in tensorizer, update comment 2018-11-03 12:46:58 +00:00
Matthew Honnibal 0127f10ba3 Improve train tensorizer script 2018-11-03 10:54:20 +00:00
Matthew Honnibal ba365ae1c9 Normalize gradient by number of words in tensorizer 2018-11-03 10:53:22 +00:00
Matthew Honnibal dac3f1b280 Improve Tensorizer 2018-11-03 10:52:50 +00:00
Matthew Honnibal baf7feae68 Add tensorizer training example 2018-11-02 23:30:06 +00:00
Matthew Honnibal 2527ba68e5 Fix tensorizer 2018-11-02 23:29:54 +00:00
Suraj Rajan 0bf14082a4 Added more constructs for dependency tree matcher (#2836) 2018-10-29 23:21:39 +01:00
Matthew Honnibal 817e1fc5e5 Fix out-of-bounds access in NER training
The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so must be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.
2018-10-27 01:12:50 +02:00
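The class of bug described in 817e1fc5e5 can be reproduced in miniature. The sketch below is a hypothetical Python model, not the Cython parser state: only the names `B` and `safe_get` come from the commit message, while `ToyState`, its fields and the 0-based offset convention are assumptions made for illustration.

```python
class ToyState:
    """Hypothetical stand-in for a parser state with a word buffer."""

    def __init__(self, words, buffer_start):
        self.words = words
        self.buffer_start = buffer_start

    def B(self, i):
        """Index of the buffer token at offset i, or -1 if there is none."""
        index = self.buffer_start + i
        return index if 0 <= index < len(self.words) else -1

    def safe_get(self, index):
        """Return the token at index, or an empty token for an invalid index."""
        if 0 <= index < len(self.words):
            return self.words[index]
        return ""


state = ToyState(["Apple", "rises"], buffer_start=1)

missing = state.B(1)               # -1: the buffer holds only one token
guarded = state.safe_get(missing)  # "": the sentinel is handled
unguarded = state.words[missing]   # "rises": the sentinel is used as an index
```

In Python the unguarded lookup silently wraps around to the last element. In C or Cython the same index reads memory before the start of the array, which is the out-of-bounds access the commit describes.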
Ines Montani ea20b72c08 💫 Make like_num work for prefixed numbers (#2808)
* Only split + prefix if not numbers

* Make like_num work for prefixed numbers

* Add test for like_num
2018-10-01 10:49:14 +02:00
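A minimal sketch of what "like_num for prefixed numbers" in ea20b72c08 could look like is shown below. The set of accepted prefixes and the handling of separators are assumptions for illustration; the commit message itself only mentions the `+` prefix.

```python
def like_num(text):
    """Return True if text looks like a number, allowing one leading sign."""
    # Assumed prefix set; only "+" is named in the commit message.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands separators and decimal points before the digit check.
    text = text.replace(",", "").replace(".", "")
    return text.isdigit()
```

With this check, `like_num("+10")` and `like_num("~1,000")` are true, while a bare prefix such as `"+"` is rejected because the remaining string is empty.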
Matthew Honnibal b39810d692 Fix copy_reg compatibility on _serialize module 2018-09-28 15:23:14 +02:00
Matthew Honnibal f82f8ba5dd Fix serialization when empty parser model. Closes #2482 2018-09-28 15:18:52 +02:00
Matthew Honnibal d5a6c63b62 Add regression test for #2482 2018-09-28 15:18:30 +02:00
Matthew Honnibal e3e9fe18d4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-09-28 14:27:35 +02:00
Matthew Honnibal 0323f5be0c Fix _serialize module 2018-09-28 14:27:24 +02:00
Ines Montani 5d56eb70d7 Tidy up tests 2018-09-27 16:41:57 +02:00
Ines Montani 1f1bab9264 Remove unused import 2018-09-27 16:41:37 +02:00
Matthew Honnibal b42c123e5d Fix regression introduced by 1759abf1e 2018-09-25 11:08:58 +02:00
Matthew Honnibal 500898907b Fix regression in parser.begin_training() 2018-09-25 11:08:31 +02:00
Matthew Honnibal 1759abf1e5 Fix bug in sentence starts for non-projective parses
The set_children_from_heads function assumed parse trees were
projective. However, non-projective parses may be passed in during
deserialization, or after deprojectivising. This caused incorrect
sentence boundaries to be set for non-projective parses. Close #2772.
2018-09-19 14:50:06 +02:00
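The projectivity assumption mentioned in 1759abf1e5 can be made concrete with a small check. This is a sketch, not the `set_children_from_heads` code: `is_projective` is a hypothetical helper, and the encoding of a parse as a list of absolute head indices (with a root pointing at itself) is an assumption. It expects a well-formed tree without cycles.

```python
def is_projective(heads):
    """Return True if no token under an arc lies outside the head's subtree."""

    def has_ancestor(token, ancestor):
        # Walk up the tree until we meet `ancestor` or reach a root.
        while token != ancestor:
            if heads[token] == token:
                return False
            token = heads[token]
        return True

    for dep, head in enumerate(heads):
        lo, hi = sorted((dep, head))
        for between in range(lo + 1, hi):
            if not has_ancestor(between, head):
                return False
    return True


projective = [1, 1, 1]         # tokens 0 and 2 both attach to root 1
non_projective = [2, 3, 2, 2]  # arcs 2->0 and 3->1 cross
```

In a projective tree every subtree covers a contiguous span, so left and right edges can be read off directly. When arcs cross, as in the second example, that shortcut no longer holds, which is how incorrect sentence boundaries could arise.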
Matthew Honnibal 48fd36bf05 Fix test for issue 2772 2018-09-19 14:47:27 +02:00
Matthew Honnibal 6cd920e088 Add xfail test for deprojectivization SBD bug 2018-09-19 14:00:31 +02:00