Commit Graph

13274 Commits

Author SHA1 Message Date
Matthew Honnibal c1bf3a5602
Fix significant performance bug in parser training (#6010)
The parser training makes use of a trick for long documents, where we
use the oracle to cut up the document into sections, so that we can have
batch items in the middle of a document. For instance, if we have one
document of 600 words, we might make 6 states, starting at words 0, 100,
200, 300, 400 and 500.

The problem is that for v3, I screwed this up and didn't stop parsing at the end of each section! So
instead of a batch of [100, 100, 100, 100, 100, 100], we'd have a batch
of [600, 500, 400, 300, 200, 100]. Oops.

The implementation here could probably be improved; it's annoying to
have this extra variable in the state. But this'll do.

This makes the v3 parser training 5-10 times faster, depending on document
lengths. This problem wasn't in v2.
2020-09-02 12:57:13 +02:00
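The batch arithmetic in c1bf3a5602 can be reproduced with a small sketch. This is not spaCy's parser code: `section_starts`, `words_parsed` and the `stop_at_section_end` flag are hypothetical names chosen for illustration, and the 600-word document with 100-word sections is the example from the commit message.

```python
def section_starts(doc_length, max_length):
    """Word offsets at which each training state begins."""
    return list(range(0, doc_length, max_length))


def words_parsed(doc_length, max_length, stop_at_section_end):
    """Number of words each state parses (illustrative only)."""
    lengths = []
    for start in section_starts(doc_length, max_length):
        if stop_at_section_end:
            # Intended behaviour: stop where the next section begins.
            end = min(start + max_length, doc_length)
        else:
            # The bug: every state keeps parsing to the end of the document.
            end = doc_length
        lengths.append(end - start)
    return lengths


buggy = words_parsed(600, 100, stop_at_section_end=False)
fixed = words_parsed(600, 100, stop_at_section_end=True)
print(buggy, sum(buggy))  # [600, 500, 400, 300, 200, 100] 2100
print(fixed, sum(fixed))  # [100, 100, 100, 100, 100, 100] 600
```

The totals only count words per batch; the actual speedup reported in the commit depends on document lengths and on the rest of the training loop.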
svlandeg 474abb2e59 remove unused MORPH_RULES from test 2020-09-02 11:37:56 +02:00
svlandeg 6fd7f140ec custom-architectures section 2020-09-02 11:14:06 +02:00
svlandeg 3d9ae9286f small fixes 2020-09-02 10:46:38 +02:00
Sofie Van Landeghem f7a25d69f7
Bugfix in merge_entities (#6005)
* failing test

* bugfix
2020-09-01 21:57:52 +02:00
Sofie Van Landeghem 6bfb1b3a29
Fix sparse checkout for 'spacy project' (#6008)
* exit if cloning fails

* UX

* rewrite http link to git protocol, don't use stdin

* fixes to sparse checkout

* formatting
2020-09-01 19:49:01 +02:00
Matthew Honnibal 4cce32f090 Fix tagger initialization 2020-09-01 16:38:34 +02:00
Matthew Honnibal 046c38bd26
Remove 'cleanup' of strings (#6007)
A long time ago we went to some trouble to try to clean up "unused"
strings, to avoid the `StringStore` growing in long-running processes.

This never really worked reliably, and I think it was a really wrong
approach. It's much better to let the user reload the `nlp` object as
necessary, now that the string encoding is stable (in v1, the string IDs
were sequential integers, making reloading the NLP object really
annoying.)

The extra book-keeping does make some performance difference, and the
feature is unused, so it's past time we killed it.
2020-09-01 16:12:15 +02:00
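The remark in 046c38bd26 about stable string encoding can be illustrated with a rough sketch. It assumes nothing about the `StringStore` internals: `sequential_ids` and `hashed_id` are hypothetical helpers, and SHA-256 is only a stand-in for whatever hash function the library actually uses. The point is that a sequential ID depends on insertion order, while a hashed ID depends on the string alone, so it survives a reload.

```python
import hashlib


def sequential_ids(strings):
    """Sequential scheme (as described for v1): IDs follow insertion order."""
    table = {}
    for s in strings:
        table.setdefault(s, len(table))
    return table


def hashed_id(string):
    """Stable scheme: the ID is a function of the string alone.

    Uses the first 8 bytes of SHA-256 as a stand-in hash (an assumption).
    """
    digest = hashlib.sha256(string.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


# Two "processes" that happen to see the same strings in a different order.
run_a = sequential_ids(["apple", "banana"])
run_b = sequential_ids(["banana", "apple"])
print(run_a["apple"], run_b["apple"])  # 0 1

# The hashed ID does not depend on what was seen before.
print(hashed_id("apple") == hashed_id("apple"))  # True
```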
Ines Montani 690bd77669 Add todos [ci skip] 2020-09-01 14:04:36 +02:00
Ines Montani 70b226f69d Support ignore marker in project document [ci skip] 2020-09-01 12:49:04 +02:00
Ines Montani a4c51f0f18 Add v3 info to project docs [ci skip] 2020-09-01 12:36:21 +02:00
Ines Montani ef9005273b Update fill-config command and add silent mode [ci skip] 2020-09-01 12:07:04 +02:00
Matthew Honnibal 027c82c068 Update makefile 2020-09-01 01:22:54 +02:00
Matthew Honnibal bff1640a75 Try to debug tmpdir problem 2020-09-01 01:13:09 +02:00
Matthew Honnibal 61a71d8bcc Try to debug tmpdir problem 2020-09-01 01:10:53 +02:00
Matthew Honnibal ec660e3131 Fix use_pytorch_for_gpu_memory 2020-09-01 00:41:38 +02:00
Adriane Boyd 9130094199
Prevent Tagger model init with 0 labels (#5984)
* Prevent Tagger model init with 0 labels

Raise an error before trying to initialize a tagger model with 0 labels.

* Add dummy tagger label for test

* Remove tagless tagger model initialization

* Fix error number after merge

* Add dummy tagger label to test

* Fix formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-31 21:24:33 +02:00
Matthw Honnibal c38298b8fa Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-31 19:55:55 +02:00
Matthw Honnibal fe298fa50a Shuffle on first epoch of train 2020-08-31 19:55:22 +02:00
Ines Montani 9af82f3f11
Merge pull request #6003 from explosion/feature/matcher-as-spans 2020-08-31 17:50:56 +02:00
Sofie Van Landeghem 3ac620f09d
fix config example [ci skip] 2020-08-31 17:40:04 +02:00
Ines Montani 3929431af1 Update docs [ci skip] 2020-08-31 17:06:33 +02:00
Ines Montani c3b6cbd740
Merge pull request #6004 from svlandeg/feature/console-ex
console logging example
2020-08-31 17:03:52 +02:00
Ines Montani add9de5487 Deprecate (Phrase)Matcher.pipe 2020-08-31 17:01:24 +02:00
svlandeg 2c3b64a567 console logging example 2020-08-31 16:56:13 +02:00
Ines Montani bca6bf8dda Update docs [ci skip] 2020-08-31 16:39:53 +02:00
Ines Montani 97ffb4ed05
Merge pull request #6002 from svlandeg/feature/vectors-docs 2020-08-31 16:25:18 +02:00
Ines Montani db9f8896f5 Add docs [ci skip] 2020-08-31 16:10:41 +02:00
Ines Montani 83aff38c59
Make argument keyword-only
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-31 15:39:03 +02:00
Ines Montani 6340d1c63d Add as_spans to Matcher/PhraseMatcher 2020-08-31 14:53:22 +02:00
svlandeg fe6c08218e fixes 2020-08-31 14:51:49 +02:00
svlandeg 0e0abb0378 fix 2020-08-31 14:50:29 +02:00
svlandeg 56ba691ecd small fixes 2020-08-31 14:46:00 +02:00
svlandeg e47ea88aeb revert annotations refactor 2020-08-31 14:40:55 +02:00
svlandeg 13ee742fb4 example of custom logger 2020-08-31 14:24:41 +02:00
svlandeg 2c90a06fee some more information about the loggers 2020-08-31 13:43:17 +02:00
svlandeg c18eb63483 Merge remote-tracking branch 'upstream/develop' into feature/vectors-docs
# Conflicts:
#	website/docs/usage/embeddings-transformers.md
2020-08-31 13:21:36 +02:00
Juan Gutiérrez 9002bea29f
Update suffixes example (#5989)
* Update suffixes example

The current example will throw `TypeError: can only concatenate list (not "tuple") to list`

* Signing Contributor Agreement
2020-08-31 12:44:56 +02:00
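The `TypeError` quoted in 9002bea29f is ordinary Python behaviour when a tuple is added to a list. The sketch below reproduces it with plain lists; `default_suffixes` and `extra` are hypothetical stand-ins, not the values from the documentation example that the commit changed.

```python
# Hypothetical stand-ins for a list of default suffix patterns
# and a tuple of extra patterns supplied by the user.
default_suffixes = [r"\.", r",", r"\)"]
extra = (r"-+$",)

try:
    combined = default_suffixes + extra  # list + tuple is not allowed
except TypeError as err:
    message = str(err)
    print(message)

# Converting the tuple to a list first makes the concatenation valid.
combined = default_suffixes + list(extra)
print(combined)
```

Converting in the other direction (`tuple(default_suffixes) + extra`) works equally well; which one fits depends on the type the consuming function expects.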
Sofie Van Landeghem ec14744ee4
Rename Transformer listener (#6001)
* rename to spacy-transformers.TransformerListener

* add some more tok2vec tests

* use select_pipes

* fix docs - annotation setter was not changed in the end
2020-08-31 12:41:39 +02:00
Ines Montani 6ac3299e2e
Merge pull request #6000 from adrianeboyd/feature/tokenizer-special-case-filter
Restrict tokenizer exceptions to ORTH and NORM
2020-08-31 12:38:38 +02:00
Adriane Boyd 216efaf5f5 Restrict tokenizer exceptions to ORTH and NORM 2020-08-31 09:55:01 +02:00
Matthew Honnibal 9341cbc013 Set version to v3.0.0a13 2020-08-30 23:10:43 +02:00
Matthew Honnibal b69a0e332d Fix makefile 2020-08-30 20:14:52 +02:00
Matthew Honnibal acdd7b9478 Allow wheelhouse to be set in makefile 2020-08-30 20:00:49 +02:00
Matthew Honnibal 2ee0154bd0 Fix makefile 2020-08-30 17:11:24 +02:00
Matthew Honnibal b2463e4d04 Fix makefile 2020-08-30 16:37:04 +02:00
Matthew Honnibal d62a3c6551 Fix makefile 2020-08-30 16:35:10 +02:00
Matthew Honnibal af6cbb29e8 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-30 16:16:44 +02:00
Matthew Honnibal e3d959d4b4 Fix makefile 2020-08-30 16:16:30 +02:00
Ines Montani 9b86312bab Update docs [ci skip] 2020-08-29 18:43:19 +02:00