13300 Commits

Author SHA1 Message Date
Adriane Boyd f212303729 Add sent_starts to Doc.__init__
Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start`
values but also convert to and accept `sent_start` internally.
2020-09-21 17:59:09 +02:00
svlandeg 447b3e5787 Merge remote-tracking branch 'upstream/develop' into fix/debug_model
# Conflicts:
#	spacy/cli/debug_model.py
2020-09-21 16:58:40 +02:00
Ines Montani b3327c1e45 Increment version [ci skip] 2020-09-21 16:04:30 +02:00
Ines Montani e8bcaa44f1 Don't auto-decompress archives with smart_open [ci skip] 2020-09-21 16:01:46 +02:00
Adriane Boyd 6aa91c7ca0 Make user_data keyword-only 2020-09-21 16:00:06 +02:00
Adriane Boyd 177df15d89 Implement Doc.set_ents 2020-09-21 15:54:05 +02:00
Ines Montani e548654aca Update docs [ci skip] 2020-09-21 14:46:55 +02:00
Ines Montani 4b79d697ee Merge pull request #6096 from explosion/feature/config-overrides-env-vars 2020-09-21 14:46:19 +02:00
Ines Montani 626cfd7155 Merge pull request #6099 from adrianeboyd/docs/alphabetize-api-sidebar [ci skip]
Alphabetize API sidebars
2020-09-21 14:44:43 +02:00
Adriane Boyd 13fbf6556a Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2 2020-09-21 14:42:04 +02:00
svlandeg eb9b447960 Merge remote-tracking branch 'upstream/develop' into fix/debug_model
# Conflicts:
#	spacy/cli/debug_model.py
2020-09-21 14:05:16 +02:00
Adriane Boyd ce455f30ca Fix formatting 2020-09-21 13:53:29 +02:00
Adriane Boyd 9b8d0b7f90 Alphabetize API sidebars 2020-09-21 13:46:21 +02:00
Adriane Boyd bc02e86494 Extend Doc.__init__ with additional annotation
Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to
`Doc.__init__` to initialize the most common doc/token values.
2020-09-21 13:36:24 +02:00
Ines Montani 758ead8a47 Sync overrides with CLI overrides 2020-09-21 12:50:13 +02:00
Ines Montani 5497acf49a Support config overrides via environment variables 2020-09-21 11:25:10 +02:00
Ines Montani 1114219ae3 Tidy up and auto-format 2020-09-21 10:59:07 +02:00
Ines Montani 9d32cac736 Update docs [ci skip] 2020-09-21 10:55:36 +02:00
Adriane Boyd cc71ec901f Fix typo in saving and loading usage docs 2020-09-21 09:08:55 +02:00
Adriane Boyd 3aa57ce6c9 Update alignment mode in Doc.char_span docs 2020-09-21 09:07:20 +02:00
Ines Montani b9d2b29684 Update docs [ci skip] 2020-09-20 17:49:09 +02:00
Ines Montani 012b3a7096 Update docs [ci skip] 2020-09-20 17:44:58 +02:00
Ines Montani b2302c0a1c Improve error for missing dependency 2020-09-20 17:44:51 +02:00
Ines Montani 6898b35028 Merge pull request #6094 from explosion/bugfix/run_process 2020-09-20 16:49:30 +02:00
Ines Montani 744f259b9c Update landing [ci skip] 2020-09-20 16:37:23 +02:00
Matthew Honnibal 8fb59d958c Format 2020-09-20 16:31:48 +02:00
Matthew Honnibal dc22771f87 Fix sparse checkout 2020-09-20 16:30:05 +02:00
Matthew Honnibal a0fb5e50db Use simple git clone call if not sparse 2020-09-20 16:22:04 +02:00
Matthew Honnibal 2c24d633d0 Use updated run_command 2020-09-20 16:21:43 +02:00
Matthew Honnibal 889128e5c5 Improve error handling in run_command 2020-09-20 16:20:57 +02:00
Ines Montani 554c9a2497 Update docs [ci skip] 2020-09-20 12:30:53 +02:00
svlandeg 6db1d5dc0d trying some stuff 2020-09-19 19:11:30 +02:00
Ines Montani e863b3dc14 Merge pull request #6092 from adrianeboyd/bugfix/load-vocab-lookups-2 2020-09-19 12:33:38 +02:00
Sofie Van Landeghem 39872de1f6 Introducing the gpu_allocator (#6091)
* rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator'

* --code instead of --code-path

* update documentation

* avoid querying the "system" section directly

* add explanation of gpu_allocator to TF/PyTorch section in docs

* fix typo

* fix typo 2

* use set_gpu_allocator from thinc 8.0.0a34

* default null instead of empty string
2020-09-19 01:17:02 +02:00
Adriane Boyd 47080fba98 Minor renaming / refactoring
* Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message
* Make `Vocab.lookups` a property
2020-09-18 19:43:19 +02:00
svlandeg 73ff52b9ec hack for tok2vec listener 2020-09-18 16:43:15 +02:00
Adriane Boyd eed4b785f5 Load vocab lookups tables at beginning of training
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.

The option moves from `nlp.load_vocab_data` to `training.lookups`.

Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.

The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.

To load `lexeme_norm` from `spacy-lookups-data`:

```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
2020-09-18 15:59:16 +02:00
Ines Montani 0406200a1e Update docs [ci skip] 2020-09-18 15:13:13 +02:00
Ines Montani a127fa475e Merge pull request #6078 from svlandeg/fix/corpus 2020-09-18 14:44:21 +02:00
Matthew Honnibal bbdb5f62b7 Temporary work-around for scoring a subset of components (#6090)
* Try hacking the scorer to work around sentence boundaries

* Upd scorer

* Set dev version

* Upd scorer hack

* Fix version

* Improve comment on hack
2020-09-18 14:26:42 +02:00
Ines Montani d32ce121be Fix docs [ci skip] 2020-09-18 13:41:12 +02:00
Adriane Boyd a88106e852 Remove W106: HEAD and SENT_START in doc.from_array (#6086)
* Remove W106: HEAD and SENT_START in doc.from_array

This warning was hacky and being triggered too often.

* Fix test
2020-09-18 03:01:29 +02:00
svlandeg e4fc7e0222 fixing output sample to proper 2D array 2020-09-17 22:34:36 +02:00
Adriane Boyd 8b650f3a78 Modify setting missing and blocked entity tokens
In order to make it easier to construct `Doc` objects as training data,
modify how missing and blocked entity tokens are set to prioritize
setting `O` and missing entity tokens for training purposes over setting
blocked entity tokens.

* `Doc.ents` setter sets tokens outside entity spans to `O` regardless
of the current state of each token

* For `Doc.ents`, setting a span with a missing label sets the `ent_iob`
to missing instead of blocked

* `Doc.block_ents(spans)` marks spans as hard `O` for use with the
`EntityRecognizer`
2020-09-17 21:27:42 +02:00
Ines Montani 9062585a13 Merge pull request #6087 from explosion/docs/pretrain-usage [ci skip] 2020-09-17 19:25:24 +02:00
Ines Montani a0b4389a38 Update docs [ci skip] 2020-09-17 19:24:48 +02:00
Matthew Honnibal 6efb7688a6 Draft pretrain usage 2020-09-17 18:17:03 +02:00
Sofie Van Landeghem ed0fb034cb ml_datasets v0.2.0a0 2020-09-17 18:11:10 +02:00
Ines Montani 1bb8b4f824 Merge branch 'master' into develop 2020-09-17 17:46:20 +02:00
Ines Montani 6bd0d25fb9 Merge pull request #6085 from explosion/docs/static-vectors-intro [ci skip] 2020-09-17 17:14:45 +02:00