Merge remote-tracking branch 'upstream/develop' into feature/update-docs

# Conflicts:
#	website/docs/usage/training.md
svlandeg 2020-08-19 17:58:25 +02:00
commit 169b5bcda0
4 changed files with 89 additions and 16 deletions

View File

@@ -509,8 +509,6 @@ page should be safe to use and we'll try to ensure backwards compatibility.
However, we recommend having additional tests in place if your application
depends on any of spaCy's utilities.
-<!-- TODO: document new config-related util functions? -->
### util.get_lang_class {#util.get_lang_class tag="function"}
Import and load a `Language` class. Allows lazy-loading

View File

@@ -623,7 +623,7 @@ added to the pipeline:
>
> @Language.factory("my_component")
> def my_component(nlp, name):
>     return MyComponent()
> ```
| Argument | Description |
@@ -636,8 +636,6 @@ All other settings can be passed in by the user via the `config` argument on
[`@Language.factory`](/api/language#factory) decorator also lets you define a
`default_config` that's used as a fallback.
-<!-- TODO: add example of passing in a custom Python object via the config based on a registered function -->
```python
### With config {highlight="4,9"}
import spacy
@@ -688,7 +686,7 @@ make your factory a separate function. That's also how spaCy does it internally.
</Accordion>
-### Example: Stateful component with settings
+### Example: Stateful component with settings {#example-stateful-components}
This example shows a **stateful** pipeline component for handling acronyms:
based on a dictionary, it will detect acronyms and their expanded forms in both
@@ -757,6 +755,85 @@ doc = nlp("LOL, be right back")
print(doc._.acronyms)
```
Many stateful components depend on **data resources** like dictionaries and
lookup tables that should ideally be **configurable**. For example, it makes
sense to make the `DICTIONARY` an argument of the registered function, so the
`AcronymComponent` can be re-used with different data. One logical solution
would be to make it an argument of the component factory, and allow it to be
initialized with different dictionaries.
> #### Example
>
> Making the data an argument of the registered function would result in output
> like this in your `config.cfg`, which is typically not what you want (and only
> works for JSON-serializable data).
>
> ```ini
> [components.acronyms.dictionary]
> lol = "laugh out loud"
> brb = "be right back"
> ```
However, passing in the dictionary directly is problematic, because it means
that if a component saves out its config and settings, the
[`config.cfg`](/usage/training#config) will include a dump of the entire data,
since that's the config the component was created with.
```diff
DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
- default_config = {"dictionary:" DICTIONARY}
```
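As a quick illustration of this serialization constraint (a sketch using only the standard library, not spaCy code): plain dictionaries like `DICTIONARY` can be dumped to JSON, but an arbitrary custom object can't, so it has no valid representation in a `config.cfg`:

```python
import json

DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}

# JSON-serializable data can be written out as part of a config dump
print(json.dumps(DICTIONARY))

# A custom object, e.g. a model, has no JSON representation
class CustomModel:
    pass

try:
    json.dumps({"dictionary": CustomModel()})
except TypeError as error:
    print(f"Not serializable: {error}")
```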
If what you're passing in isn't JSON-serializable, e.g. a custom object like a
[model](#trainable-components), saving out the component config becomes
impossible because there's no way for spaCy to know _how_ that object was
created, and what to do to create it again. This makes it much harder to save,
load and train custom models with custom components. A simple solution is to
**register a function** that returns your resources. The
[registry](/api/top-level#registry) lets you **map string names to functions**
that create objects, so given a name and optional arguments, spaCy will know how
to recreate the object. To register a function that returns a custom asset, you
can use the `@spacy.registry.assets` decorator with a single argument, the name:
```python
### Registered function for assets {highlight="1"}
@spacy.registry.assets("acronyms.slang_dict.v1")
def create_acronyms_slang_dict():
dictionary = {"lol": "laughing out loud", "brb": "be right back"}
dictionary.update({value: key for key, value in dictionary.items()})
return dictionary
```
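To make the mechanism concrete, here's a minimal stand-in for such a registry written in plain Python (spaCy's real registry is a separate implementation; this sketch only illustrates the string-name-to-function mapping it relies on):

```python
# A toy registry: maps string names to functions that create objects
registry = {}

def register(name):
    """Decorator that records a creator function under a string name."""
    def wrapper(func):
        registry[name] = func
        return func
    return wrapper

@register("acronyms.slang_dict.v1")
def create_acronyms_slang_dict():
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary

# Given only the string name, the object can be recreated on demand
dictionary = registry["acronyms.slang_dict.v1"]()
print(dictionary["lol"])
```

Because only the *name* ends up in the config, the data itself never needs to be serialized.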
In your `default_config` (and later in your
[training config](/usage/training#config)), you can now refer to the function
registered under the name `"acronyms.slang_dict.v1"` using the `@assets` key.
This tells spaCy how to create the value, and when your component is created,
the result of the registered function is passed in as the key `"dictionary"`.
> #### config.cfg
>
> ```ini
> [components.acronyms]
> factory = "acronyms"
>
> [components.acronyms.dictionary]
> @assets = "acronyms.slang_dict.v1"
> ```
```diff
- default_config = {"dictionary:" DICTIONARY}
+ default_config = {"dictionary": {"@assets": "acronyms.slang_dict.v1"}}
```
Using a registered function also means that you can easily include your custom
components in models that you [train](/usage/training). To make sure spaCy knows
where to find your custom `@assets` function, you can pass in a Python file via
the argument `--code`. If someone else is using your component, all they have to
do to customize the data is to register their own function and swap out the
name. Registered functions can also take **arguments**, which can be defined
in the config as well. You can read more about this in the docs on
[training with custom code](/usage/training#custom-code).
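As a sketch of how such arguments fit in (the function name and the `lowercase` setting here are hypothetical, chosen for illustration): the registered function simply accepts parameters, and the config supplies their values alongside the registry name:

```python
# Hypothetical registered function taking an argument. In a config, the
# argument could be supplied next to the registry reference, e.g.:
#   [components.acronyms.dictionary]
#   @assets = "acronyms.slang_dict.v2"
#   lowercase = true
def create_slang_dict(lowercase: bool = False):
    dictionary = {"LOL": "laughing out loud", "BRB": "be right back"}
    if lowercase:
        dictionary = {key.lower(): value for key, value in dictionary.items()}
    return dictionary

print(create_slang_dict(lowercase=True))
```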
### Python type hints and pydantic validation {#type-hints new="3"}
spaCy's configs are powered by our machine learning library Thinc's
@@ -994,7 +1071,7 @@ loss is calculated and to add evaluation scores to the training output.
| [`get_loss`](/api/pipe#get_loss) | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects. |
| [`score`](/api/pipe#score) | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_score_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score. |
-<!-- TODO: add more details, examples and maybe an example project -->
+<!-- TODO: link to (not yet created) page for defining models for trainable components -->
## Extension attributes {#custom-components-attributes new="2"}

View File

@@ -97,7 +97,7 @@ to download and where to put them. The
[`spacy project assets`](/api/cli#project-assets) will fetch the project assets
for you:
-```
+```cli
$ cd some_example_project
$ python -m spacy project assets
```

View File

@@ -414,7 +414,7 @@ recipe once the dish has already been prepared. You have to make a new one.
spaCy includes a variety of built-in [architectures](/api/architectures) for
different tasks. For example:
-<!-- TODO: -->
+<!-- TODO: select example architectures to showcase -->
| Architecture | Description |
| ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -776,12 +776,11 @@ mattis pretium.
### Defining custom architectures {#custom-architectures}
-<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
-<!-- TODO: Wrapping PyTorch and TensorFlow -->
+<!-- TODO: this should probably move to new section on models -->
## Transfer learning {#transfer-learning}
-<!-- TODO: link to embeddings and transformers page -->
+<!-- TODO: write something, link to embeddings and transformers page should probably wait until transformers/embeddings/transfer learning docs are done -->
### Using transformer models like BERT {#transformers}
@@ -811,7 +810,7 @@ config and customize the implementations, see the usage guide on
### Pretraining with spaCy {#pretraining}
-<!-- TODO: document spacy pretrain, objectives etc. -->
+<!-- TODO: document spacy pretrain, objectives etc. should probably wait until transformers/embeddings/transfer learning docs are done -->
## Parallel Training with Ray {#parallel-training}
@@ -836,9 +835,8 @@ spaCy gives you full control over the training loop. However, for most use
cases, it's recommended to train your models via the
[`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
track of your settings and hyperparameters, instead of writing your own training
-scripts from scratch.
-[Custom registered functions](/usage/training/#custom-code) should typically
-give you everything you need to train fully custom models with
+scripts from scratch. [Custom registered functions](#custom-code) should
+typically give you everything you need to train fully custom models with
[`spacy train`](/api/cli#train).
</Infobox>