Merge remote-tracking branch 'upstream/develop' into feature/update-docs

# Conflicts:
#	website/docs/usage/training.md
2020-08-19 17:58:25 +02:00 · 2020-08-19 17:58:25 +02:00 · 169b5bcda0
parent 7119295a8a 63921161c8
commit 169b5bcda0
4 changed files with 89 additions and 16 deletions
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -509,8 +509,6 @@ page should be safe to use and we'll try to ensure backwards compatibility.
 However, we recommend having additional tests in place if your application
 depends on any of spaCy's utilities.

-<!-- TODO: document new config-related util functions? -->
-
 ### util.get_lang_class {#util.get_lang_class tag="function"}

 Import and load a `Language` class. Allows lazy-loading
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -623,7 +623,7 @@ added to the pipeline:
 >
 > @Language.factory("my_component")
 > def my_component(nlp, name):
->    return MyComponent()
+>     return MyComponent()
 > ```

 | Argument | Description                                                                                                                       |
@@ -636,8 +636,6 @@ All other settings can be passed in by the user via the `config` argument on
 [`@Language.factory`](/api/language#factory) decorator also lets you define a
 `default_config` that's used as a fallback.

-<!-- TODO: add example of passing in a custom Python object via the config based on a registered function -->
-
 ```python
 ### With config {highlight="4,9"}
 import spacy
@@ -688,7 +686,7 @@ make your factory a separate function. That's also how spaCy does it internally.

 </Accordion>

-### Example: Stateful component with settings
+### Example: Stateful component with settings {#example-stateful-components}

 This example shows a **stateful** pipeline component for handling acronyms:
 based on a dictionary, it will detect acronyms and their expanded forms in both
@@ -757,6 +755,85 @@ doc = nlp("LOL, be right back")
 print(doc._.acronyms)
 ```

+Many stateful components depend on **data resources** like dictionaries and
+lookup tables that should ideally be **configurable**. For example, it makes
+sense to make the `DICTIONARY` an argument of the registered function, so the
+`AcronymComponent` can be re-used with different data. One logical solution
+would be to make it an argument of the component factory, and allow it to be
+initialized with different dictionaries.
+
+> #### Example
+>
+> Making the data an argument of the registered function would result in output
+> like this in your `config.cfg`, which is typically not what you want (and only
+> works for JSON-serializable data).
+>
+> ```ini
+> [components.acronyms.dictionary]
+> lol = "laughing out loud"
+> brb = "be right back"
+> ```
+
+However, passing in the dictionary directly is problematic, because it means
+that if a component saves out its config and settings, the
+[`config.cfg`](/usage/training#config) will include a dump of the entire data,
+since that's the config the component was created with.
+
+```diff
+DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
+- default_config = {"dictionary": DICTIONARY}
+```
+
+If what you're passing in isn't JSON-serializable – e.g. a custom object like a
+[model](#trainable-components) – saving out the component config becomes
+impossible because there's no way for spaCy to know _how_ that object was
+created, and what to do to create it again. This makes it much harder to save,
+load and train custom models with custom components. A simple solution is to
+**register a function** that returns your resources. The
+[registry](/api/top-level#registry) lets you **map string names to functions**
+that create objects, so given a name and optional arguments, spaCy will know how
+to recreate the object. To register a function that returns a custom asset, you
+can use the `@spacy.registry.assets` decorator with a single argument, the name:
+
+```python
+### Registered function for assets {highlight="1"}
+@spacy.registry.assets("acronyms.slang_dict.v1")
+def create_acronyms_slang_dict():
+    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
+    dictionary.update({value: key for key, value in dictionary.items()})
+    return dictionary
+```
+
+In your `default_config` (and later in your
+[training config](/usage/training#config)), you can now refer to the function
+registered under the name `"acronyms.slang_dict.v1"` using the `@assets` key.
+This tells spaCy how to create the value, and when your component is created,
+the result of the registered function is passed in as the key `"dictionary"`.
+
+> #### config.cfg
+>
+> ```ini
+> [components.acronyms]
+> factory = "acronyms"
+>
+> [components.acronyms.dictionary]
+> @assets = "acronyms.slang_dict.v1"
+> ```
+
+```diff
+- default_config = {"dictionary": DICTIONARY}
++ default_config = {"dictionary": {"@assets": "acronyms.slang_dict.v1"}}
+```
+
+Using a registered function also means that you can easily include your custom
+components in models that you [train](/usage/training). To make sure spaCy knows
+where to find your custom `@assets` function, you can pass in a Python file via
+the argument `--code`. If someone else is using your component, all they have to
+do to customize the data is to register their own function and swap out the
+name. Registered functions can also take **arguments**, which can be defined in
+the config as well. You can read more about this in the docs on
+[training with custom code](/usage/training#custom-code).
+
 ### Python type hints and pydantic validation {#type-hints new="3"}

 spaCy's configs are powered by our machine learning library Thinc's
@@ -994,7 +1071,7 @@ loss is calculated and to add evaluation scores to the training output.
 | [`get_loss`](/api/pipe#get_loss)             | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects.                                                                                                                                                                                                                      |
 | [`score`](/api/pipe#score)                   | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_score_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score. |

-<!-- TODO: add more details, examples and maybe an example project -->
+<!-- TODO: link to (not yet created) page for defining models for trainable components -->

 ## Extension attributes {#custom-components-attributes new="2"}

--- a/website/docs/usage/projects.md
+++ b/website/docs/usage/projects.md
@@ -97,7 +97,7 @@ to download and where to put them. The
 [`spacy project assets`](/api/cli#project-assets) will fetch the project assets
 for you:

-```
+```cli
 $ cd some_example_project
 $ python -m spacy project assets
 ```
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -414,7 +414,7 @@ recipe once the dish has already been prepared. You have to make a new one.
 spaCy includes a variety of built-in [architectures](/api/architectures) for
 different tasks. For example:

-<!-- TODO: -->
+<!-- TODO: select example architectures to showcase -->

 | Architecture                                    | Description                                                                                                                                                            |
 | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -776,12 +776,11 @@ mattis pretium.

 ### Defining custom architectures {#custom-architectures}

-<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
-<!-- TODO: Wrapping PyTorch and TensorFlow -->
+<!-- TODO: this should probably move to new section on models -->

 ## Transfer learning {#transfer-learning}

-<!-- TODO: link to embeddings and transformers page -->
+<!-- TODO: write something, link to embeddings and transformers page – should probably wait until transformers/embeddings/transfer learning docs are done -->

 ### Using transformer models like BERT {#transformers}

@@ -811,7 +810,7 @@ config and customize the implementations, see the usage guide on

 ### Pretraining with spaCy {#pretraining}

-<!-- TODO: document spacy pretrain, objectives etc. -->
+<!-- TODO: document spacy pretrain, objectives etc. – should probably wait until transformers/embeddings/transfer learning docs are done -->

 ## Parallel Training with Ray {#parallel-training}

@@ -836,9 +835,8 @@ spaCy gives you full control over the training loop. However, for most use
 cases, it's recommended to train your models via the
 [`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
 track of your settings and hyperparameters, instead of writing your own training
-scripts from scratch.
-[Custom registered functions](/usage/training/#custom-code) should typically
-give you everything you need to train fully custom models with
+scripts from scratch. [Custom registered functions](#custom-code) should
+typically give you everything you need to train fully custom models with
 [`spacy train`](/api/cli#train).

 </Infobox>