Merge remote-tracking branch 'upstream/develop' into feature/update-docs

# Conflicts:
#   website/docs/usage/training.md
2020-08-19 17:58:25 +02:00 · 2020-08-19 17:58:25 +02:00 · 169b5bcda0
parent 7119295a8a 63921161c8
commit 169b5bcda0
4 changed files with 89 additions and 16 deletions
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -509,8 +509,6 @@ page should be safe to use and we'll try to ensure backwards compatibility.
 However, we recommend having additional tests in place if your application
 depends on any of spaCy's utilities.
 <!-- TODO: document new config-related util functions? -->
 ### util.get_lang_class {#util.get_lang_class tag="function"}
 Import and load a `Language` class. Allows lazy-loading
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -623,7 +623,7 @@ added to the pipeline:
 >
 > @Language.factory("my_component")
 > def my_component(nlp, name):
->    return MyComponent()
+>     return MyComponent()
 > ```
 | Argument | Description                                                                                                                       |
@@ -636,8 +636,6 @@ All other settings can be passed in by the user via the `config` argument on
 [`@Language.factory`](/api/language#factory) decorator also lets you define a
 `default_config` that's used as a fallback.
 <!-- TODO: add example of passing in a custom Python object via the config based on a registered function -->
 ```python
 ### With config {highlight="4,9"}
 import spacy
@@ -688,7 +686,7 @@ make your factory a separate function. That's also how spaCy does it internally.
 </Accordion>
-### Example: Stateful component with settings
+### Example: Stateful component with settings {#example-stateful-components}
 This example shows a **stateful** pipeline component for handling acronyms:
 based on a dictionary, it will detect acronyms and their expanded forms in both
@@ -757,6 +755,85 @@ doc = nlp("LOL, be right back")
 print(doc._.acronyms)
 ```
 Many stateful components depend on **data resources** like dictionaries and
 lookup tables that should ideally be **configurable**. For example, it makes
 sense to make the `DICTIONARY` an argument of the registered function, so the
 `AcronymComponent` can be re-used with different data. One logical solution
 would be to make it an argument of the component factory, and allow it to be
 initialized with different dictionaries.
 > #### Example
 >
 > Making the data an argument of the registered function would result in output
 > like this in your `config.cfg`, which is typically not what you want (and only
 > works for JSON-serializable data).
 >
 > ```ini
 > [components.acronyms.dictionary]
 > lol = "laugh out loud"
 > brb = "be right back"
 > ```
 However, passing in the dictionary directly is problematic, because it means
 that if a component saves out its config and settings, the
 [`config.cfg`](/usage/training#config) will include a dump of the entire data,
 since that's the config the component was created with.
 ```diff
 DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
 - default_config = {"dictionary": DICTIONARY}
 ```
 If what you're passing in isn't JSON-serializable – e.g. a custom object like a
 [model](#trainable-components) – saving out the component config becomes
 impossible because there's no way for spaCy to know _how_ that object was
 created, and what to do to create it again. This makes it much harder to save,
 load and train custom models with custom components. A simple solution is to
 **register a function** that returns your resources. The
 [registry](/api/top-level#registry) lets you **map string names to functions**
 that create objects, so given a name and optional arguments, spaCy will know how
 to recreate the object. To register a function that returns a custom asset, you
 can use the `@spacy.registry.assets` decorator with a single argument, the name:
 ```python
 ### Registered function for assets {highlight="1"}
 @spacy.registry.assets("acronyms.slang_dict.v1")
 def create_acronyms_slang_dict():
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary
 ```
 In your `default_config` (and later in your
 [training config](/usage/training#config)), you can now refer to the function
 registered under the name `"acronyms.slang_dict.v1"` using the `@assets` key.
 This tells spaCy how to create the value, and when your component is created,
 the result of the registered function is passed in as the key `"dictionary"`.
 > #### config.cfg
 >
 > ```ini
 > [components.acronyms]
 > factory = "acronyms"
 >
 > [components.acronyms.dictionary]
 > @assets = "acronyms.slang_dict.v1"
 > ```
 ```diff
 - default_config = {"dictionary": DICTIONARY}
 + default_config = {"dictionary": {"@assets": "acronyms.slang_dict.v1"}}
 ```
 Using a registered function also means that you can easily include your custom
 components in models that you [train](/usage/training). To make sure spaCy knows
 where to find your custom `@assets` function, you can pass in a Python file via
 the argument `--code`. If someone else is using your component, all they have to
 do to customize the data is to register their own function and swap out the
 name. Registered functions can also take **arguments**, which can be defined in
 the config as well – you can read more about this in the docs on
 [training with custom code](/usage/training#custom-code).
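To make the idea of mapping string names to functions more concrete, here is a minimal sketch in plain Python. It does not use spaCy and is not how spaCy resolves configs internally: `ASSETS`, `register_asset`, `resolve` and the `uppercase` argument are hypothetical names chosen for illustration. It only shows why a config that stores a function name (plus optional arguments) stays small and serializable, while the actual data is recreated on demand.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for a function registry: string name -> function.
ASSETS: Dict[str, Callable[..., Any]] = {}


def register_asset(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that stores a function under a string name."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        ASSETS[name] = func
        return func
    return decorator


@register_asset("acronyms.slang_dict.v1")
def create_acronyms_slang_dict(uppercase: bool = False) -> Dict[str, str]:
    # `uppercase` is an illustrative argument, not part of the original example.
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    if uppercase:
        dictionary = {key.upper(): value for key, value in dictionary.items()}
    # Add the reverse mapping so expanded forms map back to acronyms.
    dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary


def resolve(config: Dict[str, Any]) -> Dict[str, Any]:
    """Replace each {"@assets": name, ...} block with the function's result."""
    resolved: Dict[str, Any] = {}
    for key, value in config.items():
        if isinstance(value, dict) and "@assets" in value:
            # Every other key in the block is passed to the function as an argument.
            kwargs = {k: v for k, v in value.items() if k != "@assets"}
            resolved[key] = ASSETS[value["@assets"]](**kwargs)
        elif isinstance(value, dict):
            resolved[key] = resolve(value)
        else:
            resolved[key] = value
    return resolved


# The config only stores the name, never the data itself.
default_config = {"dictionary": {"@assets": "acronyms.slang_dict.v1"}}
settings = resolve(default_config)

# Arguments defined next to the name are forwarded to the registered function.
custom = resolve(
    {"dictionary": {"@assets": "acronyms.slang_dict.v1", "uppercase": True}}
)
```

In this sketch, swapping in different data only requires registering another function and changing the name in the config; the config itself never contains the dictionary.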
 ### Python type hints and pydantic validation {#type-hints new="3"}
 spaCy's configs are powered by our machine learning library Thinc's
@@ -994,7 +1071,7 @@ loss is calculated and to add evaluation scores to the training output.
 | [`get_loss`](/api/pipe#get_loss)             | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects.                                                                                                                                                                                                                      |
 | [`score`](/api/pipe#score)                   | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_score_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score. |
-<!-- TODO: add more details, examples and maybe an example project -->
+<!-- TODO: link to (not yet created) page for defining models for trainable components -->
 ## Extension attributes {#custom-components-attributes new="2"}
--- a/website/docs/usage/projects.md
+++ b/website/docs/usage/projects.md
@@ -97,7 +97,7 @@ to download and where to put them. The
 [`spacy project assets`](/api/cli#project-assets) will fetch the project assets
 for you:
-```
+```cli
 $ cd some_example_project
 $ python -m spacy project assets
 ```
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -414,7 +414,7 @@ recipe once the dish has already been prepared. You have to make a new one.
 spaCy includes a variety of built-in [architectures](/api/architectures) for
 different tasks. For example:
-<!-- TODO: -->
+<!-- TODO: select example architectures to showcase -->
 | Architecture                                    | Description                                                                                                                                                            |
 | ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -776,12 +776,11 @@ mattis pretium.
 ### Defining custom architectures {#custom-architectures}
-<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
+<!-- TODO: this should probably move to new section on models -->
 <!-- TODO: Wrapping PyTorch and TensorFlow -->
 ## Transfer learning {#transfer-learning}
-<!-- TODO: link to embeddings and transformers page -->
+<!-- TODO: write something, link to embeddings and transformers page – should probably wait until transformers/embeddings/transfer learning docs are done -->
 ### Using transformer models like BERT {#transformers}
@@ -811,7 +810,7 @@ config and customize the implementations, see the usage guide on
 ### Pretraining with spaCy {#pretraining}
-<!-- TODO: document spacy pretrain, objectives etc. -->
+<!-- TODO: document spacy pretrain, objectives etc. – should probably wait until transformers/embeddings/transfer learning docs are done -->
 ## Parallel Training with Ray {#parallel-training}
@@ -836,9 +835,8 @@ spaCy gives you full control over the training loop. However, for most use
 cases, it's recommended to train your models via the
 [`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
 track of your settings and hyperparameters, instead of writing your own training
-scripts from scratch.
-[Custom registered functions](/usage/training/#custom-code) should typically
-give you everything you need to train fully custom models with
+scripts from scratch. [Custom registered functions](#custom-code) should
+typically give you everything you need to train fully custom models with
 [`spacy train`](/api/cli#train).
 </Infobox>