From a8aa9a806818955d3bbecd0413878d6892ff8002 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Wed, 9 Sep 2020 15:56:27 +0200
Subject: [PATCH] document Pipe API details, crossreferences etc

---
 website/docs/api/language.md               |  9 ++-
 website/docs/api/pipe.md                   | 86 ++++++++++++++++++++--
 website/docs/usage/layers-architectures.md | 12 +--
 3 files changed, 94 insertions(+), 13 deletions(-)

diff --git a/website/docs/api/language.md b/website/docs/api/language.md
index 7799f103b..9c9ccb6cf 100644
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@@ -205,9 +205,16 @@ examples can either be the full training data or a representative sample. They
 are used to **initialize the models** of trainable pipeline components and are
 passed each component's [`begin_training`](/api/pipe#begin_training) method, if
 available. Initialization includes validating the network,
-[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+[inferring missing shapes](/usage/layers-architectures#shape-inference) and
 setting up the label scheme based on the data.
 
+If no `get_examples` function is provided when calling `nlp.begin_training`, the
+pipeline components will be initialized with generic data. In this case, it is
+crucial that the output dimension of each component has already been defined
+either in the [config](/usage/training#config), or by calling
+[`pipe.add_label`](/api/pipe#add_label) for each possible output label (e.g. for
+the tagger or textcat).
+
 The `Language.update` method now takes a **function** that is called with no
diff --git a/website/docs/api/pipe.md b/website/docs/api/pipe.md
index 57b2af44d..7b77141fa 100644
--- a/website/docs/api/pipe.md
+++ b/website/docs/api/pipe.md
@@ -286,9 +286,6 @@ context, the original parameters are restored.
 
 ## Pipe.add_label {#add_label tag="method"}
 
-Add a new label to the pipe. It's possible to extend trained models with new
-labels, but care should be taken to avoid the "catastrophic forgetting" problem.
-
 > #### Example
 >
 > ```python
@@ -296,10 +293,85 @@ labels, but care should be taken to avoid the "catastrophic forgetting" problem.
 > pipe.add_label("MY_LABEL")
 > ```
 
-| Name        | Description                                                 |
-| ----------- | ----------------------------------------------------------- |
-| `label`     | The label to add. ~~str~~                                   |
-| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
+
+
+This method needs to be overwritten with your own custom `add_label` method.
+
+
+
+Add a new label to the pipe, to be predicted by the model. The actual
+implementation depends on the specific component, but in general `add_label`
+shouldn't be called if the output dimension is already set, or if the model has
+already been fully [initialized](#begin_training). If these conditions are
+violated, the function will raise an Error. The exception to this rule is when
+the component is [resizable](#is_resizable), in which case
+[`set_output`](#set_output) should be called to ensure that the model is
+properly resized.
+
+| Name        | Description                                             |
+| ----------- | ------------------------------------------------------- |
+| `label`     | The label to add. ~~str~~                               |
+| **RETURNS** | 0 if the label is already present, otherwise 1. ~~int~~ |
+
+Note that in general, you don't have to call `pipe.add_label` if you provide a
+representative data sample to the [`begin_training`](#begin_training) method. In
+this case, all labels found in the sample will be automatically added to the
+model, and the output dimension will be
+[inferred](/usage/layers-architectures#shape-inference) automatically.
+
+## Pipe.is_resizable {#is_resizable tag="method"}
+
+> #### Example
+>
+> ```python
+> can_resize = pipe.is_resizable()
+> ```
+
+Check whether or not the output dimension of the component's model can be
+resized. If this method returns `True`, [`set_output`](#set_output) can be
+called to change the model's output dimension.
+
+| Name        | Description                                                                                    |
+| ----------- | ---------------------------------------------------------------------------------------------- |
+| **RETURNS** | Whether or not the output dimension of the model can be changed after initialization. ~~bool~~ |
+
+> #### Example
+>
+> ```python
+> def custom_resize(model, new_nO):
+>     # adjust model
+>     return model
+> custom_model.attrs["resize_output"] = custom_resize
+> ```
+
+For built-in components that are not resizable, you have to create and train a
+new model from scratch with the appropriate architecture and output dimension.
+
+For custom components, you can implement a `resize_output` function and add it
+as an attribute to the component's model.
+
+## Pipe.set_output {#set_output tag="method"}
+
+Change the output dimension of the component's model. If the component is not
+[resizable](#is_resizable), this method will throw a `NotImplementedError`.
+
+If a component is resizable, the model's attribute `resize_output` will be
+called. This is a function that takes the original model and the new output
+dimension `nO`, and changes the model in place.
+
+When resizing an already trained model, care should be taken to avoid the
+"catastrophic forgetting" problem.
+
+> #### Example
+>
+> ```python
+> if pipe.is_resizable():
+>     pipe.set_output(512)
+> ```
+
+| Name | Description                       |
+| ---- | --------------------------------- |
+| `nO` | The new output dimension. ~~int~~ |
 
 ## Pipe.to_disk {#to_disk tag="method"}
 
diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md
index 1e39ffb9a..95afe3239 100644
--- a/website/docs/usage/layers-architectures.md
+++ b/website/docs/usage/layers-architectures.md
@@ -382,9 +382,11 @@ contrast to how the PyTorch layers are defined, where `in_features` precedes
 ### Shape inference in thinc {#shape-inference}
 
 It is not strictly necessary to define all the input and output dimensions for
-each layer, as Thinc can perform shape inference between sequential layers by
-matching up the output dimensionality of one layer to the input dimensionality
-of the next. This means that we can simplify the `layers` definition:
+each layer, as Thinc can perform
+[shape inference](https://thinc.ai/docs/usage-models#validation) between
+sequential layers by matching up the output dimensionality of one layer to the
+input dimensionality of the next. This means that we can simplify the `layers`
+definition:
 
 ```python
 with Model.define_operators({">>": chain}):
@@ -399,8 +401,8 @@ with Model.define_operators({">>": chain}):
 
 Thinc can go one step further and deduce the correct input dimension of the
 first layer, and output dimension of the last. To enable this functionality, you
-can call [`model.initialize`](https://thinc.ai/docs/api-model#initialize) with
-an input sample `X` and an output sample `Y` with the correct dimensions.
+have to call [`model.initialize`](https://thinc.ai/docs/api-model#initialize)
+with an input sample `X` and an output sample `Y` with the correct dimensions.
 
 ```python
 with Model.define_operators({">>": chain}):