document Pipe API details, crossreferences etc

svlandeg 2020-09-09 15:56:27 +02:00
parent 9a7c6cc61a
commit a8aa9a8068
3 changed files with 94 additions and 13 deletions

@@ -205,9 +205,16 @@ examples can either be the full training data or a representative sample. They
 are used to **initialize the models** of trainable pipeline components and are
 passed to each component's [`begin_training`](/api/pipe#begin_training) method, if
 available. Initialization includes validating the network,
-[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+[inferring missing shapes](/usage/layers-architectures#shape-inference) and
 setting up the label scheme based on the data.
+If no `get_examples` function is provided when calling `nlp.begin_training`, the
+pipeline components will be initialized with generic data. In this case, it is
+crucial that the output dimension of each component has already been defined
+either in the [config](/usage/training#config), or by calling
+[`pipe.add_label`](/api/pipe#add_label) for each possible output label (e.g. for
+the tagger or textcat).
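
The added paragraph says that a component's output dimension must come from somewhere: a data sample, the config, or labels added by hand. A toy sketch of that decision follows. It is not spaCy code: `resolve_output_dim`, its parameters, the `(text, label)` example format and the precedence between the three sources are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

# Hypothetical example format for this sketch: (text, label) pairs.
Example = Tuple[str, str]


def resolve_output_dim(
    labels: Sequence[str],
    config_nO: Optional[int] = None,
    get_examples: Optional[Callable[[], Iterable[Example]]] = None,
) -> int:
    """Toy illustration of where an output dimension can come from."""
    if get_examples is not None:
        # A data sample is available: merge its labels with any preset ones.
        found = {label for _, label in get_examples()} | set(labels)
        return len(found)
    if config_nO is not None:
        # No sample, but the dimension was fixed in the (hypothetical) config.
        return config_nO
    if labels:
        # No sample and no config value: rely on labels added beforehand.
        return len(labels)
    raise ValueError("no examples, no configured dimension and no labels")


sample = [("good film", "POS"), ("bad film", "NEG")]
print(resolve_output_dim([], get_examples=lambda: sample))
print(resolve_output_dim(["POS", "NEG", "NEUTRAL"]))
```

The last branch is the case the paragraph warns about: with generic data and nothing predefined, there is no way to size the output layer.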
 <Infobox variant="warning" title="Changed in v3.0">
 The `Language.update` method now takes a **function** that is called with no

@@ -286,9 +286,6 @@ context, the original parameters are restored.
 ## Pipe.add_label {#add_label tag="method"}
-Add a new label to the pipe. It's possible to extend trained models with new
-labels, but care should be taken to avoid the "catastrophic forgetting" problem.
 > #### Example
 >
 > ```python
@@ -296,10 +293,85 @@ labels, but care should be taken to avoid the "catastrophic forgetting" problem.
 > pipe.add_label("MY_LABEL")
 > ```
+<Infobox variant="danger">
+This method needs to be overridden with your own custom `add_label` method.
+</Infobox>
+Add a new label to the pipe, to be predicted by the model. The actual
+implementation depends on the specific component, but in general `add_label`
+shouldn't be called if the output dimension is already set, or if the model has
+already been fully [initialized](#begin_training). If these conditions are
+violated, the function will raise an error. The exception to this rule is when
+the component is [resizable](#is_resizable), in which case
+[`set_output`](#set_output) should be called to ensure that the model is
+properly resized.
-| Name | Description |
-| ----------- | ----------------------------------------------------------- |
-| `label` | The label to add. ~~str~~ |
-| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
+| Name | Description |
+| ----------- | ------------------------------------------------------- |
+| `label` | The label to add. ~~str~~ |
+| **RETURNS** | 0 if the label is already present, otherwise 1. ~~int~~ |
+Note that in general, you don't have to call `pipe.add_label` if you provide a
+representative data sample to the [`begin_training`](#begin_training) method. In
+this case, all labels found in the sample will be automatically added to the
+model, and the output dimension will be
+[inferred](/usage/layers-architectures#shape-inference) automatically.
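
The `add_label` contract described in this section can be sketched with a small stand-in class. This is a minimal illustration, not spaCy's implementation: `ToyPipe`, its attributes, the `(text, label)` example format and the use of `ValueError` are hypothetical choices, since the text only says that an error is raised.

```python
class ToyPipe:
    """Hypothetical stand-in for a trainable component."""

    def __init__(self, resizable: bool = False) -> None:
        self.labels = []          # label scheme, in insertion order
        self.initialized = False  # set to True by begin_training
        self.resizable = resizable
        self.nO = None            # output dimension, unknown until set

    def add_label(self, label: str) -> int:
        if label in self.labels:
            return 0  # already present, nothing changes
        fixed = self.nO is not None or self.initialized
        if fixed and not self.resizable:
            raise ValueError(f"cannot add {label!r}: output dimension is fixed")
        self.labels.append(label)
        if self.nO is not None:
            # Stands in for resizing the model via set_output.
            self.nO = len(self.labels)
        return 1

    def begin_training(self, get_examples=None) -> None:
        if get_examples is not None:
            # Labels found in the sample are added automatically.
            for _, label in get_examples():
                self.add_label(label)
        self.nO = len(self.labels)
        self.initialized = True


pipe = ToyPipe()
pipe.add_label("A")
pipe.begin_training(lambda: [("some text", "B")])
print(pipe.labels, pipe.nO)
```

After `begin_training`, a further `pipe.add_label("C")` raises in this sketch, while a `ToyPipe(resizable=True)` would accept it and grow its output dimension.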
+## Pipe.is_resizable {#is_resizable tag="method"}
+> #### Example
+>
+> ```python
+> can_resize = pipe.is_resizable()
+> ```
+Check whether or not the output dimension of the component's model can be
+resized. If this method returns `True`, [`set_output`](#set_output) can be
+called to change the model's output dimension.
+| Name | Description |
+| ----------- | ---------------------------------------------------------------------------------------------- |
+| **RETURNS** | Whether or not the output dimension of the model can be changed after initialization. ~~bool~~ |
+> #### Example
+>
+> ```python
+> def custom_resize(model, new_nO):
+>     # adjust model
+>     return model
+> custom_model.attrs["resize_output"] = custom_resize
+> ```
+For built-in components that are not resizable, you have to create and train a
+new model from scratch with the appropriate architecture and output dimension.
+For custom components, you can implement a `resize_output` function and add it
+as an attribute to the component's model.
+## Pipe.set_output {#set_output tag="method"}
+Change the output dimension of the component's model. If the component is not
+[resizable](#is_resizable), this method will raise a `NotImplementedError`.
+If a component is resizable, the model's attribute `resize_output` will be
+called. This is a function that takes the original model and the new output
+dimension `nO`, and changes the model in place.
+When resizing an already trained model, care should be taken to avoid the
+"catastrophic forgetting" problem.
+> #### Example
+>
+> ```python
+> if pipe.is_resizable():
+>     pipe.set_output(512)
+> ```
+| Name | Description |
+| ---- | --------------------------------- |
+| `nO` | The new output dimension. ~~int~~ |
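
How `is_resizable`, `set_output` and the `resize_output` attribute fit together can be sketched as below. The classes `ToyModel` and `ToyComponent` are hypothetical stand-ins written for this note. Only the `attrs["resize_output"]` convention, the `(model, new_nO)` signature and the `NotImplementedError` come from the text above, and the way the weights are grown is an assumption.

```python
from typing import Any, Dict, List


class ToyModel:
    """Hypothetical minimal model: one weight row per output class."""

    def __init__(self, nO: int, nI: int) -> None:
        self.attrs: Dict[str, Any] = {}
        self.nI = nI
        self.weights: List[List[float]] = [[0.1] * nI for _ in range(nO)]


def custom_resize(model: ToyModel, new_nO: int) -> ToyModel:
    """Grow the output layer in place, leaving existing rows untouched."""
    old_nO = len(model.weights)
    if new_nO < old_nO:
        raise ValueError("this sketch only supports growing the output")
    for _ in range(new_nO - old_nO):
        model.weights.append([0.0] * model.nI)
    return model


class ToyComponent:
    """Hypothetical component wrapping a model."""

    def __init__(self, model: ToyModel) -> None:
        self.model = model

    def is_resizable(self) -> bool:
        # Resizable only if the model carries a resize function.
        return "resize_output" in self.model.attrs

    def set_output(self, nO: int) -> None:
        if not self.is_resizable():
            raise NotImplementedError("model has no resize_output attribute")
        self.model.attrs["resize_output"](self.model, nO)


model = ToyModel(nO=2, nI=3)
pipe = ToyComponent(model)
model.attrs["resize_output"] = custom_resize
if pipe.is_resizable():
    pipe.set_output(4)
print(len(model.weights))
```

Keeping the existing rows only preserves the old weights. It does not by itself address catastrophic forgetting during further training.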
 ## Pipe.to_disk {#to_disk tag="method"}

@@ -382,9 +382,11 @@ contrast to how the PyTorch layers are defined, where `in_features` precedes
 ### Shape inference in thinc {#shape-inference}
 It is not strictly necessary to define all the input and output dimensions for
-each layer, as Thinc can perform shape inference between sequential layers by
-matching up the output dimensionality of one layer to the input dimensionality
-of the next. This means that we can simplify the `layers` definition:
+each layer, as Thinc can perform
+[shape inference](https://thinc.ai/docs/usage-models#validation) between
+sequential layers by matching up the output dimensionality of one layer to the
+input dimensionality of the next. This means that we can simplify the `layers`
+definition:
 ```python
 with Model.define_operators({">>": chain}):
@@ -399,8 +401,8 @@ with Model.define_operators({">>": chain}):
 Thinc can go one step further and deduce the correct input dimension of the
 first layer, and output dimension of the last. To enable this functionality, you
-can call [`model.initialize`](https://thinc.ai/docs/api-model#initialize) with
-an input sample `X` and an output sample `Y` with the correct dimensions.
+have to call [`model.initialize`](https://thinc.ai/docs/api-model#initialize)
+with an input sample `X` and an output sample `Y` with the correct dimensions.
 ```python
 with Model.define_operators({">>": chain}):