---
title: Layers and Model Architectures
teaser: Power spaCy components with custom neural networks
menu:
  - ['Type Signatures', 'type-sigs']
  - ['Swapping Architectures', 'swap-architectures']
  - ['PyTorch & TensorFlow', 'frameworks']
  - ['Custom Thinc Models', 'thinc']
  - ['Trainable Components', 'components']
next: /usage/projects
---
> #### Example
>
> ```python
> import spacy
> from thinc.api import Model, chain
>
> @spacy.registry.architectures.register("model.v1")
> def build_model(width: int, classes: int) -> Model:
>     tok2vec = build_tok2vec(width)
>     output_layer = build_output_layer(width, classes)
>     model = chain(tok2vec, output_layer)
>     return model
> ```

A **model architecture** is a function that wires up a
[Thinc `Model`](https://thinc.ai/docs/api-model) instance. It describes the
neural network that is run internally as part of a component in a spaCy
pipeline. To define the actual architecture, you can implement your logic in
Thinc directly, or you can use Thinc as a thin wrapper around frameworks such as
PyTorch, TensorFlow and MXNet. Each `Model` can also be used as a sublayer of a
larger network, allowing you to freely combine implementations from different
frameworks into a single model.

spaCy's built-in components require a `Model` instance to be passed to them via
the config system. To change the model architecture of an existing component,
you just need to [**update the config**](#swap-architectures) so that it refers
to a different registered function. Once the component has been created from
this config, you won't be able to change it anymore. The architecture is like a
recipe for the network, and you can't change the recipe once the dish has
already been prepared. You have to make a new one.
```ini
### config.cfg (excerpt)
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "model.v1"
width = 512
classes = 16
```
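
To make the relationship between the config block and the registered function more concrete, here is a minimal, framework-free sketch of the idea. It is not how spaCy implements its registries: `ARCHITECTURES`, `register` and `resolve` are hypothetical names used only for illustration, and a plain dictionary stands in for the Thinc `Model`. The `@architectures` key selects the function, and the remaining keys are passed to it as keyword arguments.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for a function registry: maps names to functions.
ARCHITECTURES: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that stores a function under the given name."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        ARCHITECTURES[name] = func
        return func

    return decorator


@register("model.v1")
def build_model(width: int, classes: int) -> Dict[str, int]:
    # A real architecture would return a Thinc Model; a dict keeps the sketch runnable.
    return {"width": width, "classes": classes}


def resolve(block: Dict[str, Any]) -> Any:
    """Look up the registered function and call it with the other settings."""
    settings = dict(block)
    func = ARCHITECTURES[settings.pop("@architectures")]
    return func(**settings)


# Mirrors the [components.tagger.model] block from the config excerpt.
model_block = {"@architectures": "model.v1", "width": 512, "classes": 16}
model = resolve(model_block)
print(model)
```

Because the config only stores a name and settings, pointing it at a different registered function is enough to build a different network.
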
## Type signatures {#type-sigs}
> #### Example
>
> ```python
> from typing import List
> from spacy.tokens import Doc
> from thinc.api import Model, chain
> from thinc.types import Floats2d
>
> def chain_model(
>     tok2vec: Model[List[Doc], List[Floats2d]],
>     layer1: Model[List[Floats2d], Floats2d],
>     layer2: Model[Floats2d, Floats2d]
> ) -> Model[List[Doc], Floats2d]:
>     model = chain(tok2vec, layer1, layer2)
>     return model
> ```

The Thinc `Model` class is a **generic type** that can specify its input and
output types. Python uses a square-bracket notation for this, so the type
~~Model[List, Dict]~~ says that each batch of inputs to the model will be a
list, and the outputs will be a dictionary. You can be even more specific and
write, for instance, ~~Model[List[Doc], Dict[str, float]]~~ to specify that the
model expects a list of [`Doc`](/api/doc) objects as input, and returns a
dictionary mapping strings to floats. Some of the most common types you'll
see are:

| Type | Description |
| ------------------ | ---------------------------------------------------------------------------------------------------- |
| ~~List[Doc]~~ | A batch of [`Doc`](/api/doc) objects. Most components expect their models to take this as input. |
| ~~Floats2d~~ | A two-dimensional `numpy` or `cupy` array of floats. Usually 32-bit. |
| ~~Ints2d~~ | A two-dimensional `numpy` or `cupy` array of integers. Common dtypes include uint64, int32 and int8. |
| ~~List[Floats2d]~~ | A list of two-dimensional arrays, generally with one array per `Doc` and one row per token. |
| ~~Ragged~~ | A container to handle variable-length sequence data in an unpadded contiguous array. |
| ~~Padded~~ | A container to handle variable-length sequence data in a padded contiguous array. |

See the [Thinc type reference](https://thinc.ai/docs/api-types) for details. The
model type signatures help you figure out which model architectures and
components can **fit together**. For instance, the
[`TextCategorizer`](/api/textcategorizer) class expects a model typed
~~Model[List[Doc], Floats2d]~~, because the model will predict one row of
category probabilities per [`Doc`](/api/doc). In contrast, the
[`Tagger`](/api/tagger) class expects a model typed ~~Model[List[Doc],
List[Floats2d]]~~, because it needs to predict one row of probabilities per
token.

There's no guarantee that two models with the same type signature can be used
interchangeably. There are many other ways they could be incompatible. However,
if the types don't match, they almost surely _won't_ be compatible. This little
bit of validation goes a long way, especially if you
[configure your editor](https://thinc.ai/docs/usage-type-checking) or other
tools to highlight these errors early. The config file is also validated at the
beginning of training, to verify that all the types match correctly.

<Accordion title="Tip: Static type checking in your editor">

If you're using a modern editor like Visual Studio Code, you can
[set up `mypy`](https://thinc.ai/docs/usage-type-checking#install) with the
custom Thinc plugin and get live feedback about mismatched types as you write
code.

[![](../images/thinc_mypy.jpg)](https://thinc.ai/docs/usage-type-checking#linting)

</Accordion>
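
The way signatures line up when layers are chained can be sketched with plain `typing` generics. The `Layer` class and `chain2` helper below are hypothetical stand-ins, not Thinc's API. They only illustrate that the output type of one layer has to match the input type of the next, which is the kind of mismatch a static type checker can report.

```python
from typing import Callable, Generic, List, TypeVar

InT = TypeVar("InT")
MidT = TypeVar("MidT")
OutT = TypeVar("OutT")


class Layer(Generic[InT, OutT]):
    """Hypothetical stand-in for a typed model: wraps a function InT -> OutT."""

    def __init__(self, func: Callable[[InT], OutT]) -> None:
        self.func = func

    def __call__(self, inputs: InT) -> OutT:
        return self.func(inputs)


def chain2(first: Layer[InT, MidT], second: Layer[MidT, OutT]) -> Layer[InT, OutT]:
    """Compose two layers; the middle type must agree on both sides."""
    return Layer(lambda inputs: second(first(inputs)))


# str -> List[str]: split a text into tokens.
tokenize: Layer[str, List[str]] = Layer(lambda text: text.split())
# List[str] -> List[int]: one number per token.
lengths: Layer[List[str], List[int]] = Layer(lambda tokens: [len(t) for t in tokens])
# List[int] -> int: reduce the per-token values to a single value.
total: Layer[List[int], int] = Layer(sum)

pipeline = chain2(chain2(tokenize, lengths), total)
result = pipeline("types fit together")
print(result)

# chain2(tokenize, total) would be flagged by a type checker such as mypy,
# because List[str] does not match the List[int] input that `total` expects.
```
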
## Swapping model architectures {#swap-architectures}
If no model is specified for the [`TextCategorizer`](/api/textcategorizer), the
[TextCatEnsemble](/api/architectures#TextCatEnsemble) architecture is used by
default. This architecture combines a simple bag-of-words model with a neural
network, which usually gives the most accurate results, but at the cost of
speed. The config file for this model would look something like this:
```ini
### config.cfg (excerpt)
[components.textcat]
factory = "textcat"
labels = []
[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null
[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false
[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = ${components.textcat.model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
depth = 2
[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
```
spaCy has two additional built-in `textcat` architectures, and you can easily
use those by swapping out the definition of the textcat's model. For instance,
to use the simple and fast bag-of-words model
[TextCatBOW](/api/architectures#TextCatBOW), you can change the config to:
```ini
### config.cfg (excerpt) {highlight="6-10"}
[components.textcat]
factory = "textcat"
labels = []
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null
```
For details on all pre-defined architectures shipped with spaCy and how to
configure them, check out the [model architectures](/api/architectures)
documentation.
### Defining sublayers {#sublayers}
Model architecture functions often accept **sublayers as arguments**, so that
you can try **substituting a different layer** into the network. Depending on
how the architecture function is structured, you might be able to define your
network structure entirely through the [config system](/usage/training#config),
using layers that have already been defined.

In most neural network models for NLP, the most important parts of the network
are what we refer to as the
[embed and encode](https://explosion.ai/blog/deep-learning-formula-nlp) steps.
These steps together compute dense, context-sensitive representations of the
tokens, and their combination forms a typical
[`Tok2Vec`](/api/architectures#Tok2Vec) layer:
```ini
### config.cfg (excerpt)
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
# ...
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
# ...
```
By defining these sublayers specifically, it becomes straightforward to swap out
a sublayer for another one, for instance changing the first sublayer to a
character embedding with the [CharacterEmbed](/api/architectures#CharacterEmbed)
architecture:
```ini
### config.cfg (excerpt)
[components.tok2vec.model.embed]
@architectures = "spacy.CharacterEmbed.v1"
# ...
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
# ...
```
Most of spaCy's default architectures accept a `tok2vec` layer as a sublayer
within the larger task-specific neural network. This makes it easy to **switch
between** transformer, CNN, BiLSTM or other feature extraction approaches. The
[transformers documentation](/usage/embeddings-transformers#training-custom-model)
section shows an example of swapping out a model's standard `tok2vec` layer with
a transformer. And if you want to define your own solution, all you need to do
is register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and
you'll be able to try it out in any of the spaCy components.
## Wrapping PyTorch, TensorFlow and other frameworks {#frameworks}
Thinc allows you to [wrap models](https://thinc.ai/docs/usage-frameworks)
written in other machine learning frameworks like PyTorch, TensorFlow and MXNet
using a unified [`Model`](https://thinc.ai/docs/api-model) API. This makes it
easy to use a model implemented in a different framework to power a component in
your spaCy pipeline. For example, to wrap a PyTorch model as a Thinc `Model`,
you can use Thinc's
[`PyTorchWrapper`](https://thinc.ai/docs/api-layers#pytorchwrapper):
```python
from thinc.api import PyTorchWrapper
wrapped_pt_model = PyTorchWrapper(torch_model)
```
Let's use PyTorch to define a very simple neural network consisting of two
hidden `Linear` layers with `ReLU` activation and dropout, and a
softmax-activated output layer:
```python
### PyTorch model
from torch import nn

torch_model = nn.Sequential(
    nn.Linear(width, hidden_width),
    nn.ReLU(),
    nn.Dropout2d(dropout),
    nn.Linear(hidden_width, nO),
    nn.ReLU(),
    nn.Dropout2d(dropout),
    nn.Softmax(dim=1)
)
```
The resulting wrapped `Model` can be used as a **custom architecture** as is,
or can be a **subcomponent of a larger model**. For instance, we can use Thinc's
[`chain`](https://thinc.ai/docs/api-layers#chain) combinator, which works like
`Sequential` in PyTorch, to combine the wrapped model with other components in a
larger network. This effectively means that you can easily wrap different
components from different frameworks, and "glue" them together with Thinc:
```python
from thinc.api import chain, with_array, PyTorchWrapper
from spacy.ml import CharacterEmbed
wrapped_pt_model = PyTorchWrapper(torch_model)
char_embed = CharacterEmbed(width, embed_size, nM, nC)
model = chain(char_embed, with_array(wrapped_pt_model))
```
In the above example, we have combined our custom PyTorch model with a character
embedding layer defined by spaCy.
[CharacterEmbed](/api/architectures#CharacterEmbed) returns a `Model` that takes
a ~~List[Doc]~~ as input, and outputs a ~~List[Floats2d]~~. To make sure that
the wrapped PyTorch model receives valid inputs, we use Thinc's
[`with_array`](https://thinc.ai/docs/api-layers#with_array) helper.
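
Conceptually, a helper like `with_array` has to bridge between a list of per-document arrays and the single array that the wrapped network consumes. The sketch below shows that idea with plain Python lists instead of real arrays. `with_array_sketch` is a hypothetical name and this is not Thinc's implementation, which also takes care of backpropagation and supports other input types.

```python
from typing import Callable, List

Row = List[float]
Array2d = List[Row]


def with_array_sketch(
    layer: Callable[[Array2d], Array2d]
) -> Callable[[List[Array2d]], List[Array2d]]:
    """Wrap a layer that works on one 2D array so it accepts a list of 2D arrays."""

    def forward(arrays: List[Array2d]) -> List[Array2d]:
        # Remember how many rows (tokens) each document contributed.
        lengths = [len(array) for array in arrays]
        # Concatenate all rows into one big array.
        flat = [row for array in arrays for row in array]
        outputs = layer(flat)
        # Split the result back into one array per document.
        result, start = [], 0
        for length in lengths:
            result.append(outputs[start:start + length])
            start += length
        return result

    return forward


def double(rows: Array2d) -> Array2d:
    # Stand-in for the wrapped PyTorch model: any row-wise transformation.
    return [[2 * value for value in row] for row in rows]


docs = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]  # two "docs" with 2 and 1 tokens
outputs = with_array_sketch(double)(docs)
print(outputs)
```

The wrapped layer never sees the document boundaries, while the caller still gets one output array per document.
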

You could also implement a model that only uses PyTorch for the transformer
layers, and "native" Thinc layers to do fiddly input and output transformations
and add on task-specific "heads", as efficiency is less of a consideration for
those parts of the network.
### Using wrapped models {#frameworks-usage}
To use our custom model including the PyTorch subnetwork, all we need to do is
register the architecture using the
[`architectures` registry](/api/top-level#registry). This assigns the
architecture a name so spaCy knows how to find it, and allows passing in
arguments like hyperparameters via the [config](/usage/training#config). The
full example then becomes:
```python
### Registering the architecture {highlight="9"}
from typing import List
from thinc.types import Floats2d
from thinc.api import Model, PyTorchWrapper, chain, with_array
import spacy
from spacy.tokens.doc import Doc
from spacy.ml import CharacterEmbed
from torch import nn

@spacy.registry.architectures("CustomTorchModel.v1")
def create_torch_model(
    nO: int,
    width: int,
    hidden_width: int,
    embed_size: int,
    nM: int,
    nC: int,
    dropout: float,
) -> Model[List[Doc], List[Floats2d]]:
    char_embed = CharacterEmbed(width, embed_size, nM, nC)
    torch_model = nn.Sequential(
        nn.Linear(width, hidden_width),
        nn.ReLU(),
        nn.Dropout2d(dropout),
        nn.Linear(hidden_width, nO),
        nn.ReLU(),
        nn.Dropout2d(dropout),
        nn.Softmax(dim=1)
    )
    wrapped_pt_model = PyTorchWrapper(torch_model)
    model = chain(char_embed, with_array(wrapped_pt_model))
    return model
```
The model definition can now be used in any existing trainable spaCy component,
by specifying it in the config file. In this configuration, all required
parameters for the various subcomponents of the custom architecture are passed
in as settings via the config.
```ini
### config.cfg (excerpt) {highlight="5-5"}
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "CustomTorchModel.v1"
nO = 50
width = 96
hidden_width = 48
embed_size = 2000
nM = 64
nC = 8
dropout = 0.2
```
<Infobox variant="warning">

Remember that it is best not to rely on any (hidden) default values, to ensure
that training configs are complete and experiments fully reproducible.

</Infobox>

Note that when using a PyTorch or TensorFlow model, it is recommended to set the
GPU memory allocator accordingly. When `gpu_allocator` is set to "pytorch" or
"tensorflow" in the training config, cupy will allocate memory via those
respective libraries, preventing OOM errors when there's available memory
sitting in the other library's pool.
```ini
### config.cfg (excerpt)
[training]
gpu_allocator = "pytorch"
```
## Custom models with Thinc {#thinc}
Of course it's also possible to define the `Model` from the previous section
entirely in Thinc. The Thinc documentation provides details on the
[various layers](https://thinc.ai/docs/api-layers) and helper functions
available. Combinators can be used to
[overload operators](https://thinc.ai/docs/usage-models#operators) and a common
usage pattern is to bind `chain` to `>>`. The "native" Thinc version of our
simple neural network would then become:
```python
from thinc.api import chain, with_array, Model, Relu, Dropout, Softmax
from spacy.ml import CharacterEmbed

char_embed = CharacterEmbed(width, embed_size, nM, nC)
with Model.define_operators({">>": chain}):
    layers = (
        Relu(hidden_width, width)
        >> Dropout(dropout)
        >> Relu(hidden_width, hidden_width)
        >> Dropout(dropout)
        >> Softmax(nO, hidden_width)
    )
    model = char_embed >> with_array(layers)
```
<Infobox variant="warning" title="Important note on inputs and outputs">

Note that Thinc layers define the output dimension (`nO`) as the first argument,
followed (optionally) by the input dimension (`nI`). This is in contrast to how
the PyTorch layers are defined, where `in_features` precedes `out_features`.

</Infobox>

### Shape inference in Thinc {#thinc-shape-inference}
It is **not** strictly necessary to define all the input and output dimensions
for each layer, as Thinc can perform
[shape inference](https://thinc.ai/docs/usage-models#validation) between
sequential layers by matching up the output dimensionality of one layer to the
input dimensionality of the next. This means that we can simplify the `layers`
definition:
> #### Diff
>
> ```diff
> layers = (
>     Relu(hidden_width, width)
>     >> Dropout(dropout)
> -   >> Relu(hidden_width, hidden_width)
> +   >> Relu(hidden_width)
>     >> Dropout(dropout)
> -   >> Softmax(nO, hidden_width)
> +   >> Softmax(nO)
> )
> ```

```python
with Model.define_operators({">>": chain}):
    layers = (
        Relu(hidden_width, width)
        >> Dropout(dropout)
        >> Relu(hidden_width)
        >> Dropout(dropout)
        >> Softmax(nO)
    )
```
Thinc can even go one step further and **deduce the correct input dimension** of
the first layer, and output dimension of the last. To enable this functionality,
you have to call
[`Model.initialize`](https://thinc.ai/docs/api-model#initialize) with an **input
sample** `X` and an **output sample** `Y` with the correct dimensions:
```python
### Shape inference with initialization {highlight="3,7,10"}
with Model.define_operators({">>": chain}):
    layers = (
        Relu(hidden_width)
        >> Dropout(dropout)
        >> Relu(hidden_width)
        >> Dropout(dropout)
        >> Softmax()
    )
    model = char_embed >> with_array(layers)
    model.initialize(X=input_sample, Y=output_sample)
```
The built-in [pipeline components](/usage/processing-pipelines) in spaCy ensure
that their internal models are **always initialized** with appropriate sample
data. In this case, `X` is typically a ~~List[Doc]~~, while `Y` is typically a
~~List[Array1d]~~ or ~~List[Array2d]~~, depending on the specific task. This
functionality is triggered when [`nlp.initialize`](/api/language#initialize) is
called.
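
The idea behind shape inference can be illustrated without any framework. In the sketch below, `LayerSpec` and `infer_shapes` are hypothetical names, and the logic is a simplification that assumes strictly sequential layers: a missing input width is copied from the previous layer's output width, and the outermost dimensions are read off the sample data. Dropout layers are left out because they do not change the width.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayerSpec:
    """Hypothetical layer description; None marks a dimension that is still unknown."""

    name: str
    nO: Optional[int] = None
    nI: Optional[int] = None


def infer_shapes(
    layers: List[LayerSpec],
    sample_X: List[List[float]],
    sample_Y: List[List[float]],
) -> None:
    """Fill in missing dimensions in place, front to back."""
    # The first input width comes from the width of an input sample row.
    if layers[0].nI is None:
        layers[0].nI = len(sample_X[0])
    # The last output width comes from the width of an output sample row.
    if layers[-1].nO is None:
        layers[-1].nO = len(sample_Y[0])
    # Each remaining input width equals the previous layer's output width.
    for previous, current in zip(layers, layers[1:]):
        if current.nI is None:
            current.nI = previous.nO


hidden_width = 48
layers = [
    LayerSpec("relu1", nO=hidden_width),
    LayerSpec("relu2", nO=hidden_width),
    LayerSpec("softmax"),
]
sample_X = [[0.0] * 96]  # one input row with 96 features
sample_Y = [[0.0] * 50]  # one output row with 50 classes
infer_shapes(layers, sample_X, sample_Y)
for layer in layers:
    print(layer.name, layer.nI, layer.nO)
```
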
### Dropout and normalization in Thinc {#thinc-dropout-norm}
Many of the available Thinc [layers](https://thinc.ai/docs/api-layers) allow you
to define a `dropout` argument that will result in "chaining" an additional
[`Dropout`](https://thinc.ai/docs/api-layers#dropout) layer. Optionally, you can
often specify whether or not you want to add layer normalization, which would
result in an additional
[`LayerNorm`](https://thinc.ai/docs/api-layers#layernorm) layer. That means that
the following `layers` definition is equivalent to the previous:
```python
with Model.define_operators({">>": chain}):
    layers = (
        Relu(hidden_width, dropout=dropout, normalize=False)
        >> Relu(hidden_width, dropout=dropout, normalize=False)
        >> Softmax()
    )
    model = char_embed >> with_array(layers)
    model.initialize(X=input_sample, Y=output_sample)
```
## Create new trainable components {#components}
In addition to [swapping out](#swap-architectures) default models in built-in
components, you can also implement an entirely new,
[trainable](/usage/processing-pipelines#trainable-components) pipeline component
from scratch. This can be done by creating a new class inheriting from
[`TrainablePipe`](/api/pipe), and linking it up to your custom model
implementation.

<Infobox title="Trainable component API" emoji="💡">

For details on how to implement pipeline components, check out the usage guide
on [custom components](/usage/processing-pipelines#custom-component) and the
overview of the `TrainablePipe` methods used by
[trainable components](/usage/processing-pipelines#trainable-components).

</Infobox>

### Example: Entity relation extraction component {#component-rel}

This section outlines an example use-case of implementing a **novel relation
extraction component** from scratch. We'll implement a binary relation
extraction method that determines whether or not **two entities** in a document
are related, and if so, what type of relation. We'll allow multiple types of
relations between two such entities (multi-label setting). There are two major
steps required:
1. Implement a [machine learning model](#component-rel-model) specific to this
task. It will have to extract candidates from a [`Doc`](/api/doc) and predict
a relation for the available candidate pairs.
2. Implement a custom [pipeline component](#component-rel-pipe) powered by the
machine learning model that sets annotations on the [`Doc`](/api/doc) passing
through the pipeline.
<!-- TODO: <Project id="tutorials/ner-relations">
</Project> -->
#### Step 1: Implementing the Model {#component-rel-model}
We need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes a
**list of documents** (~~List[Doc]~~) as input, and outputs a **two-dimensional
matrix** (~~Floats2d~~) of predictions:
> #### Model type annotations
>
> The `Model` class is a generic type that can specify its input and output
> types, e.g. ~~Model[List[Doc], Floats2d]~~. Type hints are used for static
> type checks and validation. See the section on [type signatures](#type-sigs)
> for details.
```python
### Register the model architecture
@registry.architectures.register("rel_model.v1")
def create_relation_model(...) -> Model[List[Doc], Floats2d]:
    model = ...  # 👈 model will go here
    return model
```
The first layer in this model will typically be an
[embedding layer](/usage/embeddings-transformers) such as a
[`Tok2Vec`](/api/tok2vec) component or a [`Transformer`](/api/transformer). This
layer is assumed to be of type ~~Model[List[Doc], List[Floats2d]]~~ as it
transforms each **document into a list of tokens**, with each token being
represented by its embedding in the vector space.

Next, we need a method that **generates pairs of entities** that we want to
classify as being related or not. As these candidate pairs are typically formed
within one document, this function takes a [`Doc`](/api/doc) as input and
outputs a `List` of `Span` tuples. For instance, a very straightforward
implementation would be to just take any two entities from the same document:
```python
### Simple candidate generation
from typing import List, Tuple
from spacy.tokens import Doc, Span

def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]:
    candidates = []
    for ent1 in doc.ents:
        for ent2 in doc.ents:
            candidates.append((ent1, ent2))
    return candidates
```
But we could also refine this further by **excluding relations** of an entity
with itself, and imposing a **maximum distance** (in number of tokens) between two
entities. We register this function in the
[`@misc` registry](/api/top-level#registry) so we can refer to it from the
config, and easily swap it out for any other candidate generation function.
> #### config.cfg (excerpt)
>
> ```ini
> [model]
> @architectures = "rel_model.v1"
>
> [model.tok2vec]
> # ...
>
> [model.get_candidates]
> @misc = "rel_cand_generator.v1"
> max_length = 20
> ```

```python
### Extended candidate generation {highlight="1,2,7,8"}
@registry.misc.register("rel_cand_generator.v1")
def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
        candidates = []
        for ent1 in doc.ents:
            for ent2 in doc.ents:
                if ent1 != ent2:
                    if max_length and abs(ent2.start - ent1.start) <= max_length:
                        candidates.append((ent1, ent2))
        return candidates
    return get_candidates
```
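
To see the effect of the two filters, the sketch below runs the same logic on stand-in objects. `FakeSpan` and `FakeDoc` are hypothetical substitutes for spaCy's `Span` and `Doc` that only carry the attributes the function reads, and `create_candidate_generator` is an unregistered copy of the factory, so the example runs without a pipeline.

```python
from typing import Callable, List, NamedTuple, Tuple


class FakeSpan(NamedTuple):
    """Hypothetical stand-in for a spaCy Span: text plus start token offset."""

    text: str
    start: int


class FakeDoc(NamedTuple):
    """Hypothetical stand-in for a spaCy Doc that only exposes entities."""

    ents: Tuple[FakeSpan, ...]


def create_candidate_generator(
    max_length: int,
) -> Callable[[FakeDoc], List[Tuple[FakeSpan, FakeSpan]]]:
    def get_candidates(doc: FakeDoc) -> List[Tuple[FakeSpan, FakeSpan]]:
        candidates = []
        for ent1 in doc.ents:
            for ent2 in doc.ents:
                # Skip self-pairs and pairs that are too far apart.
                if ent1 != ent2:
                    if max_length and abs(ent2.start - ent1.start) <= max_length:
                        candidates.append((ent1, ent2))
        return candidates

    return get_candidates


doc = FakeDoc(ents=(FakeSpan("Alice", 0), FakeSpan("Acme", 5), FakeSpan("Berlin", 40)))
get_candidates = create_candidate_generator(max_length=20)
pairs = [(ent1.text, ent2.text) for ent1, ent2 in get_candidates(doc)]
print(pairs)
```

Note that each surviving pair appears in both directions, which matters if the relation types are not symmetric.
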
Finally, we require a method that transforms the candidate entity pairs into a
2D tensor using the specified [`Tok2Vec`](/api/tok2vec) or
[`Transformer`](/api/transformer). The resulting ~~Floats2d~~ object will then be
processed by a final `output_layer` of the network. Putting all this together,
we can define our relation model in a config file like this:
```ini
### config.cfg
[model]
@architectures = "rel_model.v1"
# ...

[model.tok2vec]
# ...

[model.get_candidates]
@misc = "rel_cand_generator.v1"
max_length = 20

[model.create_candidate_tensor]
@misc = "rel_cand_tensor.v1"

[model.output_layer]
@architectures = "rel_output_layer.v1"
# ...
```
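
The text does not spell out how `rel_cand_tensor.v1` builds its tensor. One plausible approach, shown here purely as an assumption, is to mean-pool the token vectors of each entity and concatenate the two pooled vectors into one row per candidate pair. The names `mean_pool` and `create_candidate_tensor_sketch` are hypothetical, and plain lists stand in for arrays.

```python
from typing import List, Tuple

Vector = List[float]
SpanOffsets = Tuple[int, int]  # (start, end) token offsets, end exclusive


def mean_pool(token_vectors: List[Vector], span: SpanOffsets) -> Vector:
    """Average the vectors of all tokens inside the span."""
    start, end = span
    rows = token_vectors[start:end]
    return [sum(column) / len(rows) for column in zip(*rows)]


def create_candidate_tensor_sketch(
    token_vectors: List[Vector],
    candidates: List[Tuple[SpanOffsets, SpanOffsets]],
) -> List[Vector]:
    """Build one row per candidate pair: pooled entity 1 followed by pooled entity 2."""
    return [
        mean_pool(token_vectors, ent1) + mean_pool(token_vectors, ent2)
        for ent1, ent2 in candidates
    ]


# Four tokens with 2-dimensional vectors, as a tok2vec layer might produce for one doc.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
# One candidate pair: the entity covering tokens 0-1 and the entity covering token 2.
candidates = [((0, 2), (2, 3))]
tensor = create_candidate_tensor_sketch(token_vectors, candidates)
print(tensor)
```

With this layout, the output layer receives rows that are twice as wide as the token vectors, one row per candidate pair.
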
<!-- TODO: link to project for implementation details -->
<!-- TODO: maybe embed files from project that show the architectures? -->
When creating this model, we store the custom functions as
[attributes](https://thinc.ai/docs/api-model#properties) and the sublayers as
references, so we can access them easily:
```python
tok2vec_layer = model.get_ref("tok2vec")
output_layer = model.get_ref("output_layer")
create_candidate_tensor = model.attrs["create_candidate_tensor"]
get_candidates = model.attrs["get_candidates"]
```
#### Step 2: Implementing the pipeline component {#component-rel-pipe}
To use our new relation extraction model as part of a custom
[trainable component](/usage/processing-pipelines#trainable-components), we
create a subclass of [`TrainablePipe`](/api/pipe) that holds the model.

![Illustration of Pipe methods](../images/trainable_component.svg)

```python
### Pipeline component skeleton
from spacy.pipeline import TrainablePipe

class RelationExtractor(TrainablePipe):
    def __init__(self, vocab, model, name="rel"):
        """Create a component instance."""
        self.model = model
        self.vocab = vocab
        self.name = name

    def update(self, examples, drop=0.0, set_annotations=False, sgd=None, losses=None):
        """Learn from a batch of Example objects."""
        ...

    def predict(self, docs):
        """Apply the model to a batch of Doc objects."""
        ...

    def set_annotations(self, docs, predictions):
        """Modify a batch of Doc objects using the predictions."""
        ...

    def initialize(self, get_examples, nlp=None, labels=None):
        """Initialize the model before training."""
        ...

    def add_label(self, label):
        """Add a label to the component."""
        ...
```

Before the model can be used, it needs to be
[initialized](/usage/training#initialization). The `initialize` method receives a
callback to access the full **training data set**, or a representative sample.
This data set can be used to deduce all **relevant labels**. Alternatively, a
list of labels can be provided to `initialize`, or you can call
`RelationExtractor.add_label` directly. The number of labels defines the output
dimensionality of the network, and will be used to do
[shape inference](https://thinc.ai/docs/usage-models#validation) throughout the
layers of the neural network. This is triggered by calling
[`Model.initialize`](https://thinc.ai/docs/api-model#initialize).

```python
### The initialize method {highlight="12,18,22"}
from itertools import islice

def initialize(
    self,
    get_examples: Callable[[], Iterable[Example]],
    *,
    nlp: Language = None,
    labels: Optional[List[str]] = None,
):
    if labels is not None:
        for label in labels:
            self.add_label(label)
    else:
        for example in get_examples():
            relations = example.reference._.rel
            for indices, label_dict in relations.items():
                for label in label_dict.keys():
                    self.add_label(label)
    subbatch = list(islice(get_examples(), 10))
    doc_sample = [eg.reference for eg in subbatch]
    label_sample = self._examples_to_truth(subbatch)
    self.model.initialize(X=doc_sample, Y=label_sample)
```

The `initialize` method is triggered whenever this component is part of an `nlp`
pipeline, and [`nlp.initialize`](/api/language#initialize) is invoked.
Typically, this happens when the pipeline is set up before training in
[`spacy train`](/api/cli#training). After initialization, the pipeline component
and its internal model can be trained and used to make predictions.
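
The label-collection branch of `initialize` can be tried out in isolation. The sketch below assumes that the gold-standard relations are stored as a dictionary from entity offset pairs to label dictionaries, which is the structure the loop in the method above iterates over. `collect_labels` and the sample data are hypothetical, and plain dictionaries stand in for `Example` objects.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, List, Tuple

Relations = Dict[Tuple[int, int], Dict[str, float]]


def collect_labels(get_examples: Callable[[], Iterable[Relations]]) -> List[str]:
    """Gather every label that occurs in the data, keeping first-seen order."""
    labels: List[str] = []
    for relations in get_examples():
        for label_dict in relations.values():
            for label in label_dict:
                if label not in labels:
                    labels.append(label)
    return labels


def get_examples() -> Iterable[Relations]:
    # Hypothetical gold-standard relations for three documents.
    yield {(0, 5): {"WORKS_FOR": 1.0, "LOCATED_IN": 0.0}}
    yield {(2, 9): {"FOUNDED": 1.0}}
    yield {}


labels = collect_labels(get_examples)
# As in the method above, a small sample is enough for shape inference.
sample = list(islice(get_examples(), 2))
print(labels, len(sample))
```
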

During training, the [`update`](/api/pipe#update) method is invoked, which
delegates to
[`Model.begin_update`](https://thinc.ai/docs/api-model#begin_update) and a
[`get_loss`](/api/pipe#get_loss) function that **calculates the loss** for a
batch of examples, as well as the **gradient** of the loss that will be used to
update the weights of the model layers. Thinc provides several
[loss functions](https://thinc.ai/docs/api-loss) that can be used for the
implementation of the `get_loss` function.

```python
### The update method {highlight="12-14"}
def update(
    self,
    examples: Iterable[Example],
    *,
    drop: float = 0.0,
    set_annotations: bool = False,
    sgd: Optional[Optimizer] = None,
    losses: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    ...
    docs = [ex.predicted for ex in examples]
    predictions, backprop = self.model.begin_update(docs)
    loss, gradient = self.get_loss(examples, predictions)
    backprop(gradient)
    losses[self.name] += loss
    ...
    return losses
```
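
The `get_loss` call in `update` has to return both a scalar loss and a gradient
with the same shape as the predictions. As a minimal sketch, assuming a squared
error objective and using nested Python lists in place of Thinc arrays, it could
look like the following. The standalone function `mse_loss` and the sample
values are illustrative assumptions, not the component's actual implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def mse_loss(predictions: Matrix, truths: Matrix) -> Tuple[float, Matrix]:
    """Return the loss (squared error summed per row, averaged over rows)
    and the gradient that would be passed to the backprop callback."""
    # The difference between predictions and truths serves as the gradient.
    # It equals the exact derivative of the loss up to a constant factor and
    # has the same shape as the predictions.
    gradient = [
        [p - t for p, t in zip(pred_row, truth_row)]
        for pred_row, truth_row in zip(predictions, truths)
    ]
    # Sum the squared differences within each row, then average over rows.
    row_errors = [sum(g * g for g in row) for row in gradient]
    loss = sum(row_errors) / len(row_errors)
    return loss, gradient


# One row per candidate pair, one column per label (made-up values).
predictions = [[0.75, 0.25], [0.5, 0.5]]
truths = [[1.0, 0.0], [0.0, 1.0]]
loss, gradient = mse_loss(predictions, truths)
```

In a real component, the same arithmetic would typically be done on the arrays
returned by the model, or be replaced by one of Thinc's loss functions.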

Once the internal model is trained, the component can be used to make novel
**predictions**. The [`predict`](/api/pipe#predict) function needs to be
implemented for each subclass of `TrainablePipe`. In our case, we can simply
delegate to the internal model's
[`predict`](https://thinc.ai/docs/api-model#predict) function, which takes a
batch of `Doc` objects and returns a ~~Floats2d~~ array:

```python
### The predict method
def predict(self, docs: Iterable[Doc]) -> Floats2d:
    predictions = self.model.predict(docs)
    return self.model.ops.asarray(predictions)
```

The final method that needs to be implemented is
[`set_annotations`](/api/pipe#set_annotations). This function takes the
predictions and modifies the given `Doc` objects in place to store them. For our
relation extraction component, we store the data as a dictionary in a custom
[extension attribute](/usage/processing-pipelines#custom-components-attributes)
`doc._.rel`. As keys, we represent the candidate pair by the **start offsets of
each entity**, as this defines an entity pair uniquely within one document.

To interpret the scores predicted by the relation extraction model correctly, we
need to refer to the model's `get_candidates` function that defined which pairs
of entities were relevant candidates, so that the predictions can be linked to
those exact entities:

> #### Example output
>
> ```python
> doc = nlp("Amsterdam is the capital of the Netherlands.")
> print("spans", [(e.start, e.text, e.label_) for e in doc.ents])
> for value, rel_dict in doc._.rel.items():
>     print(f"{value}: {rel_dict}")
>
> # spans [(0, 'Amsterdam', 'LOC'), (6, 'Netherlands', 'LOC')]
> # (0, 6): {'CAPITAL_OF': 0.89, 'LOCATED_IN': 0.75, 'UNRELATED': 0.002}
> # (6, 0): {'CAPITAL_OF': 0.01, 'LOCATED_IN': 0.13, 'UNRELATED': 0.017}
> ```

```python
### Registering the extension attribute
from spacy.tokens import Doc
Doc.set_extension("rel", default={})
```

```python
### The set_annotations method {highlight="5-6,10"}
def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d):
    c = 0
    get_candidates = self.model.attrs["get_candidates"]
    for doc in docs:
        for (e1, e2) in get_candidates(doc):
            offset = (e1.start, e2.start)
            if offset not in doc._.rel:
                doc._.rel[offset] = {}
            for j, label in enumerate(self.labels):
                doc._.rel[offset][label] = predictions[c, j]
            c += 1
```
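
The counter `c` works because the model produces one row of scores per candidate
pair, in the order in which `get_candidates` yields the pairs. The following
sketch shows this alignment without spaCy, assuming a hypothetical candidate
generator that pairs all distinct entities whose start offsets are at most
`max_length` tokens apart. The `Entity` tuple and both helper functions are
stand-ins for illustration only.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entity(NamedTuple):
    """Minimal stand-in for an entity span: start offset and text only."""

    start: int
    text: str


def get_candidates(
    ents: List[Entity], max_length: int = 20
) -> List[Tuple[Entity, Entity]]:
    """Pair every entity with every other entity within max_length tokens."""
    return [
        (e1, e2)
        for e1 in ents
        for e2 in ents
        if e1 != e2 and abs(e2.start - e1.start) <= max_length
    ]


def annotate(
    ents: List[Entity], labels: List[str], predictions: List[List[float]]
) -> Dict[Tuple[int, int], Dict[str, float]]:
    """Link row c of the predictions to the c-th candidate pair."""
    rel: Dict[Tuple[int, int], Dict[str, float]] = {}
    for c, (e1, e2) in enumerate(get_candidates(ents)):
        offset = (e1.start, e2.start)
        rel[offset] = {label: predictions[c][j] for j, label in enumerate(labels)}
    return rel


ents = [Entity(0, "Amsterdam"), Entity(6, "Netherlands")]
labels = ["CAPITAL_OF", "LOCATED_IN"]
# Row 0 belongs to the pair (0, 6), row 1 to the pair (6, 0).
predictions = [[0.89, 0.75], [0.01, 0.13]]
rel = annotate(ents, labels, predictions)
```

Because the pairs are directed, `(0, 6)` and `(6, 0)` are separate keys with
their own scores.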

Under the hood, when the pipe is applied to a document, it delegates to the
`predict` and `set_annotations` methods:

```python
### The __call__ method
def __call__(self, doc: Doc) -> Doc:
    predictions = self.predict([doc])
    self.set_annotations([doc], predictions)
    return doc
```

There is one more optional method to implement: [`score`](/api/pipe#score)
calculates the performance of your component on a set of examples, and
returns the results as a dictionary:

```python
### The score method
from spacy.scorer import PRFScore

def score(self, examples: Iterable[Example]) -> Dict[str, Any]:
    prf = PRFScore()
    for example in examples:
        ...
    return {
        "rel_micro_p": prf.precision,
        "rel_micro_r": prf.recall,
        "rel_micro_f": prf.fscore,
    }
```

This is particularly useful to see the scores on the development corpus
when training the component with [`spacy train`](/api/cli#training).
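
The body of the loop is elided in the `score` snippet, and how it fills the
score object depends on the task. As one possible sketch, assuming a prediction
counts as positive when its score reaches a threshold of 0.5, micro-averaged
precision, recall and F-score can be derived from the counts of true positives,
false positives and false negatives. The function `micro_prf`, the threshold
and the data layout are assumptions for illustration. In the component itself,
the `PRFScore` object takes over this bookkeeping.

```python
from typing import Dict, List, Tuple

Relations = Dict[Tuple[int, int], Dict[str, float]]


def micro_prf(
    gold: List[Relations], pred: List[Relations], threshold: float = 0.5
) -> Dict[str, float]:
    """Count TP/FP/FN over all documents, pairs and labels, then derive P/R/F.

    Simplification: gold pairs that are missing from the predictions are
    not counted here.
    """
    tp = fp = fn = 0
    for gold_rel, pred_rel in zip(gold, pred):
        for pair, scores in pred_rel.items():
            gold_labels = gold_rel.get(pair, {})
            for label, score in scores.items():
                is_gold = gold_labels.get(label, 0.0) == 1.0
                is_pred = score >= threshold
                if is_pred and is_gold:
                    tp += 1
                elif is_pred:
                    fp += 1
                elif is_gold:
                    fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    fscore = 2 * precision * recall / denominator if denominator else 0.0
    return {"rel_micro_p": precision, "rel_micro_r": recall, "rel_micro_f": fscore}


# Made-up gold annotations and predicted scores for a single document.
gold = [
    {
        (0, 6): {"CAPITAL_OF": 1.0, "LOCATED_IN": 1.0},
        (6, 0): {"CAPITAL_OF": 0.0, "LOCATED_IN": 0.0},
    }
]
pred = [
    {
        (0, 6): {"CAPITAL_OF": 0.89, "LOCATED_IN": 0.4},
        (6, 0): {"CAPITAL_OF": 0.6, "LOCATED_IN": 0.13},
    }
]
scores = micro_prf(gold, pred)
```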

Once our `TrainablePipe` subclass is fully implemented, we can
[register](/usage/processing-pipelines#custom-components-factories) the
component with the [`@Language.factory`](/api/language#factory) decorator. This
assigns it a name and lets you create the component with
[`nlp.add_pipe`](/api/language#add_pipe) and via the
[config](/usage/training#config).

> #### config.cfg (excerpt)
>
> ```ini
> [components.relation_extractor]
> factory = "relation_extractor"
>
> [components.relation_extractor.model]
> @architectures = "rel_model.v1"
>
> [components.relation_extractor.model.tok2vec]
> # ...
>
> [components.relation_extractor.model.get_candidates]
> @misc = "rel_cand_generator.v1"
> max_length = 20
>
> [training.score_weights]
> rel_micro_p = 0.0
> rel_micro_r = 0.0
> rel_micro_f = 1.0
> ```

```python
### Registering the pipeline component
from spacy.language import Language

@Language.factory("relation_extractor")
def make_relation_extractor(nlp, name, model):
    return RelationExtractor(nlp.vocab, model, name)
```

You can extend the decorator to include information such as the type of
annotations that are required for this component to run, the type of annotations
it produces, and the scores that can be calculated:

```python
### Factory annotations {highlight="5-11"}
from spacy.language import Language

@Language.factory(
    "relation_extractor",
    requires=["doc.ents", "token.ent_iob", "token.ent_type"],
    assigns=["doc._.rel"],
    default_score_weights={
        "rel_micro_p": None,
        "rel_micro_r": None,
        "rel_micro_f": None,
    },
)
def make_relation_extractor(nlp, name, model):
    return RelationExtractor(nlp.vocab, model, name)
```

<!-- TODO: <Project id="tutorials/ner-relations">

</Project> -->