various small fixes

svlandeg 2020-10-05 01:05:37 +02:00
parent 52b660e9dc
commit 9a6c9b133b
1 changed files with 74 additions and 68 deletions


@ -288,7 +288,7 @@ those parts of the network.
To use our custom model including the PyTorch subnetwork, all we need to do is To use our custom model including the PyTorch subnetwork, all we need to do is
register the architecture using the register the architecture using the
[`architectures` registry](/api/top-level#registry). This will assign the [`architectures` registry](/api/top-level#registry). This assigns the
architecture a name so spaCy knows how to find it, and allows passing in architecture a name so spaCy knows how to find it, and allows passing in
arguments like hyperparameters via the [config](/usage/training#config). The arguments like hyperparameters via the [config](/usage/training#config). The
full example then becomes: full example then becomes:
@ -488,27 +488,27 @@ with Model.define_operators({">>": chain}):
In addition to [swapping out](#swap-architectures) default models in built-in In addition to [swapping out](#swap-architectures) default models in built-in
components, you can also implement an entirely new, components, you can also implement an entirely new,
[trainable pipeline component](usage/processing-pipelines#trainable-components) [trainable pipeline component](/usage/processing-pipelines#trainable-components)
from scratch. This can be done by creating a new class inheriting from from scratch. This can be done by creating a new class inheriting from
[`Pipe`](/api/pipe), and linking it up to your custom model implementation. [`Pipe`](/api/pipe), and linking it up to your custom model implementation.
### Example: Pipeline component for relation extraction {#component-rel} ### Example: Pipeline component for relation extraction {#component-rel}
This section outlines an example use-case of implementing a novel relation This section outlines an example use-case of implementing a novel relation
extraction component from scratch. We assume we want to implement a binary extraction component from scratch. We'll implement a binary relation extraction
relation extraction method that determines whether two entities in a document method that determines whether or not two entities in a document are related,
are related or not, and if so, with what type of relation. We'll allow multiple and if so, what type of relation. We'll allow multiple types of relations
types of relations between two such entities - i.e. it is a multi-label setting. between two such entities (multi-label setting).
There are two major steps required: first, we need to There are two major steps required: first, we need to
[implement a machine learning model](#component-rel-model) specific to this [implement a machine learning model](#component-rel-model) specific to this
task, and then we'll use this model to task, and subsequently we use this model to
[implement a custom pipeline component](#component-rel-pipe). [implement a custom pipeline component](#component-rel-pipe).
#### Step 1: Implementing the Model {#component-rel-model} #### Step 1: Implementing the Model {#component-rel-model}
We'll need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes We need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes a
a list of documents as input, and outputs a two-dimensional matrix of scores: list of documents as input, and outputs a two-dimensional matrix of predictions:
```python ```python
@registry.architectures.register("rel_model.v1") @registry.architectures.register("rel_model.v1")
@ -519,17 +519,16 @@ def create_relation_model(...) -> Model[List[Doc], Floats2d]:
The first layer in this model will typically be an The first layer in this model will typically be an
[embedding layer](/usage/embeddings-transformers) such as a [embedding layer](/usage/embeddings-transformers) such as a
[`Tok2Vec`](/api/tok2vec) component or [`Transformer`](/api/transformer). This [`Tok2Vec`](/api/tok2vec) component or a [`Transformer`](/api/transformer). This
layer is assumed to be of type `Model[List["Doc"], List[Floats2d]]` as it layer is assumed to be of type ~~Model[List[Doc], List[Floats2d]]~~ as it
transforms each document into a list of tokens, with each token being transforms each document into a list of tokens, with each token being
represented by its embedding in the vector space. represented by its embedding in the vector space.
Next, we need a method that will generate pairs of entities that we want to Next, we need a method that generates pairs of entities that we want to classify
classify as being related or not. These candidate pairs are typically formed as being related or not. As these candidate pairs are typically formed within
within one document, which means we'll have a function that takes a `Doc` as one document, this function takes a `Doc` as input and outputs a `List` of
input and outputs a `List` of `Span` tuples. For instance, a very `Span` tuples. For instance, a very straightforward implementation would be to
straightforward implementation would be to just take any two entities from the just take any two entities from the same document:
same document:
```python ```python
def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]: def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
@ -549,12 +548,12 @@ def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
> >
> [model.get_candidates] > [model.get_candidates]
> @misc = "rel_cand_generator.v2" > @misc = "rel_cand_generator.v2"
> max_length = 6 > max_length = 20
> ``` > ```
But we could also refine this further by excluding relations of an entity with But we could also refine this further by excluding relations of an entity with
itself, and posing a maximum distance (in number of tokens) between two itself, and posing a maximum distance (in number of tokens) between two
entities. We'll register this function in the entities. We register this function in the
[`@misc` registry](/api/top-level#registry) so we can refer to it from the [`@misc` registry](/api/top-level#registry) so we can refer to it from the
config, and easily swap it out for any other candidate generation function. config, and easily swap it out for any other candidate generation function.
@ -573,10 +572,10 @@ def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span
return get_candidates return get_candidates
``` ```
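The body of the registered candidate generator is elided by the hunk above. As a rough illustration of the refinement described in the text (no pairing of an entity with itself, and a maximum token distance), here is a minimal sketch. It does not depend on spaCy: `FakeSpan` and `make_candidate_generator` are hypothetical stand-ins, and only the `max_length` parameter and the `get_candidates` name come from the documentation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class FakeSpan:
    """Hypothetical stand-in for spaCy's Span: token start offset and text only."""
    start: int
    text: str


Pair = Tuple[FakeSpan, FakeSpan]


def make_candidate_generator(max_length: int) -> Callable[[List[FakeSpan]], List[Pair]]:
    """Return a function that pairs up entities lying at most max_length tokens apart."""

    def get_candidates(ents: List[FakeSpan]) -> List[Pair]:
        candidates = []
        for ent1 in ents:
            for ent2 in ents:
                # Skip pairing an entity with itself.
                if ent1 == ent2:
                    continue
                # Keep the ordered pair only if the start offsets are close enough.
                if abs(ent2.start - ent1.start) <= max_length:
                    candidates.append((ent1, ent2))
        return candidates

    return get_candidates


ents = [FakeSpan(0, "Amsterdam"), FakeSpan(6, "Netherlands"), FakeSpan(30, "Paris")]
pairs = make_candidate_generator(max_length=20)(ents)
pair_texts = [(a.text, b.text) for a, b in pairs]
```

In this sketch the pairs are ordered, so both `("Amsterdam", "Netherlands")` and `("Netherlands", "Amsterdam")` are produced, while `"Paris"` is too far from either entity to form a candidate.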
Finally, we'll require a method that transforms the candidate pairs of entities Finally, we require a method that transforms the candidate entity pairs into a
into a 2D tensor using the specified Tok2Vec function, and this `Floats2d` 2D tensor using the specified `Tok2Vec` function. The resulting `Floats2d`
object will then be processed by a final `output_layer` of the network. Taking object will then be processed by a final `output_layer` of the network. Putting
all this together, we can define our relation model like this in the config: all this together, we can define our relation model in a config file as such:
``` ```
[model] [model]
@ -588,7 +587,7 @@ all this together, we can define our relation model like this in the config:
[model.get_candidates] [model.get_candidates]
@misc = "rel_cand_generator.v2" @misc = "rel_cand_generator.v2"
max_length = 6 max_length = 20
[model.create_candidate_tensor] [model.create_candidate_tensor]
@misc = "rel_cand_tensor.v1" @misc = "rel_cand_tensor.v1"
@ -600,7 +599,7 @@ max_length = 6
<!-- TODO: Link to project for implementation details --> <!-- TODO: Link to project for implementation details -->
When creating this model, we'll store the custom functions as When creating this model, we store the custom functions as
[attributes](https://thinc.ai/docs/api-model#properties) and the sublayers as [attributes](https://thinc.ai/docs/api-model#properties) and the sublayers as
references, so we can access them easily: references, so we can access them easily:
@ -614,7 +613,7 @@ get_candidates = model.attrs["get_candidates"]
#### Step 2: Implementing the pipeline component {#component-rel-pipe} #### Step 2: Implementing the pipeline component {#component-rel-pipe}
To use our new relation extraction model as part of a custom component, we To use our new relation extraction model as part of a custom component, we
create a subclass of [`Pipe`](/api/pipe) that will hold the model: create a subclass of [`Pipe`](/api/pipe) that holds the model:
```python ```python
from spacy.pipeline import Pipe from spacy.pipeline import Pipe
@ -624,6 +623,9 @@ class RelationExtractor(Pipe):
self.model = model self.model = model
... ...
def update(self, examples, ...):
...
def predict(self, docs): def predict(self, docs):
... ...
@ -631,18 +633,19 @@ class RelationExtractor(Pipe):
... ...
``` ```
Before the model can be used however, it needs to be Before the model can be used, it needs to be
[initialized](/api/pipe#initialize). This function recieves either the full [initialized](/api/pipe#initialize). This function receives either the full
training data set, or a representative sample. The training data can be used training data set, or a representative sample. This data set can be used to
to deduce all relevant labels. Alternatively, a list of labels can be provided, deduce all relevant labels. Alternatively, a list of labels can be provided, or
or a script can call `rel_component.add_label()` to add each label separately. a script can call `rel_component.add_label()` directly.
The number of labels will define the output dimensionality of the network, The number of labels defines the output dimensionality of the network, and will
and will be used to do be used to do [shape inference](https://thinc.ai/docs/usage-models#validation)
[shape inference](https://thinc.ai/docs/usage-models#validation) throughout throughout the layers of the neural network. This is triggered by calling
the layers of the neural network. This is triggerd by calling `model.initialize`. `model.initialize`.
```python ```python
### {highlight="12,18,22"}
from itertools import islice from itertools import islice
def initialize( def initialize(
@ -667,17 +670,20 @@ def initialize(
self.model.initialize(X=doc_sample, Y=label_sample) self.model.initialize(X=doc_sample, Y=label_sample)
``` ```
The `initialize` method will be triggered whenever this component is part of an The `initialize` method is triggered whenever this component is part of an `nlp`
`nlp` pipeline, and `nlp.initialize()` is invoked. After doing so, the pipeline pipeline, and [`nlp.initialize()`](/api/language#initialize) is invoked. After
component and its internal model can be trained and used to make predictions. doing so, the pipeline component and its internal model can be trained and used
to make predictions.
During training the function [`update`](/api/pipe#update) is invoked which delegates to During training, the function [`update`](/api/pipe#update) is invoked which
[`self.model.begin_update`](https://thinc.ai/docs/api-model#begin_update) and delegates to
needs a function [`get_loss`](/api/pipe#get_loss) that will calculate the [`self.model.begin_update`](https://thinc.ai/docs/api-model#begin_update) and a
loss for a batch of examples, as well as the gradient of loss that will be used to update [`get_loss`](/api/pipe#get_loss) function that calculate the loss for a batch of
the weights of the model layers. examples, as well as the gradient of loss that will be used to update the
weights of the model layers.
```python ```python
### {highlight="12-14"}
def update( def update(
self, self,
examples: Iterable[Example], examples: Iterable[Example],
@ -697,13 +703,13 @@ def update(
return losses return losses
``` ```
Thinc provides some [loss functions](https://thinc.ai/docs/api-loss) that can be used Thinc provides several [loss functions](https://thinc.ai/docs/api-loss) that can
for the implementation of the `get_loss` function. be used for the implementation of the `get_loss` function.
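To make the role of `get_loss` more concrete, the following is a minimal sketch of a squared-error loss in plain Python, independent of Thinc. The function name `get_loss_sketch` and the list-of-lists representation are assumptions made for illustration; an actual component would more likely operate on arrays and could use one of Thinc's loss functions instead.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def get_loss_sketch(scores: Matrix, truths: Matrix) -> Tuple[float, Matrix]:
    """Return (loss, gradient) for a batch of predicted scores and gold values.

    Each row is one candidate pair, each column one relation label.
    """
    if len(scores) != len(truths):
        raise ValueError("scores and truths must have the same number of rows")
    # Difference between prediction and gold value, per pair and label.
    # This equals the gradient of the squared error up to a constant factor.
    gradient = [
        [s - t for s, t in zip(score_row, truth_row)]
        for score_row, truth_row in zip(scores, truths)
    ]
    # Sum the squared differences per row, then average over the batch.
    row_errors = [sum(g * g for g in row) for row in gradient]
    loss = sum(row_errors) / len(row_errors) if row_errors else 0.0
    return loss, gradient


scores = [[0.5, 0.0], [1.0, 1.0]]
truths = [[1.0, 0.0], [1.0, 0.0]]
loss, d_scores = get_loss_sketch(scores, truths)
```

The returned gradient has the same shape as the score matrix, which is what allows it to be passed back through the model's layers during the update step.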
When the internal model is trained, the component can be used to make novel predictions. When the internal model is trained, the component can be used to make novel
The [`predict`](/api/pipe#predict) function needs to be implemented for each predictions. The [`predict`](/api/pipe#predict) function needs to be implemented
subclass of `Pipe`. In our case, we can simply delegate to the internal model's for each subclass of `Pipe`. In our case, we can simply delegate to the internal
[predict](https://thinc.ai/docs/api-model#predict) function: model's [predict](https://thinc.ai/docs/api-model#predict) function:
```python ```python
def predict(self, docs: Iterable[Doc]) -> Floats2d: def predict(self, docs: Iterable[Doc]) -> Floats2d:
@ -711,24 +717,24 @@ def predict(self, docs: Iterable[Doc]) -> Floats2d:
return self.model.ops.asarray(predictions) return self.model.ops.asarray(predictions)
``` ```
The other method that needs to be implemented, is The final method that needs to be implemented, is
[`set_annotations`](/api/pipe#set_annotations). It takes the predicted scores, [`set_annotations`](/api/pipe#set_annotations). This function takes the
and modifies the given `Doc` object in place to hold the predictions. For our predictions, and modifies the given `Doc` object in place to store them. For our
relation extraction component, we'll store the data as a dictionary in a custom relation extraction component, we store the data as a dictionary in a custom
extension attribute `doc._.rel`. As keys, we represent the candidate pair by the extension attribute `doc._.rel`. As keys, we represent the candidate pair by the
start offsets of each entity, as this defines an entity pair uniquely within one start offsets of each entity, as this defines an entity pair uniquely within one
document. document.
To interpret the scores predicted by the REL model correctly, we need to To interpret the scores predicted by the REL model correctly, we need to refer
refer to the model's `get_candidates` function that originally defined which to the model's `get_candidates` function that defined which pairs of entities
pairs of entities would be run through the model, so that the scores can be were relevant candidates, so that the predictions can be linked to those exact
related to those exact entities: entities:
> #### Example output > #### Example output
> >
> ```python > ```python
> doc = nlp("Amsterdam is the capital of the Netherlands.") > doc = nlp("Amsterdam is the capital of the Netherlands.")
> print(f"spans: {[(e.start, e.text, e.label_) for e in doc.ents]}") > print(f"spans: [(e.start, e.text, e.label_) for e in doc.ents]")
> for value, rel_dict in doc._.rel.items(): > for value, rel_dict in doc._.rel.items():
> print(f"{value}: {rel_dict}") > print(f"{value}: {rel_dict}")
> ``` > ```
@ -740,6 +746,7 @@ related to those exact entities:
> ``` > ```
```python ```python
### {highlight="5-6,10"}
def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d): def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d):
c = 0 c = 0
get_candidates = self.model.attrs["get_candidates"] get_candidates = self.model.attrs["get_candidates"]
@ -753,8 +760,8 @@ def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d):
c += 1 c += 1
``` ```
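The bookkeeping that links each row of the prediction matrix back to its entity pair can be sketched without spaCy. The example below assumes that candidates are given as `(start1, start2)` offset tuples per document, in the same order in which the model scored them. `link_predictions` and the label names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Offsets = Tuple[int, int]


def link_predictions(
    candidates_per_doc: List[List[Offsets]],
    predictions: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> List[Dict[Offsets, Dict[str, float]]]:
    """Map each prediction row to its candidate pair, one dictionary per document."""
    results = []
    row = 0  # running index into the prediction matrix, shared across documents
    for candidates in candidates_per_doc:
        rel: Dict[Offsets, Dict[str, float]] = {}
        for offsets in candidates:
            # One score per label for this candidate pair.
            rel[offsets] = dict(zip(labels, predictions[row]))
            row += 1
        results.append(rel)
    return results


labels = ["CAPITAL_OF", "LOCATED_IN"]  # hypothetical label names
candidates_per_doc = [[(0, 5), (5, 0)], [], [(2, 7)]]
predictions = [[0.9, 0.1], [0.2, 0.3], [0.5, 0.5]]
rel_per_doc = link_predictions(candidates_per_doc, predictions, labels)
```

Because the row counter keeps running across documents, a document without candidates simply yields an empty dictionary and does not shift the alignment for the documents that follow.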
Under the hood, when the pipe is applied to a document, it will delegate to these Under the hood, when the pipe is applied to a document, it delegates to the
two methods: `predict` and `set_annotations` functions:
```python ```python
def __call__(self, Doc doc): def __call__(self, Doc doc):
@ -765,9 +772,8 @@ def __call__(self, Doc doc):
Once our `Pipe` subclass is fully implemented, we can Once our `Pipe` subclass is fully implemented, we can
[register](http://localhost:8000/usage/processing-pipelines#custom-components-factories) [register](http://localhost:8000/usage/processing-pipelines#custom-components-factories)
the component with the the component with the `Language.factory` decorator. This enables the creation
`Language.factory` decorator. This will enable the creation of the component with of the component with `nlp.add_pipe`, or via the config.
`nlp.add_pipe`, or via the config.
> ``` > ```
> >