tok2vec layer

This commit is contained in:
svlandeg 2020-10-04 00:08:02 +02:00
parent 2c4b2ee5e9
commit 08ad349a18
1 changed file with 58 additions and 29 deletions


@@ -489,51 +489,80 @@ with Model.define_operators({">>": chain}):
In addition to [swapping out](#swap-architectures) default models in built-in
components, you can also implement an entirely new,
[trainable pipeline component](/usage/processing-pipelines#trainable-components)
from scratch. This can be done by creating a new class inheriting from
[`Pipe`](/api/pipe), and linking it up to your custom model implementation.
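
To give a rough idea of the shape such a class takes, here is a bare-bones
sketch. The method names follow the [`Pipe`](/api/pipe) API, but
`RelationExtractor` and everything inside it is a placeholder for your own
implementation:

```python
from spacy.pipeline import Pipe

class RelationExtractor(Pipe):
    def __init__(self, vocab, model, name="rel"):
        self.vocab = vocab
        self.model = model  # the custom Thinc model powering the component
        self.name = name

    def predict(self, docs):
        # Run the model over a batch of Doc objects and return the scores
        ...

    def set_annotations(self, docs, predictions):
        # Write the predictions back onto the Doc objects
        ...
```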

### Example: Pipeline component for relation extraction {#component-rel}

This section will run through an example of implementing a novel relation
extraction component from scratch. As a first step, we need a method that will
generate pairs of entities that we want to classify as being related or not.
These candidate pairs are typically formed within one document, which means
we'll have a function that takes a `Doc` as input and outputs a `List` of
`Span` tuples. In this example, we will focus on binary relation extraction,
i.e. the tuple will be of length 2. For instance, a very straightforward
implementation would be to just take any two entities from the same document:

```python
from typing import List, Tuple
from spacy.tokens import Doc, Span

def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]:
    # Return all candidate entity pairs within the same document
    candidates = []
    for ent1 in doc.ents:
        for ent2 in doc.ents:
            candidates.append((ent1, ent2))
    return candidates
```
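
As a quick illustration of what this returns, here is a hypothetical usage
example, assuming the `en_core_web_sm` pipeline is installed so that
`doc.ents` is populated:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline that sets doc.ents works
doc = nlp("Alex Smith worked at Acme Corp.")
# With two entities, this yields all four ordered pairs, including
# self-pairs such as (Alex Smith, Alex Smith)
print(get_candidates(doc))
```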

But we could also refine this further by excluding relations of an entity with
itself, and imposing a maximum distance (in number of tokens) between two
entities. We'll also register this function in the
[`@misc` registry](/api/top-level#registry) so we can refer to it from the
config, and easily swap it out for any other candidate generation function.

> ```
> [get_candidates]
> @misc = "rel_cand_generator.v2"
> max_length = 6
> ```

```python
### {highlight="1,2,7,8"}
@registry.misc.register("rel_cand_generator.v2")
def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
        candidates = []
        for ent1 in doc.ents:
            for ent2 in doc.ents:
                if ent1 != ent2:
                    if max_length and abs(ent2.start - ent1.start) <= max_length:
                        candidates.append((ent1, ent2))
        return candidates
    return get_candidates
```
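
To see how the registered name and the config arguments fit together, here is
a minimal sketch of resolving the factory by hand; in a real pipeline, spaCy
resolves the `[get_candidates]` block of the config for you:

```python
from spacy.util import registry

# Look up the registered factory by name, the way the config resolution
# would, and call it with its arguments to obtain the actual function
constructor = registry.misc.get("rel_cand_generator.v2")
get_candidates = constructor(max_length=6)
```
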
> ```
> [tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 2
> embed_size = 300
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> ```
Next, we'll assume we have access to an
[embedding layer](/usage/embeddings-transformers) such as a
[`Tok2Vec`](/api/tok2vec) component or [`Transformer`](/api/transformer). This
layer is assumed to be of type `Model[List["Doc"], List[Floats2d]]` as it
transforms a list of documents into a list of 2D vectors. Further, this
`tok2vec` component will be trainable, which means that, following the Thinc
paradigm, we'll apply it to some input, and receive the predicted results as
well as a callback to perform backpropagation:
```python
tok2vec = model.get_ref("tok2vec")
tokvecs, bp_tokvecs = tok2vec(docs, is_train=True)
```
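
Once gradients with respect to these token vectors are available (in the full
component they would come from the relation classification loss), passing them
to `bp_tokvecs` backpropagates through the embedding layer. A minimal sketch
with zero-valued placeholder gradients:

```python
# Placeholder gradients with the same shapes as the predicted vectors;
# in a real component these would come from the relation model's loss
d_tokvecs = [model.ops.alloc2f(*tokvec.shape) for tokvec in tokvecs]
bp_tokvecs(d_tokvecs)
```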