mirror of https://github.com/explosion/spaCy.git
tok2vec layer
parent 2c4b2ee5e9
commit 08ad349a18
@@ -489,51 +489,80 @@ with Model.define_operators({">>": chain}):
 In addition to [swapping out](#swap-architectures) default models in built-in
 components, you can also implement an entirely new,
 [trainable pipeline component](usage/processing-pipelines#trainable-components)
-from scratch. This can be done by creating a new class inheriting from [`Pipe`](/api/pipe),
-and linking it up to your custom model implementation.
+from scratch. This can be done by creating a new class inheriting from
+[`Pipe`](/api/pipe), and linking it up to your custom model implementation.

 ### Example: Pipeline component for relation extraction {#component-rel}

-This section will run through an example of implementing a novel relation extraction
-component from scratch. As a first step, we need a method that will generate pairs of
-entities that we want to classify as being related or not. These candidate pairs are
-typically formed within one document, which means we'll have a function that takes a
-`Doc` as input and outputs a `List` of `Span` tuples. In this example, we will focus
-on binary relation extraction, i.e. the tuple will be of length 2.
-
-We register this function in the 'misc' register so we can easily refer to it from the config,
-and allow swapping it out for any candidate
-generation function. For instance, a very straightforward implementation would be to just
-take any two entities from the same document:
+This section will run through an example of implementing a novel relation
+extraction component from scratch. As a first step, we need a method that will
+generate pairs of entities that we want to classify as being related or not.
+These candidate pairs are typically formed within one document, which means
+we'll have a function that takes a `Doc` as input and outputs a `List` of `Span`
+tuples. In this example, we will focus on binary relation extraction, i.e. the
+tuple will be of length 2. For instance, a very straightforward implementation
+would be to just take any two entities from the same document:

 ```python
-@registry.misc.register("rel_cand_generator.v1")
-def create_candidate_indices() -> Callable[[Doc], List[Tuple[Span, Span]]]:
-    def get_candidate_indices(doc: "Doc"):
-        indices = []
-        for ent1 in doc.ents:
-            for ent2 in doc.ents:
-                indices.append((ent1, ent2))
-        return indices
-    return get_candidate_indices
+def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
+    candidates = []
+    for ent1 in doc.ents:
+        for ent2 in doc.ents:
+            candidates.append((ent1, ent2))
+    return candidates
 ```
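To make the behavior of the naive generator concrete, here is a minimal sketch that runs the same nested loop over a stand-in for a `Doc`. `FakeSpan` and `FakeDoc` are hypothetical helpers invented for this illustration so that the snippet runs without spaCy installed; they are not part of the commit or of spaCy's API. With a real pipeline you would pass a `Doc` whose `ents` were set by an entity recognizer.

```python
from typing import List, NamedTuple, Tuple


class FakeSpan(NamedTuple):
    """Hypothetical stand-in for spaCy's Span: text plus token offsets."""
    text: str
    start: int
    end: int


class FakeDoc(NamedTuple):
    """Hypothetical stand-in for spaCy's Doc: only the `ents` attribute."""
    ents: Tuple[FakeSpan, ...]


def get_candidates(doc: FakeDoc) -> List[Tuple[FakeSpan, FakeSpan]]:
    # Same logic as the documented function: every ordered pair of entities.
    candidates = []
    for ent1 in doc.ents:
        for ent2 in doc.ents:
            candidates.append((ent1, ent2))
    return candidates


doc = FakeDoc(
    ents=(FakeSpan("Ada", 0, 1), FakeSpan("London", 4, 5), FakeSpan("IBM", 9, 10))
)
pairs = get_candidates(doc)
self_pairs = [pair for pair in pairs if pair[0] == pair[1]]
print(len(pairs))       # 3 entities give 3 * 3 = 9 ordered pairs
print(len(self_pairs))  # 3 of them pair an entity with itself
```

The self-pairs and the quadratic growth in candidates are what the refinement in the next part of the diff addresses.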

-But we could also refine this further by excluding relations of an entity with itself,
-and posing a maximum distance (in number of tokens) between two entities:
+But we could also refine this further by excluding relations of an entity with
+itself, and posing a maximum distance (in number of tokens) between two
+entities. We'll also register this function in the
+[`@misc` registry](/api/top-level#registry) so we can refer to it from the
+config, and easily swap it out for any other candidate generation function.
+
+> ```
+> [get_candidates]
+> @misc = "rel_cand_generator.v2"
+> max_length = 6
+> ```

 ```python
 ### {highlight="1,2,7,8"}
 @registry.misc.register("rel_cand_generator.v2")
 def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
-    def get_candidate_indices(doc: "Doc"):
-        indices = []
+    def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
+        candidates = []
         for ent1 in doc.ents:
             for ent2 in doc.ents:
                 if ent1 != ent2:
                     if max_length and abs(ent2.start - ent1.start) <= max_length:
-                        indices.append((ent1, ent2))
-        return indices
-    return get_candidate_indices
+                        candidates.append((ent1, ent2))
+        return candidates
+    return get_candidates
+```
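The registered factory and its config block can be tried out with the sketch below. It assumes a hypothetical miniature registry (`MISC` and `register`) in place of spaCy's `registry.misc`, and a hypothetical `FakeSpan` in place of `Span`, purely so the example is self-contained. For brevity the inner function takes the list of entities directly instead of a `Doc`.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class FakeSpan(NamedTuple):
    """Hypothetical stand-in for spaCy's Span (not the real class)."""
    text: str
    start: int
    end: int


Pair = Tuple[FakeSpan, FakeSpan]

# Hypothetical miniature registry; in spaCy this role is played by `registry.misc`.
MISC: Dict[str, Callable] = {}


def register(name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        MISC[name] = func
        return func
    return decorator


@register("rel_cand_generator.v2")
def create_candidate_indices(max_length: int) -> Callable[[List[FakeSpan]], List[Pair]]:
    def get_candidates(ents: List[FakeSpan]) -> List[Pair]:
        candidates = []
        for ent1 in ents:
            for ent2 in ents:
                if ent1 != ent2:
                    # Keep the pair only if the entity starts are close enough.
                    if max_length and abs(ent2.start - ent1.start) <= max_length:
                        candidates.append((ent1, ent2))
        return candidates
    return get_candidates


# Roughly what resolving the config block amounts to:
# look up the registered name and pass the settings as keyword arguments.
get_candidates = MISC["rel_cand_generator.v2"](max_length=6)
ents = [FakeSpan("Ada", 0, 1), FakeSpan("London", 4, 5), FakeSpan("IBM", 12, 13)]
pairs = get_candidates(ents)
print([(a.text, b.text) for a, b in pairs])
# [('Ada', 'London'), ('London', 'Ada')]
```

Because the condition is written as `max_length and ...`, a `max_length` of `0` yields no candidates at all rather than switching the distance limit off. Note also that each unordered pair appears twice, once per direction, which suits directed relations.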
+
+> ```
+> [tok2vec]
+> @architectures = "spacy.HashEmbedCNN.v1"
+> pretrained_vectors = null
+> width = 96
+> depth = 2
+> embed_size = 300
+> window_size = 1
+> maxout_pieces = 3
+> subword_features = true
+> ```
+
+Next, we'll assume we have access to an
+[embedding layer](/usage/embeddings-transformers) such as a
+[`Tok2Vec`](/api/tok2vec) component or [`Transformer`](/api/transformer). This
+layer is assumed to be of type `Model[List["Doc"], List[Floats2d]]` as it
+transforms a list of documents into a list of 2D vectors. Further, this
+`tok2vec` component will be trainable, which means that, following the Thinc
+paradigm, we'll apply it to some input, and receive the predicted results as
+well as a callback to perform backpropagation:
+
+```python
+tok2vec = model.get_ref("tok2vec")
+tokvecs, bp_tokvecs = tok2vec(docs, is_train=True)
 ```
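The "predicted results plus a backpropagation callback" pattern described in the last added paragraph can be illustrated without Thinc. The sketch below is a toy layer in plain Python, not Thinc's implementation: `scale_forward` and its `scale` parameter are hypothetical names chosen for this example, and the layer simply multiplies its input by a constant.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]


def scale_forward(X: Matrix, scale: float, is_train: bool) -> Tuple[Matrix, Callable[[Matrix], Matrix]]:
    """Toy layer: compute Y = scale * X and return a callback for the backward pass.

    `is_train` is unused here; it is kept only to mirror the call in the text.
    """
    Y = [[scale * value for value in row] for row in X]

    def backprop(dY: Matrix) -> Matrix:
        # Chain rule for Y = scale * X: dL/dX = scale * dL/dY.
        return [[scale * grad for grad in row] for row in dY]

    return Y, backprop


X = [[1.0, 2.0], [3.0, 4.0]]
Y, bp_Y = scale_forward(X, scale=0.5, is_train=True)

# Pretend the loss gradient with respect to Y is all ones.
dX = bp_Y([[1.0, 1.0], [1.0, 1.0]])
print(Y)   # [[0.5, 1.0], [1.5, 2.0]]
print(dX)  # [[0.5, 0.5], [0.5, 0.5]]
```

The callback closes over whatever the forward pass needs to remember, so the caller only has to hand the output gradient back. This is the role `bp_tokvecs` plays in the snippet from the diff.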