tok2vec layer

This commit is contained in:
svlandeg 2020-10-04 00:08:02 +02:00
parent 2c4b2ee5e9
commit 08ad349a18
1 changed file with 58 additions and 29 deletions


@@ -489,51 +489,80 @@ with Model.define_operators({">>": chain}):
In addition to [swapping out](#swap-architectures) default models in built-in
components, you can also implement an entirely new,
[trainable pipeline component](/usage/processing-pipelines#trainable-components)
from scratch. This can be done by creating a new class inheriting from
[`Pipe`](/api/pipe), and linking it up to your custom model implementation.
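
To give a rough idea of the shape such a class takes, here is a bare-bones
sketch. The method names follow the [`Pipe`](/api/pipe) API, but
`RelationExtractor` and everything inside it is a placeholder for your own
implementation:

```python
from spacy.pipeline import Pipe

class RelationExtractor(Pipe):
    def __init__(self, vocab, model, name="rel"):
        self.vocab = vocab
        self.model = model  # the custom Thinc model powering the component
        self.name = name

    def predict(self, docs):
        # Run the model over a batch of Doc objects and return the scores
        ...

    def set_annotations(self, docs, predictions):
        # Write the predictions back onto the Doc objects
        ...
```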

### Example: Pipeline component for relation extraction {#component-rel}

This section will run through an example of implementing a novel relation
extraction component from scratch. As a first step, we need a method that will
generate pairs of entities that we want to classify as being related or not.
These candidate pairs are typically formed within one document, which means
we'll have a function that takes a `Doc` as input and outputs a `List` of
`Span` tuples. In this example, we will focus on binary relation extraction,
i.e. the tuple will be of length 2. For instance, a very straightforward
implementation would be to just take any two entities from the same document:

```python
from typing import List, Tuple
from spacy.tokens import Doc, Span

def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]:
    # Return all candidate entity pairs within the same document
    candidates = []
    for ent1 in doc.ents:
        for ent2 in doc.ents:
            candidates.append((ent1, ent2))
    return candidates
```
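
As a quick illustration of what this returns, here is a hypothetical usage
example, assuming the `en_core_web_sm` pipeline is installed so that
`doc.ents` is populated:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline that sets doc.ents works
doc = nlp("Alex Smith worked at Acme Corp.")
# With two entities, this yields all four ordered pairs, including
# self-pairs such as (Alex Smith, Alex Smith)
print(get_candidates(doc))
```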

But we could also refine this further by excluding relations of an entity with
itself, and imposing a maximum distance (in number of tokens) between two
entities. We'll also register this function in the
[`@misc` registry](/api/top-level#registry) so we can refer to it from the
config, and easily swap it out for any other candidate generation function.

> ```
> [get_candidates]
> @misc = "rel_cand_generator.v2"
> max_length = 6
> ```

```python
### {highlight="1,2,7,8"}
@registry.misc.register("rel_cand_generator.v2")
def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
        candidates = []
        for ent1 in doc.ents:
            for ent2 in doc.ents:
                if ent1 != ent2:
                    if max_length and abs(ent2.start - ent1.start) <= max_length:
                        candidates.append((ent1, ent2))
        return candidates
    return get_candidates
```
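
To see how the registered name and the config arguments fit together, here is
a minimal sketch of resolving the factory by hand; in a real pipeline, spaCy
resolves the `[get_candidates]` block of the config for you:

```python
from spacy.util import registry

# Look up the registered factory by name, the way the config resolution
# would, and call it with its arguments to obtain the actual function
constructor = registry.misc.get("rel_cand_generator.v2")
get_candidates = constructor(max_length=6)
```
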
> ```
> [tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 2
> embed_size = 300
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> ```
Next, we'll assume we have access to an
[embedding layer](/usage/embeddings-transformers) such as a
[`Tok2Vec`](/api/tok2vec) component or [`Transformer`](/api/transformer). This
layer is assumed to be of type `Model[List["Doc"], List[Floats2d]]` as it
transforms a list of documents into a list of 2D vectors. Further, this
`tok2vec` component will be trainable, which means that, following the Thinc
paradigm, we'll apply it to some input, and receive the predicted results as
well as a callback to perform backpropagation:
```python
tok2vec = model.get_ref("tok2vec")
tokvecs, bp_tokvecs = tok2vec(docs, is_train=True)
```
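
Once gradients with respect to these token vectors are available (in the full
component they would come from the relation classification loss), passing them
to `bp_tokvecs` backpropagates through the embedding layer. A minimal sketch
with zero-valued placeholder gradients:

```python
# Placeholder gradients with the same shapes as the predicted vectors;
# in a real component these would come from the relation model's loss
d_tokvecs = [model.ops.alloc2f(*tokvec.shape) for tokvec in tokvecs]
bp_tokvecs(d_tokvecs)
```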