various small fixes

2020-10-05 01:05:37 +02:00 · 9a6c9b133b
parent 52b660e9dc
commit 9a6c9b133b
1 changed file with 74 additions and 68 deletions
--- a/website/docs/usage/layers-architectures.md
+++ b/website/docs/usage/layers-architectures.md
@@ -288,7 +288,7 @@ those parts of the network.
 To use our custom model including the PyTorch subnetwork, all we need to do is
 register the architecture using the
-[`architectures` registry](/api/top-level#registry). This will assign the
+[`architectures` registry](/api/top-level#registry). This assigns the
 architecture a name so spaCy knows how to find it, and allows passing in
 arguments like hyperparameters via the [config](/usage/training#config). The
 full example then becomes:
@@ -488,27 +488,27 @@ with Model.define_operators({">>": chain}):
 In addition to [swapping out](#swap-architectures) default models in built-in
 components, you can also implement an entirely new,
-[trainable pipeline component](usage/processing-pipelines#trainable-components)
+[trainable pipeline component](/usage/processing-pipelines#trainable-components)
 from scratch. This can be done by creating a new class inheriting from
 [`Pipe`](/api/pipe), and linking it up to your custom model implementation.
 ### Example: Pipeline component for relation extraction {#component-rel}
 This section outlines an example use-case of implementing a novel relation
-extraction component from scratch. We assume we want to implement a binary
+extraction component from scratch. We'll implement a binary relation extraction
-relation extraction method that determines whether two entities in a document
+method that determines whether or not two entities in a document are related,
-are related or not, and if so, with what type of relation. We'll allow multiple
+and if so, what type of relation. We'll allow multiple types of relations
-types of relations between two such entities - i.e. it is a multi-label setting.
+between two such entities (multi-label setting).
 There are two major steps required: first, we need to
 [implement a machine learning model](#component-rel-model) specific to this
-task, and then we'll use this model to
+task, and subsequently we use this model to
 [implement a custom pipeline component](#component-rel-pipe).
 #### Step 1: Implementing the Model {#component-rel-model}
-We'll need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes
+We need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes a
-a list of documents as input, and outputs a two-dimensional matrix of scores:
+list of documents as input, and outputs a two-dimensional matrix of predictions:
 ```python
@registry.architectures.register("rel_model.v1")
@@ -519,17 +519,16 @@ def create_relation_model(...) -> Model[List[Doc], Floats2d]:
 The first layer in this model will typically be an
 [embedding layer](/usage/embeddings-transformers) such as a
-[`Tok2Vec`](/api/tok2vec) component or [`Transformer`](/api/transformer). This
+[`Tok2Vec`](/api/tok2vec) component or a [`Transformer`](/api/transformer). This
-layer is assumed to be of type `Model[List["Doc"], List[Floats2d]]` as it
+layer is assumed to be of type ~~Model[List[Doc], List[Floats2d]]~~ as it
 transforms each document into a list of tokens, with each token being
 represented by its embedding in the vector space.
-Next, we need a method that will generate pairs of entities that we want to
+Next, we need a method that generates pairs of entities that we want to classify
-classify as being related or not. These candidate pairs are typically formed
+as being related or not. As these candidate pairs are typically formed within
-within one document, which means we'll have a function that takes a `Doc` as
+one document, this function takes a `Doc` as input and outputs a `List` of
-input and outputs a `List` of `Span` tuples. For instance, a very
+`Span` tuples. For instance, a very straightforward implementation would be to
-straightforward implementation would be to just take any two entities from the
+just take any two entities from the same document:
 same document:
 ```python
 def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
@@ -549,12 +548,12 @@ def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
 >
 > [model.get_candidates]
 > @misc = "rel_cand_generator.v2"
-> max_length = 6
+> max_length = 20
 > ```
 But we could also refine this further by excluding relations of an entity with
 itself, and posing a maximum distance (in number of tokens) between two
-entities. We'll register this function in the
+entities. We register this function in the
 [`@misc` registry](/api/top-level#registry) so we can refer to it from the
 config, and easily swap it out for any other candidate generation function.
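
The refinement described above (no pairing of an entity with itself, and a maximum token distance) can be sketched without spaCy. This is a minimal illustration, not the implementation from the docs: the `Span` tuple is a hypothetical stand-in for spaCy's `Span`, `create_candidate_generator` is an assumed name, and the entity offsets are made up. The `@misc` registration is left out so the sketch runs on its own.

```python
from typing import Callable, List, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical stand-in for spaCy's Span: token offsets plus text."""
    start: int
    end: int
    text: str


Candidates = List[Tuple[Span, Span]]


def create_candidate_generator(max_length: int) -> Callable[[List[Span]], Candidates]:
    """Return a function that pairs entities at most `max_length` tokens apart."""

    def get_candidates(ents: List[Span]) -> Candidates:
        candidates = []
        for ent1 in ents:
            for ent2 in ents:
                # Skip self-pairs and pairs whose start offsets are too far apart.
                if ent1 != ent2 and abs(ent2.start - ent1.start) <= max_length:
                    candidates.append((ent1, ent2))
        return candidates

    return get_candidates


# Illustrative entities only; the offsets are invented for this sketch.
ents = [Span(0, 1, "Amsterdam"), Span(6, 7, "Netherlands"), Span(40, 41, "Europe")]
get_candidates = create_candidate_generator(max_length=20)
pairs = get_candidates(ents)
```

With `max_length = 20`, only the two ordered pairs between the first two entities survive; the third entity is too far from both. Keeping both orderings of a pair is a design choice of this sketch, which suits directed relations.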
@@ -573,10 +572,10 @@ def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span
    return get_candidates
 ```
-Finally, we'll require a method that transforms the candidate pairs of entities
+Finally, we require a method that transforms the candidate entity pairs into a
-into a 2D tensor using the specified Tok2Vec function, and this `Floats2d`
+2D tensor using the specified `Tok2Vec` function. The resulting `Floats2d`
-object will then be processed by a final `output_layer` of the network. Taking
+object will then be processed by a final `output_layer` of the network. Putting
-all this together, we can define our relation model like this in the config:
+all this together, we can define our relation model in a config file as such:
 ```
 [model]
@@ -588,7 +587,7 @@ all this together, we can define our relation model like this in the config:
 [model.get_candidates]
@misc = "rel_cand_generator.v2"
-max_length = 6
+max_length = 20
 [model.create_candidate_tensor]
@misc = "rel_cand_tensor.v1"
@@ -600,7 +599,7 @@ max_length = 6
 <!-- TODO: Link to project for implementation details -->
-When creating this model, we'll store the custom functions as
+When creating this model, we store the custom functions as
 [attributes](https://thinc.ai/docs/api-model#properties) and the sublayers as
 references, so we can access them easily:
@@ -614,7 +613,7 @@ get_candidates = model.attrs["get_candidates"]
 #### Step 2: Implementing the pipeline component {#component-rel-pipe}
 To use our new relation extraction model as part of a custom component, we
-create a subclass of [`Pipe`](/api/pipe) that will hold the model:
+create a subclass of [`Pipe`](/api/pipe) that holds the model:
 ```python
 from spacy.pipeline import Pipe
@@ -624,6 +623,9 @@ class RelationExtractor(Pipe):
        self.model = model
        ...
    def update(self, examples, ...):
        ...
    def predict(self, docs):
        ...
@@ -631,18 +633,19 @@ class RelationExtractor(Pipe):
         ...
 ```
-Before the model can be used however, it needs to be 
+Before the model can be used, it needs to be
-[initialized](/api/pipe#initialize). This function recieves either the full 
+[initialized](/api/pipe#initialize). This function receives either the full
-training data set, or a representative sample. The training data can be used 
+training data set, or a representative sample. This data set can be used to
-to deduce all relevant labels. Alternatively, a list of labels can be provided, 
+deduce all relevant labels. Alternatively, a list of labels can be provided, or
-or a script can call `rel_component.add_label()` to add each label separately.
+a script can call `rel_component.add_label()` directly.
-The number of labels will define the output dimensionality of the network, 
+The number of labels defines the output dimensionality of the network, and will
-and will be used to do 
+be used to do [shape inference](https://thinc.ai/docs/usage-models#validation)
-[shape inference](https://thinc.ai/docs/usage-models#validation) throughout 
+throughout the layers of the neural network. This is triggered by calling
-the layers of the neural network. This is triggerd by calling `model.initialize`.
+`model.initialize`.
 ```python
 ### {highlight="12,18,22"}
 from itertools import islice
 def initialize(
@@ -667,17 +670,20 @@ def initialize(
    self.model.initialize(X=doc_sample, Y=label_sample)
 ```
-The `initialize` method will be triggered whenever this component is part of an 
+The `initialize` method is triggered whenever this component is part of an `nlp`
-`nlp` pipeline, and `nlp.initialize()` is invoked. After doing so, the pipeline 
+pipeline, and [`nlp.initialize()`](/api/language#initialize) is invoked. After
-component and its internal model can be trained and used to make predictions.
+doing so, the pipeline component and its internal model can be trained and used
 to make predictions.
-During training the function [`update`](/api/pipe#update) is invoked which delegates to 
+During training, the function [`update`](/api/pipe#update) is invoked which
-[`self.model.begin_update`](https://thinc.ai/docs/api-model#begin_update) and 
+delegates to
-needs a function [`get_loss`](/api/pipe#get_loss) that will calculate the 
+[`self.model.begin_update`](https://thinc.ai/docs/api-model#begin_update) and a
-loss for a batch of examples, as well as the gradient of loss that will be used to update 
+[`get_loss`](/api/pipe#get_loss) function that calculate the loss for a batch of
-the weights of the model layers.
+examples, as well as the gradient of loss that will be used to update the
 weights of the model layers.
 ```python
 ### {highlight="12-14"}
 def update(
    self,
    examples: Iterable[Example],
@@ -697,13 +703,13 @@ def update(
    return losses
 ```
-Thinc provides some [loss functions](https://thinc.ai/docs/api-loss) that can be used 
+Thinc provides several [loss functions](https://thinc.ai/docs/api-loss) that can
-for the implementation of the `get_loss` function.
+be used for the implementation of the `get_loss` function.
-When the internal model is trained, the component can be used to make novel predictions. 
+When the internal model is trained, the component can be used to make novel
-The [`predict`](/api/pipe#predict) function needs to be implemented for each
+predictions. The [`predict`](/api/pipe#predict) function needs to be implemented
-subclass of `Pipe`. In our case, we can simply delegate to the internal model's
+for each subclass of `Pipe`. In our case, we can simply delegate to the internal
-[predict](https://thinc.ai/docs/api-model#predict) function:
+model's [predict](https://thinc.ai/docs/api-model#predict) function:
 ```python
 def predict(self, docs: Iterable[Doc]) -> Floats2d:
@@ -711,24 +717,24 @@ def predict(self, docs: Iterable[Doc]) -> Floats2d:
    return self.model.ops.asarray(predictions)
 ```
-The other method that needs to be implemented, is
+The final method that needs to be implemented, is
-[`set_annotations`](/api/pipe#set_annotations). It takes the predicted scores,
+[`set_annotations`](/api/pipe#set_annotations). This function takes the
-and modifies the given `Doc` object in place to hold the predictions. For our
+predictions, and modifies the given `Doc` object in place to store them. For our
-relation extraction component, we'll store the data as a dictionary in a custom
+relation extraction component, we store the data as a dictionary in a custom
 extension attribute `doc._.rel`. As keys, we represent the candidate pair by the
 start offsets of each entity, as this defines an entity pair uniquely within one
 document.
-To interpret the scores predicted by the REL model correctly, we need to 
+To interpret the scores predicted by the REL model correctly, we need to refer
-refer to the model's `get_candidates` function that originally defined which 
+to the model's `get_candidates` function that defined which pairs of entities
-pairs of entities would be run through the model, so that the scores can be 
+were relevant candidates, so that the predictions can be linked to those exact
-related to those exact entities:
+entities:
 > #### Example output
 >
 > ```python
 > doc = nlp("Amsterdam is the capital of the Netherlands.")
-> print(f"spans: {[(e.start, e.text, e.label_) for e in doc.ents]}")
+> print(f"spans: [(e.start, e.text, e.label_) for e in doc.ents]")
 > for value, rel_dict in doc._.rel.items():
 >     print(f"{value}: {rel_dict}")
 > ```
@@ -740,6 +746,7 @@ related to those exact entities:
 > ```
 ```python
 ###  {highlight="5-6,10"}
 def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d):
    c = 0
    get_candidates = self.model.attrs["get_candidates"]
@@ -753,8 +760,8 @@ def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d):
            c += 1
 ```
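
The bookkeeping behind `set_annotations` (one running counter `c` over the rows of the prediction matrix, and a dictionary keyed by the start offsets of each entity pair) can be sketched in plain Python. This is an illustration under stated assumptions, not the elided body of the method above: `link_predictions` is a hypothetical helper, the label names and scores are invented, and candidate pairs are given directly as `(start1, start2)` tuples instead of `Span` objects.

```python
from typing import Dict, List, Sequence, Tuple

Offsets = Tuple[int, int]


def link_predictions(
    candidates_per_doc: List[List[Offsets]],
    predictions: Sequence[Sequence[float]],
    labels: List[str],
) -> List[Dict[Offsets, Dict[str, float]]]:
    """Map each row of `predictions` back to the entity pair that produced it."""
    rel_per_doc = []
    c = 0  # row index into predictions, shared across all documents
    for candidates in candidates_per_doc:
        rel: Dict[Offsets, Dict[str, float]] = {}
        for offsets in candidates:
            # One row per candidate pair, one score per label.
            rel[offsets] = dict(zip(labels, predictions[c]))
            c += 1
        rel_per_doc.append(rel)
    return rel_per_doc


# Invented labels and scores, purely for illustration.
labels = ["CAPITAL_OF", "LOCATED_IN"]
candidates_per_doc = [[(0, 6), (6, 0)], [(2, 5)]]
predictions = [[0.9, 0.8], [0.1, 0.2], [0.3, 0.7]]
rels = link_predictions(candidates_per_doc, predictions, labels)
```

The counter only lines up if the candidates are regenerated in the same order that was used to build the input tensor, which is why the component reuses the model's own `get_candidates` function.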
-Under the hood, when the pipe is applied to a document, it will delegate to these 
+Under the hood, when the pipe is applied to a document, it delegates to the
-two methods: 
+`predict` and `set_annotations` functions:
 ```python
 def __call__(self, Doc doc):
@@ -765,9 +772,8 @@ def __call__(self, Doc doc):
 Once our `Pipe` subclass is fully implemented, we can
 [register](http://localhost:8000/usage/processing-pipelines#custom-components-factories)
-the component with the 
+the component with the `Language.factory` decorator. This enables the creation
-`Language.factory` decorator. This will enable the creation of the component with 
+of the component with `nlp.add_pipe`, or via the config.
 `nlp.add_pipe`, or via the config.
 > ```
 >