Merge pull request #5996 from svlandeg/feature/docs-trf-examples [ci skip]

custom transformer examples
Ines Montani 2020-08-29 12:37:57 +02:00 committed by GitHub
commit b6ee284376
1 changed file with 45 additions and 27 deletions


@@ -179,7 +179,7 @@ interoperates with [PyTorch](https://pytorch.org) and the
giving you access to thousands of pretrained models for your pipelines. There
are many [great guides](http://jalammar.github.io/illustrated-transformer/) to
transformer models, but for practical purposes, you can simply think of them as
-a drop-in replacement that let you achieve **higher accuracy** in exchange for
+drop-in replacements that let you achieve **higher accuracy** in exchange for
**higher training and runtime costs**.
### Setup and installation {#transformers-installation}
@@ -225,7 +225,7 @@ transformers as subnetworks directly, you can also use them via the
![The processing pipeline with the transformer component](../images/pipeline_transformer.svg)
-The `Transformer` component sets the
+By default, the `Transformer` component sets the
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
which lets you access the transformer's outputs at runtime.
@@ -249,8 +249,8 @@ for doc in nlp.pipe(["some text", "some other text"]):
    tokvecs = doc._.trf_data.tensors[-1]
```
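For example, you can inspect what the component stored directly. A minimal sketch, assuming a pretrained transformer pipeline is installed; the number and shapes of the tensors depend on the underlying model:

```python
import spacy

nlp = spacy.load("en_core_trf_lg")
doc = nlp("This is a text")
# Doc._.trf_data holds the transformer output for this doc; each entry in
# .tensors is an array produced by the underlying model
for i, tensor in enumerate(doc._.trf_data.tensors):
    print(i, tensor.shape)
```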
-You can also customize how the [`Transformer`](/api/transformer) component sets
-annotations onto the [`Doc`](/api/doc), by customizing the `annotation_setter`.
+You can customize how the [`Transformer`](/api/transformer) component sets
+annotations onto the [`Doc`](/api/doc) by changing the `annotation_setter`.
This callback will be called with the raw input and output data for the whole
batch, along with the batch of `Doc` objects, allowing you to implement whatever
you need. The annotation setter is called with a batch of [`Doc`](/api/doc)
@@ -259,13 +259,15 @@ containing the transformers data for the batch.
```python
def custom_annotation_setter(docs, trf_data):
-    # TODO:
-    ...
+    doc_data = list(trf_data.doc_data)
+    for doc, data in zip(docs, doc_data):
+        doc._.custom_attr = data

nlp = spacy.load("en_core_trf_lg")
nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter
doc = nlp("This is a text")
-print() # TODO:
+assert isinstance(doc._.custom_attr, TransformerData)
+print(doc._.custom_attr.tensors)
```
### Training usage {#transformers-training}
@@ -299,7 +301,7 @@ component:
>
> ```python
> from spacy_transformers import Transformer, TransformerModel
-> from spacy_transformers.annotation_setters import null_annotation_setter
+> from spacy_transformers.annotation_setters import configure_trfdata_setter
> from spacy_transformers.span_getters import get_doc_spans
>
> trf = Transformer(
@@ -309,7 +311,7 @@ component:
>         get_spans=get_doc_spans,
>         tokenizer_config={"use_fast": True},
>     ),
->     annotation_setter=null_annotation_setter,
+>     annotation_setter=configure_trfdata_setter(),
>     max_batch_items=4096,
> )
> ```
@@ -329,7 +331,7 @@ tokenizer_config = {"use_fast": true}
@span_getters = "doc_spans.v1"
[components.transformer.annotation_setter]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"
@annotation_setters = "spacy-transformers.trfdata_setter.v1"
```
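To plug in your own setter instead, you can register a function that creates it, so the config can refer to it by name. A sketch, assuming the `annotation_setters` registry mirrors the `span_getters` registry shown below; the name `custom_annotation_setter.v1` and the `custom_attr` extension are illustrative:

```python
import spacy_transformers
from spacy.tokens import Doc

# register the extension attribute the setter writes to (illustrative name)
Doc.set_extension("custom_attr", default=None)

@spacy_transformers.registry.annotation_setters("custom_annotation_setter.v1")
def configure_custom_annotation_setter():
    def annotation_setter(docs, trf_data):
        # write each doc's share of the batch output to the custom attribute
        for doc, data in zip(docs, trf_data.doc_data):
            doc._.custom_attr = data

    return annotation_setter
```

The config block above could then reference the custom setter with `@annotation_setters = "custom_annotation_setter.v1"`.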
@@ -343,9 +345,9 @@ in a block starts with `@`, it's **resolved to a function** and all other
settings are passed to the function as arguments. In this case, `name`,
`tokenizer_config` and `get_spans`.
-`get_spans` is a function that takes a batch of `Doc` object and returns lists
+`get_spans` is a function that takes a batch of `Doc` objects and returns lists
of potentially overlapping `Span` objects for the transformer to process. Several
-[built-in functions](/api/transformer#span-getters) are available for example,
+[built-in functions](/api/transformer#span_getters) are available, for example,
to process the whole document or individual sentences. When the config is
resolved, the function is created and passed into the model as an argument.
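Concretely, a span getter maps a batch of `Doc` objects to the lists of spans the transformer should process. As a rough sketch of the whole-document case (illustrative, not the library's actual implementation):

```python
def get_doc_spans(docs):
    # one span per Doc, covering the entire document
    return [[doc[:]] for doc in docs]
```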
@@ -366,13 +368,17 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
training. To change any of the functions, like the span getter, you can replace
the name of the referenced function e.g. `@span_getters = "sent_spans.v1"` to
process sentences. You can also register your own functions using the
-`span_getters` registry:
+`span_getters` registry. For instance, the following custom function returns
+`Span` objects following sentence boundaries, unless a sentence exceeds a
+certain number of tokens, in which case subsentences of at most `max_length`
+tokens are returned.
> #### config.cfg
>
> ```ini
> [components.transformer.model.get_spans]
> @span_getters = "custom_sent_spans"
+> max_length = 25
> ```
```python
@@ -380,12 +386,23 @@ process sentences. You can also register your own functions using the
import spacy_transformers
@spacy_transformers.registry.span_getters("custom_sent_spans")
-def configure_custom_sent_spans():
-    # TODO: write custom example
-    def get_sent_spans(docs):
-        return [list(doc.sents) for doc in docs]
-    return get_sent_spans
+def configure_custom_sent_spans(max_length: int):
+    def get_custom_sent_spans(docs):
+        spans = []
+        for doc in docs:
+            spans.append([])
+            for sent in doc.sents:
+                start = 0
+                end = max_length
+                while end <= len(sent):
+                    spans[-1].append(sent[start:end])
+                    start += max_length
+                    end += max_length
+                if start < len(sent):
+                    spans[-1].append(sent[start:len(sent)])
+        return spans
+    return get_custom_sent_spans
```
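To sanity-check the getter outside of training, you can call it on a batch of docs directly. A quick sketch, using `configure_custom_sent_spans` from the snippet above and a sentencizer so that `doc.sents` is available:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another, somewhat longer sentence.")
get_spans = configure_custom_sent_spans(max_length=25)
spans = get_spans([doc])
# every returned span is at most max_length tokens long
assert all(len(span) <= 25 for span in spans[0])
```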
To resolve the config during training, spaCy needs to know about your custom
@@ -412,8 +429,8 @@ The same idea applies to task models that power the **downstream components**.
Most of spaCy's built-in model creation functions support a `tok2vec` argument,
which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This
is where we'll plug in our transformer model, using the
-[Tok2VecListener](/api/architectures#Tok2VecListener) layer, which sneakily
-delegates to the `Transformer` pipeline component.
+[TransformerListener](/api/architectures#TransformerListener) layer, which
+sneakily delegates to the `Transformer` pipeline component.
```ini
### config.cfg (excerpt) {highlight="12"}
@@ -428,18 +445,18 @@ maxout_pieces = 3
use_upper = false
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecListener.v1"
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[nlp.pipeline.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```
-The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a
-[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument
-`pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This layer
-determines how the vector for each spaCy token will be computed from the zero or
-more source rows the token is aligned against. Here we use the
+The [TransformerListener](/api/architectures#TransformerListener) layer expects
+a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the
+argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This
+layer determines how the vector for each spaCy token will be computed from the
+zero or more source rows the token is aligned against. Here we use the
[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
averages the wordpiece rows. We could instead use
[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
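For instance, a pooling layer that concatenates the mean- and max-pooled rows could be registered with Thinc and referenced from the config. A minimal sketch; the registry name `custom_pooling.v1` is illustrative, and note that the concatenation doubles the width of the token vectors:

```python
import thinc
from thinc.api import Model, concatenate, reduce_max, reduce_mean
from thinc.types import Floats2d, Ragged

@thinc.registry.layers("custom_pooling.v1")
def configure_custom_pooling() -> Model[Ragged, Floats2d]:
    # pool each token's wordpiece rows two ways and concatenate the results
    return concatenate(reduce_mean(), reduce_max())
```

In the config, the `pooling` block would then use `@layers = "custom_pooling.v1"` instead of `reduce_mean.v1`.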
@@ -535,8 +552,9 @@ vectors, but combines them via summation with a smaller table of learned
embeddings.
```python
-from thinc.api import add, chain, remap_ids, Embed
+from thinc.api import add, chain, remap_ids, Embed, FeatureExtractor
from spacy.ml.staticvectors import StaticVectors
+from spacy.util import registry
@registry.architectures("my_example.MyEmbedding.v1")
def MyCustomVectors(