diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index c32651e02..0b51487ed 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -25,24 +25,23 @@ work out-of-the-box. -This pipeline component lets you use transformer models in your pipeline. -Supports all models that are available via the +This pipeline component lets you use transformer models in your pipeline. It +supports all models that are available via the [HuggingFace `transformers`](https://huggingface.co/transformers) library. Usually you will connect subsequent components to the shared transformer using the [TransformerListener](/api/architectures#TransformerListener) layer. This works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer. -The component assigns the output of the transformer to the `Doc`'s extension -attributes. We also calculate an alignment between the word-piece tokens and the -spaCy tokenization, so that we can use the last hidden states to set the -`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy -token, the spaCy token receives the sum of their values. To access the values, -you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The -package also adds the function registries [`@span_getters`](#span_getters) and -[`@annotation_setters`](#annotation_setters) with several built-in registered -functions. For more details, see the -[usage documentation](/usage/embeddings-transformers). +We calculate an alignment between the word-piece tokens and the spaCy +tokenization, so that we can use the last hidden states to store the information +on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the +spaCy token receives the sum of their values. 
By default, the information is +written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but +you can implement a custom [`@annotation_setter`](#annotation_setters) to change +this behaviour. The package also adds the function registry +[`@span_getters`](#span_getters) with several built-in registered functions. For +more details, see the [usage documentation](/usage/embeddings-transformers). ## Config and implementation {#config} @@ -61,11 +60,11 @@ architectures and their arguments and hyperparameters. > nlp.add_pipe("transformer", config=DEFAULT_CONFIG) > ``` -| Setting | Description | -| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | -| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | -| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | +| Setting | Description | +| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. 
~~int~~ | +| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter`, which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | ```python https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py @@ -518,19 +517,23 @@ right context. ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"} -Annotation setters are functions that that take a batch of `Doc` objects and a -[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set -additional annotations on the `Doc`, e.g. to set custom or built-in attributes. -You can register custom annotation setters using the -`@registry.annotation_setters` decorator. +Annotation setters are functions that take a batch of `Doc` objects and a +[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the +annotations on the `Doc`, e.g. to set custom or built-in attributes. You can +register custom annotation setters using the `@registry.annotation_setters` +decorator. The default annotation setter used by the `Transformer` pipeline +component is `trfdata_setter`, which sets the custom `Doc._.trf_data` +attribute. 
> #### Example > > ```python -> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1") -> def configure_null_annotation_setter() -> Callable: +> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1") +> def configure_trfdata_setter() -> Callable: > def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: -> pass +> doc_data = list(trf_data.doc_data) +> for doc, data in zip(docs, doc_data): +> doc._.trf_data = data > > return setter > ``` @@ -542,9 +545,9 @@ You can register custom annotation setters using the The following built-in functions are available: -| Name | Description | -| ---------------------------------------------- | ------------------------------------- | -| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. | +| Name | Description | +| -------------------------------------- | ------------------------------------------------------------- | +| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. 
| ## Custom attributes {#custom-attributes} diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index 62336a826..fbae1da82 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -299,7 +299,7 @@ component: > > ```python > from spacy_transformers import Transformer, TransformerModel -> from spacy_transformers.annotation_setters import null_annotation_setter +> from spacy_transformers.annotation_setters import configure_trfdata_setter > from spacy_transformers.span_getters import get_doc_spans > > trf = Transformer( @@ -309,7 +309,7 @@ component: > get_spans=get_doc_spans, > tokenizer_config={"use_fast": True}, > ), -> annotation_setter=null_annotation_setter, +> annotation_setter=configure_trfdata_setter(), > max_batch_items=4096, > ) > ``` @@ -329,7 +329,7 @@ tokenizer_config = {"use_fast": true} @span_getters = "doc_spans.v1" [components.transformer.annotation_setter] -@annotation_setters = "spacy-transformers.null_annotation_setter.v1" +@annotation_setters = "spacy-transformers.trfdata_setter.v1" ```
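
The setter pattern introduced in this change can be tried without a transformer model. The sketch below is only an illustration: it reuses the body of `configure_trfdata_setter` from the diff, but `make_doc` and the `batch` object are hypothetical stand-ins built from `types.SimpleNamespace`, not the real `Doc` or `FullTransformerBatch` classes, and the registry decorator is left out so that the snippet runs with the standard library alone.

```python
from types import SimpleNamespace
from typing import Callable, List


def configure_trfdata_setter() -> Callable:
    """Factory that returns the setter, mirroring the pattern in the diff."""

    def setter(docs: List, trf_data) -> None:
        # `doc_data` is expected to yield one entry per document, in batch order.
        doc_data = list(trf_data.doc_data)
        for doc, data in zip(docs, doc_data):
            doc._.trf_data = data

    return setter


def make_doc(text: str) -> SimpleNamespace:
    # Hypothetical stand-in for a spaCy `Doc` with a `_.trf_data` extension.
    return SimpleNamespace(text=text, _=SimpleNamespace(trf_data=None))


docs = [make_doc("first"), make_doc("second")]
# Hypothetical stand-in for a `FullTransformerBatch`; only `doc_data` is modelled.
batch = SimpleNamespace(doc_data=["data-for-first", "data-for-second"])

# The factory is called once to obtain the setter, as in the updated usage example.
setter = configure_trfdata_setter()
setter(docs, batch)
```

Note that the component receives the result of calling the factory (`configure_trfdata_setter()`), not the factory itself, which is why the usage example in this diff changes from passing a bare function to passing a call.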