mirror of https://github.com/explosion/spaCy.git
adjust references to null_annotation_setter to trfdata_setter
parent ec069627fe
commit 559b65f2e0
@@ -25,24 +25,23 @@ work out-of-the-box.
 
 </Infobox>
 
-This pipeline component lets you use transformer models in your pipeline.
-Supports all models that are available via the
+This pipeline component lets you use transformer models in your pipeline. It
+supports all models that are available via the
 [HuggingFace `transformers`](https://huggingface.co/transformers) library.
 Usually you will connect subsequent components to the shared transformer using
 the [TransformerListener](/api/architectures#TransformerListener) layer. This
 works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
 [Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.
 
-The component assigns the output of the transformer to the `Doc`'s extension
-attributes. We also calculate an alignment between the word-piece tokens and the
-spaCy tokenization, so that we can use the last hidden states to set the
-`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
-token, the spaCy token receives the sum of their values. To access the values,
-you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The
-package also adds the function registries [`@span_getters`](#span_getters) and
-[`@annotation_setters`](#annotation_setters) with several built-in registered
-functions. For more details, see the
-[usage documentation](/usage/embeddings-transformers).
+We calculate an alignment between the word-piece tokens and the spaCy
+tokenization, so that we can use the last hidden states to store the information
+on the `Doc`. When multiple word-piece tokens align to the same spaCy token, the
+spaCy token receives the sum of their values. By default, the information is
+written to the [`Doc._.trf_data`](#custom-attributes) extension attribute, but
+you can implement a custom [`@annotation_setter`](#annotation_setters) to change
+this behaviour. The package also adds the function registry
+[`@span_getters`](#span_getters) with several built-in registered functions. For
+more details, see the [usage documentation](/usage/embeddings-transformers).
 
 ## Config and implementation {#config}
 
@ -61,11 +60,11 @@ architectures and their arguments and hyperparameters.
|
|||
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
|
||||
> ```
|
||||
|
||||
| Setting | Description |
|
||||
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
|
||||
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs can set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||
| Setting | Description |
|
||||
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
|
||||
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.transformer_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
|
||||
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
|
||||
|
@@ -518,19 +517,23 @@ right context.
 
 ## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}
 
-Annotation setters are functions that that take a batch of `Doc` objects and a
-[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set
-additional annotations on the `Doc`, e.g. to set custom or built-in attributes.
-You can register custom annotation setters using the
-`@registry.annotation_setters` decorator.
+Annotation setters are functions that take a batch of `Doc` objects and a
+[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and store the
+annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
+register custom annotation setters using the `@registry.annotation_setters`
+decorator. The default annotation setter used by the `Transformer` pipeline
+component is `trfdata_setter`, which sets the custom `Doc._.transformer_data`
+attribute.
 
 > #### Example
 >
 > ```python
-> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1")
-> def configure_null_annotation_setter() -> Callable:
+> @registry.annotation_setters("spacy-transformers.trfdata_setter.v1")
+> def configure_trfdata_setter() -> Callable:
 >     def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
->         pass
+>         doc_data = list(trf_data.doc_data)
+>         for doc, data in zip(docs, doc_data):
+>             doc._.trf_data = data
 >
 >     return setter
 > ```
@@ -542,9 +545,9 @@ You can register custom annotation setters using the
 
 The following built-in functions are available:
 
-| Name | Description |
-| ---------------------------------------------- | ------------------------------------- |
-| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. |
+| Name | Description |
+| -------------------------------------- | ------------------------------------------------------------- |
+| `spacy-transformers.trfdata_setter.v1` | Set the annotations to the custom attribute `doc._.trf_data`. |
 
 ## Custom attributes {#custom-attributes}
 
@@ -299,7 +299,7 @@ component:
 >
 > ```python
 > from spacy_transformers import Transformer, TransformerModel
-> from spacy_transformers.annotation_setters import null_annotation_setter
+> from spacy_transformers.annotation_setters import configure_trfdata_setter
 > from spacy_transformers.span_getters import get_doc_spans
 >
 > trf = Transformer(
@@ -309,7 +309,7 @@ component:
 >         get_spans=get_doc_spans,
 >         tokenizer_config={"use_fast": True},
 >     ),
->     annotation_setter=null_annotation_setter,
+>     annotation_setter=configure_trfdata_setter(),
 >     max_batch_items=4096,
 > )
 > ```
@@ -329,7 +329,7 @@ tokenizer_config = {"use_fast": true}
 @span_getters = "doc_spans.v1"
 
 [components.transformer.annotation_setter]
-@annotation_setters = "spacy-transformers.null_annotation_setter.v1"
+@annotation_setters = "spacy-transformers.trfdata_setter.v1"
 
 ```
 
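
The setter pattern that this commit documents (a factory that returns a callable, which then copies each per-document transformer output onto the matching document) can be tried out without installing spaCy. The sketch below is only an illustration under stated assumptions: `StubDoc` and `StubBatch` are hypothetical stand-ins for `Doc` and `FullTransformerBatch`, the `extensions` dict stands in for the `doc._` namespace, and registration with `@registry.annotation_setters` is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class StubDoc:
    """Hypothetical stand-in for spaCy's Doc; `extensions` mimics `doc._`."""
    text: str
    extensions: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StubBatch:
    """Hypothetical stand-in for FullTransformerBatch: one entry per doc."""
    doc_data: List[Any]


def configure_trfdata_setter() -> Callable[[List[StubDoc], StubBatch], None]:
    # Factory: returns the callable that the pipeline component would invoke.
    def setter(docs: List[StubDoc], trf_data: StubBatch) -> None:
        doc_data = list(trf_data.doc_data)
        # Pair each doc with its own slice of the batch output.
        for doc, data in zip(docs, doc_data):
            doc.extensions["trf_data"] = data

    return setter


docs = [StubDoc("first"), StubDoc("second")]
batch = StubBatch(doc_data=[[0.1, 0.2], [0.3]])
configure_trfdata_setter()(docs, batch)
```

Because the component receives the result of calling the factory rather than the factory itself, the usage example in the diff passes `configure_trfdata_setter()` with parentheses.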