various fixes

This commit is contained in:
svlandeg 2020-08-27 19:24:44 +02:00
parent 329e490560
commit 556e975a30
2 changed files with 38 additions and 37 deletions


The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config). See the
[model architectures](/api/architectures#transformers) documentation for details
on the transformer architectures and their arguments and hyperparameters.
> #### Example
>
> ```python
> from spacy_transformers import Transformer, DEFAULT_CONFIG
>
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
> ```
| Setting | Description |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ |
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs to store the annotations on the `Doc`. Defaults to `trfdata_setter` which sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ |
```python
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
```
Construct a `Transformer` component. One or more subsequent spaCy components can
use the transformer outputs as features in their models, with gradients
backpropagated to the single shared weights. The activations from the
transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
attribute by default, but you can provide a different `annotation_setter` to
customize this behaviour. In your application, you would normally use a shortcut
and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#create_pipe).
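In practice, that shortcut looks like the following; a minimal sketch, assuming
`spacy-transformers` is installed so the `"transformer"` factory is registered:

```python
import spacy

nlp = spacy.blank("en")
# The shortcut: instantiate the component by its string name instead of
# constructing Transformer(...) directly
trf = nlp.add_pipe("transformer")
print(nlp.pipe_names)  # ['transformer']
```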
| Name | Description |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ |
| `annotation_setter` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. By default, the function `trfdata_setter` sets the `Doc._.trf_data` attribute. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ |
| _keyword-only_ | |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ |
## Transformer.\_\_call\_\_ {#call tag="method"}
## Transformer.set_annotations {#set_annotations tag="method"}
Assign the extracted features to the Doc objects. By default, the
[`TransformerData`](/api/transformer#transformerdata) object is written to the
[`Doc._.trf_data`](#custom-attributes) attribute. This behaviour can be
customized by providing a different `annotation_setter` argument upon
construction.
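As an illustration, a sketch of the usual call pattern, mirroring spaCy's other
trainable pipes; it assumes the `nlp` and `trf` objects from the sketch above
and a model that has already been initialized:

```python
docs = [nlp.make_doc("A sentence."), nlp.make_doc("Another sentence.")]
# predict extracts the activations without modifying the docs;
# set_annotations then writes them back, by default to Doc._.trf_data
scores = trf.predict(docs)
trf.set_annotations(docs, scores)
```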
## TransformerData {#transformerdata tag="dataclass"}

Transformer tokens and outputs for one `Doc` object. The transformer models
return tensors that refer to a whole padded batch of documents. These tensors
are wrapped into the
[FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The
`FullTransformerBatch` then splits out the per-document data, which is handled
by this class. Instances of this class are typically assigned to the
[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute.
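For instance, a sketch of inspecting the per-document data after processing;
the `tensors` attribute is an assumption from `spacy-transformers`, not
documented on this page:

```python
doc = nlp("This is a sentence.")
data = doc._.trf_data           # a TransformerData instance
# the raw transformer outputs for this document (assumed attribute name)
print(data.tensors[-1].shape)
```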
## Span getters {#span_getters}

The spans are allowed to overlap, and you can also omit sections of the Doc if
they are not relevant.
Span getters can be referenced in the `[components.transformer.model.get_spans]`
block of the config to customize the sequences processed by the transformer. You
can also register
[custom span getters](/usage/embeddings-transformers#transformers-training-custom-settings)
using the `@spacy.registry.span_getters` decorator.
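A hedged sketch of such a registration; the name `"custom_sent_spans"` and the
sentence-per-span strategy are illustrative, not prescribed by this page:

```python
import spacy

@spacy.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
    def get_sent_spans(docs):
        # one Span per sentence, so the transformer processes
        # sentence-sized sequences instead of whole documents
        return [list(doc.sents) for doc in docs]
    return get_sent_spans
```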
## Annotation setters {#annotation_setters}

Annotation setters are functions that take a batch of `Doc` objects and a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional
annotations on the `Doc`, e.g. to set custom or built-in attributes. You can
register custom annotation setters using the `@registry.annotation_setters`
decorator. The default annotation setter used by the `Transformer` pipeline
component is `trfdata_setter`, which sets the custom `Doc._.trf_data` attribute.
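For illustration, a hedged sketch of a custom setter that mimics what
`trfdata_setter` is described to do; the registered name and the `doc_data`
attribute are assumptions, not taken from this page:

```python
from spacy import registry

@registry.annotation_setters("custom_trf_data_setter.v1")
def configure_custom_setter():
    def setter(docs, trf_data):
        # trf_data is a FullTransformerBatch; doc_data (assumed attribute)
        # yields one TransformerData per Doc in the batch
        for doc, data in zip(docs, trf_data.doc_data):
            doc._.trf_data = data
    return setter
```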
## Custom attributes {#custom-attributes}
The component sets the following
[custom extension attributes](/usage/processing-pipeline#custom-components-attributes):
| Name | Description |
| ---------------- | ------------------------------------------------------------------------ |
| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ |


The same idea applies to task models that power the **downstream components**.
Most of spaCy's built-in model creation functions support a `tok2vec` argument,
which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This
is where we'll plug in our transformer model, using the
[TransformerListener](/api/architectures#TransformerListener) layer, which
sneakily delegates to the `Transformer` pipeline component.
```ini
### config.cfg (excerpt) {highlight="12"}
grad_factor = 1.0
@layers = "reduce_mean.v1"
```
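As a hedged sketch of how such a block fits together, the relevant sections
might look as follows; the `ner` component name is an assumption here:

```ini
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```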
The [TransformerListener](/api/architectures#TransformerListener) layer expects
a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the
argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This
layer determines how the vector for each spaCy token will be computed from the
zero or more source rows the token is aligned against. Here we use the
[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
averages the wordpiece rows. We could instead use
[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
function you write yourself.
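To make the pooling concrete, here is a small sketch using Thinc directly, not
taken from these docs: each output row is the mean of one token's aligned
wordpiece rows.

```python
import numpy
from thinc.api import reduce_mean
from thinc.types import Ragged

# three wordpiece rows; the first token aligns to two of them, the second to one
wordpiece_rows = numpy.asarray([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]], dtype="float32")
lengths = numpy.asarray([2, 1], dtype="int32")
pooled = reduce_mean().predict(Ragged(wordpiece_rows, lengths))
print(pooled)  # [[2. 2.] [5. 5.]]
```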