Add note on batch contract for listeners (#9691)

* Add note on batch contract

Using listeners requires batches to be consistent. This is obvious if
you understand how the listener works, but it wasn't clearly stated in
the docs, and was subtle enough that the EntityLinker missed it.

There is probably a clearer way to explain what the actual requirement
is, but I figure this is a good start.

* Rewrite to clarify role of caching
Paul O'Leary McCann 2021-11-22 10:06:07 +00:00 committed by GitHub
parent 13645dcbf5
commit 52b8c2d2e0
1 changed file with 8 additions and 0 deletions

@@ -124,6 +124,14 @@ Instead of defining its own `Tok2Vec` instance, a model architecture like
[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec`
argument that connects to the shared `tok2vec` component in the pipeline.

Listeners work by caching the `Tok2Vec` output for a given batch of `Doc`s. This
means that in order for a component to work with the listener, the batch of
`Doc`s passed to the listener must be the same as the batch of `Doc`s passed to
the `Tok2Vec`. As a result, any manipulation of the `Doc`s which would affect
`Tok2Vec` output, such as to create special contexts or remove `Doc`s for which
no prediction can be made, must happen inside the model, **after** the call to
the `Tok2Vec` component.
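
The batch contract can be illustrated with a toy model of the caching
behaviour. This is a minimal sketch under stated assumptions, not spaCy's
implementation: `ToyTok2Vec`, `ToyListener` and `batch_key` are hypothetical
names, a "Doc" is just a list of token strings, and the mismatch check shown
here is only one way a listener could detect a changed batch.

```python
from typing import List, Optional, Tuple

Doc = List[str]  # toy stand-in: a "Doc" is a list of token strings
Vectors = List[List[float]]  # one list of per-token values per Doc
BatchKey = Tuple[Tuple[str, ...], ...]


def batch_key(docs: List[Doc]) -> BatchKey:
    """Identify a batch by its exact contents and order (hypothetical helper)."""
    return tuple(tuple(doc) for doc in docs)


class ToyListener:
    """Returns the output cached by the upstream component for the same batch."""

    def __init__(self) -> None:
        self._key: Optional[BatchKey] = None
        self._outputs: Optional[Vectors] = None

    def receive(self, key: BatchKey, outputs: Vectors) -> None:
        # Called by the upstream component after it has processed a batch.
        self._key = key
        self._outputs = outputs

    def __call__(self, docs: List[Doc]) -> Vectors:
        # The cache is only valid for the batch the upstream component saw.
        if self._outputs is None or batch_key(docs) != self._key:
            raise ValueError("listener received a different batch than upstream")
        return self._outputs


class ToyTok2Vec:
    """Computes toy 'vectors' (token lengths) and pushes them to listeners."""

    def __init__(self, listeners: List[ToyListener]) -> None:
        self.listeners = listeners

    def predict(self, docs: List[Doc]) -> Vectors:
        outputs = [[float(len(token)) for token in doc] for doc in docs]
        for listener in self.listeners:
            listener.receive(batch_key(docs), outputs)
        return outputs


listener = ToyListener()
tok2vec = ToyTok2Vec([listener])
docs = [["a", "cat"], [], ["dogs", "bark"]]
tok2vec.predict(docs)

# Works: the listener is called with the same batch, and empty Docs are
# dropped only afterwards, inside the downstream model.
vectors = listener(docs)
kept = [(doc, vecs) for doc, vecs in zip(docs, vectors) if doc]

# Breaks the contract: the batch is filtered before the listener is called.
try:
    listener([doc for doc in docs if doc])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

In this sketch the second call fails because the filtered batch no longer
matches the batch that populated the cache, which is why the filtering step
belongs after the listener call.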
| Name | Description |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ |