Remove transformers model max length section (#6807)

Adriane Boyd 2021-01-25 12:59:34 +01:00 committed by GitHub
parent ffc371350a
commit 61c9f8bf24
2 changed files with 1 addition and 46 deletions


@@ -113,8 +113,7 @@ end of the pipeline and after all other components.
 Split tokens longer than a minimum length into shorter tokens. Intended for use
 with transformer pipelines where long spaCy tokens lead to input text that
-exceed the transformer model max length. See
-[managing transformer model max length limitations](/usage/embeddings-transformers#transformer-max-length).
+exceed the transformer model max length.
 > #### Example
 >


@@ -481,50 +481,6 @@ custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.

### Managing transformer model max length limitations {#transformer-max-length}

Many transformer models have a limit on the maximum number of tokens that the
model can process; for example, BERT models are limited to 512 tokens. This limit
refers to the number of transformer tokens (BPE, WordPiece, etc.), not the
number of spaCy tokens.
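
A quick way to see the difference is to count both kinds of tokens for the same
text. The sketch below assumes the `transformers` library is installed and uses
`bert-base-uncased` and the sample text purely as examples:

```python
import spacy
from transformers import AutoTokenizer

# Example text and model name are placeholders for illustration
text = "Transformer tokenizers often split rare words into several subword pieces."
nlp = spacy.blank("en")
wordpiece_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print("spaCy tokens:", len(nlp(text)))                               # rule-based spaCy tokens
print("WordPiece tokens:", len(wordpiece_tokenizer.tokenize(text)))  # usually more than spaCy tokens
```

The model's limit applies to the transformer token count, plus any special
tokens such as `[CLS]` and `[SEP]` that the model adds.
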
To be able to process longer texts, the spaCy [`transformer`](/api/transformer)
component uses [`span_getters`](/api/transformer#span_getters) to convert a
batch of [`Doc`](/api/doc) objects into lists of [`Span`](/api/span) objects. A
span may correspond to a doc (for `doc_spans`), a sentence (for `sent_spans`) or
a window of spaCy tokens (`strided_spans`). If a single span corresponds to more
transformer tokens than the transformer model supports, the spaCy pipeline can't
process the text because some spaCy tokens would be left without an analysis.
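
Conceptually, a span getter is just a callable that takes a batch of `Doc`
objects and returns one list of `Span` objects per doc. A minimal sketch (not
the library's implementation) that behaves like `doc_spans` could look like
this:

```python
from typing import List
from spacy.tokens import Doc, Span

def whole_doc_spans(docs: List[Doc]) -> List[List[Span]]:
    # One span per doc covering all of its tokens, similar in spirit to doc_spans
    return [[doc[:]] for doc in docs]
```
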
In general, it is up to the transformer pipeline user to manage the input texts
so that the model max length is not exceeded. If you're training a **new
pipeline**, you have a number of options to handle the max length limit:

- Use `doc_spans` with short texts only
- Use `sent_spans` with short sentences only
- For `strided_spans`, lower the `window` size to be short enough for your input
  texts (and don't forget to lower the `stride` correspondingly); an illustrative
  sketch follows this list
- Implement a [custom span getter](#transformers-training-custom-settings)
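
As an illustration of the `strided_spans` option, the `window` and `stride` can
be overridden when the transformer component is added. This is only a sketch:
it assumes `spacy-transformers` is installed, and the model name and the exact
sizes are placeholders:

```python
import spacy

nlp = spacy.blank("en")
config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v1",
        "name": "bert-base-uncased",  # placeholder model name
        "tokenizer_config": {"use_fast": True},
        "get_spans": {
            "@span_getters": "spacy-transformers.strided_spans.v1",
            "window": 64,   # fewer spaCy tokens per span than the default
            "stride": 48,   # lowered together with the window
        },
    }
}
nlp.add_pipe("transformer", config=config)
```
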
You may still run into the max length limit if a single spaCy token is very
long, like a long URL or a noisy string, or if you're using a **pretrained
pipeline** like `en_core_web_trf` with a fixed `window` size for
`strided_spans`. In this case, you need to modify either your texts or your
pipeline so that you have shorter spaCy tokens. Some options:

- Preprocess your texts to clean up noise and split long tokens with whitespace
- Add a `token_splitter` to the beginning of your pipeline to break up
tokens that are longer than a specified length:

```python
# Split any token of at least 20 characters into pieces of 5 characters
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
```

In this example, tokens that are at least 20 characters long will be split up
into smaller tokens of 5 characters each, resulting in strided spans that
correspond to fewer transformer tokens.
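
To check the effect, something along these lines can be run against a blank
pipeline; the overly long URL is just an illustration:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("token_splitter", config={"min_length": 20, "split_length": 5}, first=True)

doc = nlp("See https://example.com/a/very/long/path/that/tokenizes/as/one/token for details.")
print([token.text for token in doc])  # the long URL token is split into 5-character pieces
```
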

## Static vectors {#static-vectors}

If your pipeline includes a **word vectors table**, you'll be able to use the