mirror of https://github.com/explosion/spaCy.git
Remove transformers model max length section (#6807)
parent ffc371350a
commit 61c9f8bf24
@@ -113,8 +113,7 @@ end of the pipeline and after all other components.
Split tokens longer than a minimum length into shorter tokens. Intended for use
with transformer pipelines where long spaCy tokens lead to input text that
exceed the transformer model max length. See
[managing transformer model max length limitations](/usage/embeddings-transformers#transformer-max-length).
exceed the transformer model max length.

> #### Example
>

@@ -481,50 +481,6 @@ custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.

### Managing transformer model max length limitations {#transformer-max-length}

Many transformer models have a limit on the maximum number of tokens that the
model can process, for example BERT models are limited to 512 tokens. This limit
refers to the number of transformer tokens (BPE, WordPiece, etc.), not the
number of spaCy tokens.
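
As a rough illustration of why the two counts differ, the sketch below applies greedy longest-match subword splitting, the general idea behind WordPiece, to two whitespace-separated tokens. The `toy_wordpiece` helper and the tiny vocabulary are hypothetical and made up for this example; a real transformer tokenizer ships its own vocabulary and also handles casing, punctuation and special tokens.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (illustrative only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No subword matched: the whole word maps to the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary, chosen only to make the example work.
vocab = {"the", "un", "##believ", "##able"}
words = ["the", "unbelievable"]  # stand-in for two spaCy tokens
pieces = [p for w in words for p in toy_wordpiece(w, vocab)]
print(len(words), "word tokens ->", len(pieces), "subword tokens:", pieces)
```

Here two word-level tokens become four subword tokens, so a text that looks short in spaCy tokens can still exceed a limit that is counted in transformer tokens.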

To be able to process longer texts, the spaCy [`transformer`](/api/transformer)
component uses [`span_getters`](/api/transformer#span_getters) to convert a
batch of [`Doc`](/api/doc) objects into lists of [`Span`](/api/span) objects. A
span may correspond to a doc (for `doc_spans`), a sentence (for `sent_spans`) or
a window of spaCy tokens (`strided_spans`). If a single span corresponds to more
transformer tokens than the transformer model supports, the spaCy pipeline can't
process the text because some spaCy tokens would be left without an analysis.

In general, it is up to the transformer pipeline user to manage the input texts
so that the model max length is not exceeded. If you're training a **new
pipeline**, you have a number of options to handle the max length limit:

- Use `doc_spans` with short texts only
- Use `sent_spans` with short sentences only
- For `strided_spans`, lower the `window` size to be short enough for your input
  texts (and don't forget to lower the `stride` correspondingly)
- Implement a [custom span getter](#transformers-training-custom-settings)
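
To see how `window` and `stride` interact, here is a minimal pure-Python sketch that computes the token index ranges a strided span getter could produce. The `strided_windows` helper is hypothetical and simplified; it is not the implementation used by `spacy-transformers`, only an illustration of the windowing idea.

```python
def strided_windows(n_tokens, window, stride):
    """Return (start, end) token index pairs covering n_tokens (illustrative)."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            # The last window reached the end of the doc.
            break
        start += stride
    return spans


# 10 tokens, windows of 4 tokens, moving forward 3 tokens at a time.
print(strided_windows(10, window=4, stride=3))
```

Each window holds at most `window` spaCy tokens, so a smaller `window` means fewer transformer tokens per span. In this sketch, a `stride` larger than `window` would leave tokens between windows uncovered, which is why the two values should be lowered together.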

You may still run into the max length limit if a single spaCy token is very
long, like a long URL or a noisy string, or if you're using a **pretrained
pipeline** like `en_core_web_trf` with a fixed `window` size for
`strided_spans`. In this case, you need to modify either your texts or your
pipeline so that you have shorter spaCy tokens. Some options:

- Preprocess your texts to clean up noise and split long tokens with whitespace
- Add a `token_splitter` to the beginning of your pipeline to break up
  tokens that are longer than a specified length:

```python
config={"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
```

In this example, tokens that are at least 20 characters long will be split up
into smaller tokens of 5 characters each, resulting in strided spans that
correspond to fewer transformer tokens.
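
The splitting rule described above can be sketched in plain Python. The `split_long_token` function below is a hypothetical stand-in that only mirrors the described behavior on a string; it is not the `token_splitter` component itself, which operates on `Doc` objects.

```python
def split_long_token(text, min_length=20, split_length=5):
    """Split a token of at least min_length characters into chunks (illustrative)."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if len(text) < min_length:
        # Short tokens are left untouched.
        return [text]
    return [text[i : i + split_length] for i in range(0, len(text), split_length)]


long_token = "a" * 23  # stand-in for a long URL or noisy string
print(split_long_token(long_token))
print(split_long_token("short"))
```

With these settings, the 23-character string is split into four chunks of 5 characters and one of 3, while shorter tokens pass through unchanged.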

## Static vectors {#static-vectors}

If your pipeline includes a **word vectors table**, you'll be able to use the