mirror of https://github.com/explosion/spaCy.git
Remove transformers model max length section (#6807)
This commit is contained in:
parent ffc371350a
commit 61c9f8bf24

@@ -113,8 +113,7 @@ end of the pipeline and after all other components.

Split tokens longer than a minimum length into shorter tokens. Intended for use
with transformer pipelines where long spaCy tokens lead to input text that
-exceeds the transformer model max length. See
-[managing transformer model max length limitations](/usage/embeddings-transformers#transformer-max-length).
+exceeds the transformer model max length.

> #### Example
>

@@ -481,50 +481,6 @@ custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.

### Managing transformer model max length limitations {#transformer-max-length}

Many transformer models have a limit on the maximum number of tokens that the
model can process; for example, BERT models are limited to 512 tokens. This
limit refers to the number of transformer tokens (BPE, WordPiece, etc.), not the
number of spaCy tokens.
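
To see how many transformer tokens a text produces, and what limit a given
model enforces, you can inspect the model's tokenizer directly. This is an
illustrative sketch, not part of the original docs; it assumes the
`transformers` library is installed and uses `bert-base-uncased` as an example:

```python
from transformers import AutoTokenizer

# Load the tokenizer that the transformer model uses.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)  # 512 for BERT models

# A single long word can map to many WordPiece tokens, so the
# transformer token count grows faster than the spaCy token count.
text = "Electroencephalographically speaking, this is unremarkable."
print(len(tokenizer(text)["input_ids"]))
```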

To be able to process longer texts, the spaCy [`transformer`](/api/transformer)
component uses [`span_getters`](/api/transformer#span_getters) to convert a
batch of [`Doc`](/api/doc) objects into lists of [`Span`](/api/span) objects. A
span may correspond to a doc (for `doc_spans`), a sentence (for `sent_spans`) or
a window of spaCy tokens (`strided_spans`). If a single span corresponds to more
transformer tokens than the transformer model supports, the spaCy pipeline can't
process the text because some spaCy tokens would be left without an analysis.
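
Which getter is used, and with what settings, lives in the pipeline config
under the transformer component. A small sketch for inspecting it, assuming a
loaded transformer pipeline such as `en_core_web_trf`:

```python
import spacy

nlp = spacy.load("en_core_web_trf")
# The span getter settings, e.g. the strided-span window and stride,
# are stored in the transformer component's model config.
print(nlp.config["components"]["transformer"]["model"]["get_spans"])
```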

In general, it is up to the transformer pipeline user to manage the input texts
so that the model max length is not exceeded. If you're training a **new
pipeline**, you have a number of options to handle the max length limit:

- Use `doc_spans` with short texts only
- Use `sent_spans` with short sentences only
- For `strided_spans`, lower the `window` size to be short enough for your input
  texts (and don't forget to lower the `stride` correspondingly)
- Implement a [custom span getter](#transformers-training-custom-settings), as
  sketched below
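
As a rough illustration of the last option, a span getter can be registered
with `spacy-transformers` and then referenced by name from the config. The
getter name `custom_sent_spans` and the `max_length` parameter here are
invented for the example:

```python
import spacy_transformers

@spacy_transformers.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans(max_length: int):
    # Keep only sentences short enough (in spaCy tokens) for the
    # transformer model; longer sentences are dropped, not truncated.
    def get_custom_sent_spans(docs):
        spans = []
        for doc in docs:
            spans.append([sent for sent in doc.sents if len(sent) < max_length])
        return spans

    return get_custom_sent_spans
```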

You may still run into the max length limit if a single spaCy token is very
long, like a long URL or a noisy string, or if you're using a **pretrained
pipeline** like `en_core_web_trf` with a fixed `window` size for
`strided_spans`. In this case, you need to modify either your texts or your
pipeline so that you have shorter spaCy tokens. Some options:

- Preprocess your texts to clean up noise and split long tokens with whitespace
- Add a `token_splitter` to the beginning of your pipeline to break up tokens
  that are longer than a specified length:

  ```python
  config = {"min_length": 20, "split_length": 5}
  nlp.add_pipe("token_splitter", config=config, first=True)
  ```

In this example, tokens that are at least 20 characters long will be split up
into smaller tokens of 5 characters each, resulting in strided spans that
correspond to fewer transformer tokens.
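
To check the effect, you can run the splitter in a blank pipeline. A quick
sketch; the sample string is artificial and the printed output is what the
settings above should produce:

```python
import spacy

nlp = spacy.blank("en")
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)

# A single 22-character token is split into 5-character chunks.
doc = nlp("aaaaabbbbbcccccdddddzz")
print([token.text for token in doc])
# Expected: ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'zz']
```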

## Static vectors {#static-vectors}

If your pipeline includes a **word vectors table**, you'll be able to use the