diff --git a/website/docs/api/pipeline-functions.md b/website/docs/api/pipeline-functions.md
index 628d36000..a776eca9b 100644
--- a/website/docs/api/pipeline-functions.md
+++ b/website/docs/api/pipeline-functions.md
@@ -113,8 +113,7 @@ end of the pipeline and after all other components.
 
 Split tokens longer than a minimum length into shorter tokens. Intended for use
 with transformer pipelines where long spaCy tokens lead to input text that
-exceed the transformer model max length. See
-[managing transformer model max length limitations](/usage/embeddings-transformers#transformer-max-length).
+exceeds the transformer model max length.
 
 > #### Example
 >
diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md
index fdf15d187..7e47ac9d2 100644
--- a/website/docs/usage/embeddings-transformers.md
+++ b/website/docs/usage/embeddings-transformers.md
@@ -481,50 +481,6 @@ custom learning rate for each component. Instead of a constant, you can also
 provide a schedule, allowing you to freeze the shared parameters at the start
 of training.
 
-### Managing transformer model max length limitations {#transformer-max-length}
-
-Many transformer models have a limit on the maximum number of tokens that the
-model can process, for example BERT models are limited to 512 tokens. This limit
-refers to the number of transformer tokens (BPE, WordPiece, etc.), not the
-number of spaCy tokens.
-
-To be able to process longer texts, the spaCy [`transformer`](/api/transformer)
-component uses [`span_getters`](/api/transformer#span_getters) to convert a
-batch of [`Doc`](/api/doc) objects into lists of [`Span`](/api/span) objects. A
-span may correspond to a doc (for `doc_spans`), a sentence (for `sent_spans`) or
-a window of spaCy tokens (`strided_spans`). If a single span corresponds to more
-transformer tokens than the transformer model supports, the spaCy pipeline can't
-process the text because some spaCy tokens would be left without an analysis.
-
-In general, it is up to the transformer pipeline user to manage the input texts
-so that the model max length is not exceeded. If you're training a **new
-pipeline**, you have a number of options to handle the max length limit:
-
-- Use `doc_spans` with short texts only
-- Use `sent_spans` with short sentences only
-- For `strided_spans`, lower the `window` size to be short enough for your input
-  texts (and don't forget to lower the `stride` correspondingly)
-- Implement a [custom span getter](#transformers-training-custom-settings)
-
-You may still run into the max length limit if a single spaCy token is very
-long, like a long URL or a noisy string, or if you're using a **pretrained
-pipeline** like `en_core_web_trf` with a fixed `window` size for
-`strided_spans`. In this case, you need to modify either your texts or your
-pipeline so that you have shorter spaCy tokens. Some options:
-
-- Preprocess your texts to clean up noise and split long tokens with whitespace
-- Add a `token_splitter` to the beginning of your pipeline to break up
-  tokens that are longer than a specified length:
-
-  ```python
-  config={"min_length": 20, "split_length": 5}
-  nlp.add_pipe("token_splitter", config=config, first=True)
-  ```
-
-  In this example, tokens that are at least 20 characters long will be split up
-  into smaller tokens of 5 characters each, resulting in strided spans that
-  correspond to fewer transformer tokens.
-
 ## Static vectors {#static-vectors}
 
 If your pipeline includes a **word vectors table**, you'll be able to use the
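
For reference, a minimal sketch of the `token_splitter` usage that the surviving `pipeline-functions.md` entry describes. This note is illustrative and not part of the patch: it assumes the `en_core_web_trf` pipeline is installed, and the input text is made up.

```python
import spacy

# Assumes en_core_web_trf has been downloaded, e.g. via
# `python -m spacy download en_core_web_trf`.
nlp = spacy.load("en_core_web_trf")

# Split any spaCy token of 20+ characters into 5-character pieces before the
# transformer runs, so each span maps to fewer transformer tokens.
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)

# Illustrative input: a long URL that would otherwise remain a single token.
doc = nlp("See https://example.com/some/very/long/noisy/path?with=query&p=1")
print([token.text for token in doc])
```

Passing `first=True` matters here: the splitting has to happen before the `transformer` component converts the doc into spans for the model.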