mirror of https://github.com/explosion/spaCy.git
Remove transformers model max length section (#6807)
This commit is contained in:
parent ffc371350a
commit 61c9f8bf24

@@ -113,8 +113,7 @@ end of the pipeline and after all other components.

Split tokens longer than a minimum length into shorter tokens. Intended for use
with transformer pipelines where long spaCy tokens lead to input text that
-exceeds the transformer model max length. See
-[managing transformer model max length limitations](/usage/embeddings-transformers#transformer-max-length).
+exceeds the transformer model max length.

> #### Example
>

@@ -481,50 +481,6 @@ custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.

### Managing transformer model max length limitations {#transformer-max-length}

Many transformer models have a limit on the maximum number of tokens that the
model can process; for example, BERT models are limited to 512 tokens. This
limit refers to the number of transformer tokens (BPE, WordPiece, etc.), not the
number of spaCy tokens.
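
To see how many transformer tokens a text produces, and what limit a given
model enforces, you can inspect the model's tokenizer directly. This is an
illustrative sketch, not part of the original docs; it assumes the
`transformers` library is installed and uses `bert-base-uncased` as an example:

```python
from transformers import AutoTokenizer

# Load the tokenizer that the transformer model uses.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)  # 512 for BERT models

# A single long word can map to many WordPiece tokens, so the
# transformer token count grows faster than the spaCy token count.
text = "Electroencephalographically speaking, this is unremarkable."
print(len(tokenizer(text)["input_ids"]))
```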

To be able to process longer texts, the spaCy [`transformer`](/api/transformer)
component uses [`span_getters`](/api/transformer#span_getters) to convert a
batch of [`Doc`](/api/doc) objects into lists of [`Span`](/api/span) objects. A
span may correspond to a doc (for `doc_spans`), a sentence (for `sent_spans`) or
a window of spaCy tokens (`strided_spans`). If a single span corresponds to more
transformer tokens than the transformer model supports, the spaCy pipeline can't
process the text because some spaCy tokens would be left without an analysis.
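
Which getter is used, and with what settings, lives in the pipeline config
under the transformer component. A small sketch for inspecting it, assuming a
loaded transformer pipeline such as `en_core_web_trf`:

```python
import spacy

nlp = spacy.load("en_core_web_trf")
# The span getter settings, e.g. the strided-span window and stride,
# are stored in the transformer component's model config.
print(nlp.config["components"]["transformer"]["model"]["get_spans"])
```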

In general, it is up to the transformer pipeline user to manage the input texts
so that the model max length is not exceeded. If you're training a **new
pipeline**, you have a number of options to handle the max length limit:

- Use `doc_spans` with short texts only
- Use `sent_spans` with short sentences only
- For `strided_spans`, lower the `window` size to be short enough for your input
  texts (and don't forget to lower the `stride` correspondingly)
- Implement a [custom span getter](#transformers-training-custom-settings), as
  sketched below
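
As a rough illustration of the last option, a span getter can be registered
with `spacy-transformers` and then referenced by name from the config. The
getter name `custom_sent_spans` and the `max_length` parameter here are
invented for the example:

```python
import spacy_transformers

@spacy_transformers.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans(max_length: int):
    # Keep only sentences short enough (in spaCy tokens) for the
    # transformer model; longer sentences are dropped, not truncated.
    def get_custom_sent_spans(docs):
        spans = []
        for doc in docs:
            spans.append([sent for sent in doc.sents if len(sent) < max_length])
        return spans

    return get_custom_sent_spans
```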

You may still run into the max length limit if a single spaCy token is very
long, like a long URL or a noisy string, or if you're using a **pretrained
pipeline** like `en_core_web_trf` with a fixed `window` size for
`strided_spans`. In this case, you need to modify either your texts or your
pipeline so that you have shorter spaCy tokens. Some options:

- Preprocess your texts to clean up noise and split long tokens with whitespace
- Add a `token_splitter` to the beginning of your pipeline to break up tokens
  that are longer than a specified length:

  ```python
  config = {"min_length": 20, "split_length": 5}
  nlp.add_pipe("token_splitter", config=config, first=True)
  ```

In this example, tokens that are at least 20 characters long will be split up
into smaller tokens of 5 characters each, resulting in strided spans that
correspond to fewer transformer tokens.
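
To check the effect, you can run the splitter in a blank pipeline. A quick
sketch; the sample string is artificial and the printed output is what the
settings above should produce:

```python
import spacy

nlp = spacy.blank("en")
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)

# A single 22-character token is split into 5-character chunks.
doc = nlp("aaaaabbbbbcccccdddddzz")
print([token.text for token in doc])
# Expected: ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'zz']
```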

## Static vectors {#static-vectors}

If your pipeline includes a **word vectors table**, you'll be able to use the