diff --git a/website/docs/api/language.md b/website/docs/api/language.md
index be402532c..9ab25597d 100644
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@@ -49,11 +49,11 @@ contain arbitrary whitespace. Alignment into the original string is preserved.
> assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
> ```
-| Name | Type | Description |
-| ----------- | ----- | --------------------------------------------------------------------------------- |
-| `text` | str | The text to be processed. |
-| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
-| **RETURNS** | `Doc` | A container for accessing the annotations. |
+| Name | Type | Description |
+| ----------- | ----------- | --------------------------------------------------------------------------------- |
+| `text` | str | The text to be processed. |
+| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
+| **RETURNS** | `Doc` | A container for accessing the annotations. |
## Language.pipe {#pipe tag="method"}
@@ -112,14 +112,14 @@ Evaluate a model's pipeline components.
> print(scores)
> ```
-| Name | Type | Description |
-| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------- |
-| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
-| `verbose` | bool | Print debugging information. |
-| `batch_size` | int | The batch size to use. |
-| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
-| `component_cfg` 2.1 | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
-| **RETURNS** | `Dict[str, Union[float, Dict]]` | A dictionary of evaluation scores. |
+| Name | Type | Description |
+| -------------------------------------------- | ------------------------------- | ------------------------------------------------------------------------------------- |
+| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
+| `verbose` | bool | Print debugging information. |
+| `batch_size` | int | The batch size to use. |
+| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
+| `component_cfg` 2.1 | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
+| **RETURNS** | `Dict[str, Union[float, Dict]]` | A dictionary of evaluation scores. |
## Language.begin_training {#begin_training tag="method"}
@@ -418,11 +418,70 @@ available to the loaded object.
## Class attributes {#class-attributes}
-| Name | Type | Description |
-| -------------------------------------- | ----- | ----------------------------------------------------------------------------------------------------------------------------------- |
-| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
-| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
-| `factories` 2 | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
+| Name | Type | Description |
+| ---------- | ----- | ----------------------------------------------------------------------------------------------- |
+| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
+| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
+
+## Defaults {#defaults}
+
+The following attributes can be set on the `Language.Defaults` class to
+customize the default language data:
+
+> #### Example
+>
+> ```python
+> from spacy.language import Language
+> from spacy.lang.tokenizer_exceptions import URL_MATCH
+> from thinc.api import Config
+>
+> DEFAULT_CONFIFG = """
+> [nlp.tokenizer]
+> @tokenizers = "MyCustomTokenizer.v1"
+> """
+>
+> class Defaults(Language.Defaults):
+> stop_words = set()
+> tokenizer_exceptions = {}
+> prefixes = tuple()
+> suffixes = tuple()
+> infixes = tuple()
+> token_match = None
+> url_match = URL_MATCH
+> lex_attr_getters = {}
+> syntax_iterators = {}
+> writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
+> config = Config().from_str(DEFAULT_CONFIG)
+> ```
+
+| Name | Description |
+| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`][stop_words.py] |
+| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`][de/tokenizer_exceptions.py] |
+| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`][punctuation.py] |
+| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`][fr/tokenizer_exceptions.py] |
+| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`][tokenizer_exceptions.py] |
+| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`][lex_attrs.py] |
+| `syntax_iterators` | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).<br />**Example:** [`syntax_iterators.py`][syntax_iterators.py] |
+| `writing_system` | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to `{"direction": "ltr", "has_case": True, "has_letters": True}`.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
+| `config` | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
+
+[stop_words.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
+[tokenizer_exceptions.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/tokenizer_exceptions.py
+[de/tokenizer_exceptions.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
+[fr/tokenizer_exceptions.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/tokenizer_exceptions.py
+[punctuation.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
+[lex_attrs.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
+[syntax_iterators.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
+[zh/__init__.py]:
+  https://github.com/explosion/spaCy/tree/master/spacy/lang/zh/__init__.py
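+
+As a quick check, the defaults are available on both the language subclass and
+the `nlp` object. The following is a minimal sketch, assuming the English data
+shipped with spaCy (the printed values are illustrative):
+
+```python
+from spacy.lang.en import English
+
+nlp = English()
+# Defaults is a class attribute, so it can be inspected on the class or instance
+print(English.Defaults.writing_system)
+# {'direction': 'ltr', 'has_case': True, 'has_letters': True}
+print("the" in nlp.Defaults.stop_words)  # True
+```
+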
## Serialization fields {#serialization-fields}
diff --git a/website/docs/usage/101/_language-data.md b/website/docs/usage/101/_language-data.md
index 31bfe53ab..2917b19c4 100644
--- a/website/docs/usage/101/_language-data.md
+++ b/website/docs/usage/101/_language-data.md
@@ -8,12 +8,10 @@ makes the data easy to update and extend.
The **shared language data** in the directory root includes rules that can be
generalized across languages – for example, rules for basic punctuation, emoji,
-emoticons, single-letter abbreviations and norms for equivalent tokens with
-different spellings, like `"` and `”`. This helps the models make more accurate
-predictions. The **individual language data** in a submodule contains rules that
-are only relevant to a particular language. It also takes care of putting
-together all components and creating the `Language` subclass – for example,
-`English` or `German`.
+emoticons and single-letter abbreviations. The **individual language data** in a
+submodule contains rules that are only relevant to a particular language. It
+also takes care of putting together all components and creating the `Language`
+subclass – for example, `English` or `German`.
> ```python
> from spacy.lang.en import English
@@ -23,27 +21,28 @@ together all components and creating the `Language` subclass – for example,
> nlp_de = German() # Includes German data
> ```
+
+
+
+
| Name | Description |
| ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Stop words**<br />[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
-| **Norm exceptions**<br />[`norm_exceptions.py`][norm_exceptions.py] | Special-case rules for normalizing tokens to improve the model's predictions, for example on American vs. British spelling. |
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
-| **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
-| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
-[norm_exceptions.py]:
- https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py
[punctuation.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
[char_classes.py]:
@@ -52,8 +51,4 @@ together all components and creating the `Language` subclass – for example,
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
-[tag_map.py]:
- https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
-[morph_rules.py]:
- https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index 27512f61b..18c33c7bb 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -602,7 +602,95 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
-### Tokenizer data {#101-data}
+
+
+spaCy introduces a novel tokenization algorithm that gives a better balance
+between performance, ease of definition, and ease of alignment into the original
+string.
+
+After consuming a prefix or suffix, we consult the special cases again. We want
+the special cases to handle things like "don't" in English, and we want the same
+rule to work for "(don't)!". We do this by splitting off the open bracket, then
+the exclamation, then the close bracket, and finally matching the special case.
+Here's an implementation of the algorithm in Python, optimized for readability
+rather than performance:
+
+```python
+def tokenizer_pseudo_code(
+    text,
+    special_cases,
+    prefix_search,
+    suffix_search,
+    infix_finditer,
+    token_match,
+    url_match
+):
+ tokens = []
+ for substring in text.split():
+ suffixes = []
+ while substring:
+ while prefix_search(substring) or suffix_search(substring):
+ if token_match(substring):
+ tokens.append(substring)
+ substring = ""
+ break
+ if substring in special_cases:
+ tokens.extend(special_cases[substring])
+ substring = ""
+ break
+ if prefix_search(substring):
+ split = prefix_search(substring).end()
+ tokens.append(substring[:split])
+ substring = substring[split:]
+ if substring in special_cases:
+ continue
+ if suffix_search(substring):
+ split = suffix_search(substring).start()
+ suffixes.append(substring[split:])
+ substring = substring[:split]
+ if token_match(substring):
+ tokens.append(substring)
+ substring = ""
+ elif url_match(substring):
+ tokens.append(substring)
+ substring = ""
+ elif substring in special_cases:
+ tokens.extend(special_cases[substring])
+ substring = ""
+ elif list(infix_finditer(substring)):
+ infixes = infix_finditer(substring)
+ offset = 0
+ for match in infixes:
+ tokens.append(substring[offset : match.start()])
+ tokens.append(substring[match.start() : match.end()])
+ offset = match.end()
+ if substring[offset:]:
+ tokens.append(substring[offset:])
+ substring = ""
+ elif substring:
+ tokens.append(substring)
+ substring = ""
+ tokens.extend(reversed(suffixes))
+ return tokens
+```
+
+The algorithm can be summarized as follows:
+
+1. Iterate over whitespace-separated substrings.
+2. Look for a token match. If there is a match, stop processing and keep this
+ token.
+3. Check whether we have an explicitly defined special case for this substring.
+ If we do, use it.
+4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
+ so that the token match and special cases always get priority.
+5. If we didn't consume a prefix, try to consume a suffix and then go back to
+ #2.
+6. If we can't consume a prefix or a suffix, look for a URL match.
+7. If there's no URL match, then look for a special case.
+8. Look for "infixes" — stuff like hyphens etc. and split the substring into
+ tokens on all infixes.
+9. Once we can't consume any more of the string, handle it as a single token.
+
+
**Global** and **language-specific** tokenizer data is supplied via the language
data in
@@ -613,15 +701,6 @@ The prefixes, suffixes and infixes mostly define punctuation rules – for
example, when to split off periods (at the end of a sentence), and when to leave
tokens containing periods intact (abbreviations like "U.S.").
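+
+A blank pipeline is enough to see this distinction in action. The following is a
+small sketch assuming the default English punctuation rules, where the period of
+the abbreviation stays attached while the sentence-final period is split off:
+
+```python
+from spacy.lang.en import English
+
+nlp = English()
+# "U.S." keeps its period, "big." is split into "big" and "."
+print([t.text for t in nlp("The U.S. is big.")])
+# ['The', 'U.S.', 'is', 'big', '.']
+```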
-![Language data architecture](../images/language_data.svg)
-
-
-
-For more details on the language-specific data, see the usage guide on
-[adding languages](/usage/adding-languages).
-
-
-
Tokenization rules that are specific to one language, but can be **generalized
@@ -637,6 +716,14 @@ subclass.
---
+
+
### Adding special case tokenization rules {#special-cases}
Most domains have at least some idiosyncrasies that require custom tokenization
@@ -677,88 +764,6 @@ nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
assert len(nlp("...gimme...?")) == 1
```
-### How spaCy's tokenizer works {#how-tokenizer-works}
-
-spaCy introduces a novel tokenization algorithm, that gives a better balance
-between performance, ease of definition, and ease of alignment into the original
-string.
-
-After consuming a prefix or suffix, we consult the special cases again. We want
-the special cases to handle things like "don't" in English, and we want the same
-rule to work for "(don't)!". We do this by splitting off the open bracket, then
-the exclamation, then the close bracket, and finally matching the special case.
-Here's an implementation of the algorithm in Python, optimized for readability
-rather than performance:
-
-```python
-def tokenizer_pseudo_code(self, special_cases, prefix_search, suffix_search,
- infix_finditer, token_match, url_match):
- tokens = []
- for substring in text.split():
- suffixes = []
- while substring:
- while prefix_search(substring) or suffix_search(substring):
- if token_match(substring):
- tokens.append(substring)
- substring = ''
- break
- if substring in special_cases:
- tokens.extend(special_cases[substring])
- substring = ''
- break
- if prefix_search(substring):
- split = prefix_search(substring).end()
- tokens.append(substring[:split])
- substring = substring[split:]
- if substring in special_cases:
- continue
- if suffix_search(substring):
- split = suffix_search(substring).start()
- suffixes.append(substring[split:])
- substring = substring[:split]
- if token_match(substring):
- tokens.append(substring)
- substring = ''
- elif url_match(substring):
- tokens.append(substring)
- substring = ''
- elif substring in special_cases:
- tokens.extend(special_cases[substring])
- substring = ''
- elif list(infix_finditer(substring)):
- infixes = infix_finditer(substring)
- offset = 0
- for match in infixes:
- tokens.append(substring[offset : match.start()])
- tokens.append(substring[match.start() : match.end()])
- offset = match.end()
- if substring[offset:]:
- tokens.append(substring[offset:])
- substring = ''
- elif substring:
- tokens.append(substring)
- substring = ''
- tokens.extend(reversed(suffixes))
- return tokens
-```
-
-The algorithm can be summarized as follows:
-
-1. Iterate over whitespace-separated substrings.
-2. Look for a token match. If there is a match, stop processing and keep this
- token.
-3. Check whether we have an explicitly defined special case for this substring.
- If we do, use it.
-4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
- so that the token match and special cases always get priority.
-5. If we didn't consume a prefix, try to consume a suffix and then go back to
- #2.
-6. If we can't consume a prefix or a suffix, look for a URL match.
-7. If there's no URL match, then look for a special case.
-8. Look for "infixes" — stuff like hyphens etc. and split the substring into
- tokens on all infixes.
-9. Once we can't consume any more of the string, handle it as a single token.
-
#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}
A working implementation of the pseudo-code above is available for debugging as
@@ -766,6 +771,17 @@ A working implementation of the pseudo-code above is available for debugging as
tuples showing which tokenizer rule or pattern was matched for each token. The
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:
+> #### Expected output
+>
+> ```
+> " PREFIX
+> Let SPECIAL-1
+> 's SPECIAL-2
+> go TOKEN
+> ! SUFFIX
+> " SUFFIX
+> ```
+
```python
### {executable="true"}
from spacy.lang.en import English
@@ -777,13 +793,6 @@ tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
print(t[1], "\\t", t[0])
-
-# " PREFIX
-# Let SPECIAL-1
-# 's SPECIAL-2
-# go TOKEN
-# ! SUFFIX
-# " SUFFIX
```
### Customizing spaCy's Tokenizer class {#native-tokenizers}
@@ -1437,3 +1446,73 @@ print("After:", [sent.text for sent in doc.sents])
import LanguageData101 from 'usage/101/\_language-data.md'
+
+### Creating a custom language subclass {#language-subclass}
+
+If you want to customize multiple components of the language data or add support
+for a custom language or domain-specific "dialect", you can also implement your
+own language subclass. The subclass should define two attributes: the `lang`
+(unique language code) and the `Defaults` defining the language data. For an
+overview of the available attributes that can be overwritten, see the
+[`Language.Defaults`](/api/language#defaults) documentation.
+
+```python
+### {executable="true"}
+from spacy.lang.en import English
+
+class CustomEnglishDefaults(English.Defaults):
+ stop_words = set(["custom", "stop"])
+
+class CustomEnglish(English):
+ lang = "custom_en"
+ Defaults = CustomEnglishDefaults
+
+nlp1 = English()
+nlp2 = CustomEnglish()
+
+print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
+print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
+```
+
+The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you
+register a custom language class and assign it a string name. This means that
+you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom
+language name, and even train models with it and refer to it in your
+[training config](/usage/training#config).
+
+> #### Config usage
+>
+> After registering your custom language class using the `languages` registry,
+> you can refer to it in your [training config](/usage/training#config). This
+> means spaCy will train your model using the custom subclass.
+>
+> ```ini
+> [nlp]
+> lang = "custom_en"
+> ```
+>
+> In order to resolve `"custom_en"` to your subclass, the registered function
+> needs to be available during training. You can load a Python file containing
+> the code using the `--code` argument:
+>
+> ```bash
+> ### {wrap="true"}
+> $ python -m spacy train train.spacy dev.spacy config.cfg --code code.py
+> ```
+
+```python
+### Registering a custom language {highlight="7,12-13"}
+import spacy
+from spacy.lang.en import English
+
+class CustomEnglishDefaults(English.Defaults):
+ stop_words = set(["custom", "stop"])
+
+@spacy.registry.languages("custom_en")
+class CustomEnglish(English):
+ lang = "custom_en"
+ Defaults = CustomEnglishDefaults
+
+# This now works! 🎉
+nlp = spacy.blank("custom_en")
+```
diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md
index c56044be0..ff8f91683 100644
--- a/website/docs/usage/projects.md
+++ b/website/docs/usage/projects.md
@@ -618,7 +618,9 @@ mattis pretium.
[FastAPI](https://fastapi.tiangolo.com/) is a modern high-performance framework
for building REST APIs with Python, based on Python
[type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular
-library for serving machine learning models and
+library for serving machine learning models, and you can use it in your spaCy
+projects to quickly serve up a trained model and make it available behind a REST
+API.
```python
# TODO: show an example that addresses some of the main concerns for serving ML (workers etc.)
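+
+# NOTE: a minimal sketch of what such an example could look like. The model
+# path, endpoint name and response shape are assumptions, not the final example.
+from fastapi import FastAPI
+from pydantic import BaseModel
+import spacy
+
+app = FastAPI()
+nlp = spacy.load("./training/model-best")
+
+class Query(BaseModel):
+    text: str
+
+@app.post("/parse")
+def parse(query: Query):
+    # Process the incoming text and return token-level annotations
+    doc = nlp(query.text)
+    return {"tokens": [{"text": t.text, "tag": t.tag_} for t in doc]}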
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index d8290a7a1..b45788e34 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -74,7 +74,7 @@ When you train a model using the [`spacy train`](/api/cli#train) command, you'll
see a table showing metrics after each pass over the data. Here's what those
metrics mean:
-
+
| Name | Description |
| ---------- | ------------------------------------------------------------------------------------------------- |
@@ -116,7 +116,7 @@ integrate custom models and architectures, written in your framework of choice.
Some of the main advantages and features of spaCy's training config are:
- **Structured sections.** The config is grouped into sections, and nested
- sections are defined using the `.` notation. For example, `[nlp.pipeline.ner]`
+ sections are defined using the `.` notation. For example, `[components.ner]`
defines the settings for the pipeline's named entity recognizer. The config
can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
@@ -136,10 +136,8 @@ Some of the main advantages and features of spaCy's training config are:
Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
config which types of data to expect.
-
-
```ini
-https://github.com/explosion/spaCy/blob/develop/examples/experiments/onto-joint/defaults.cfg
+https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg
```
Under the hood, the config is parsed into a dictionary. It's divided into
@@ -151,11 +149,12 @@ not just define static settings, but also construct objects like architectures,
schedules, optimizers or any other custom components. The main top-level
sections of a config file are:
-| Section | Description |
-| ------------- | ----------------------------------------------------------------------------------------------------- |
-| `training` | Settings and controls for the training and evaluation process. |
-| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
-| `nlp` | Definition of the [processing pipeline](/docs/processing-pipelines), its components and their models. |
+| Section | Description |
+| ------------- | -------------------------------------------------------------------------------------------------------------------- |
+| `training` | Settings and controls for the training and evaluation process. |
+| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
+| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/docs/processing-pipelines) component names. |
+| `components` | Definitions of the [pipeline components](/docs/processing-pipelines) and their models. |
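+
+Because the config is parsed into a dictionary under the hood, you can also load
+and inspect it programmatically. A minimal sketch, assuming a `config.cfg` in the
+current working directory:
+
+```python
+from thinc.api import Config
+
+config = Config().from_disk("./config.cfg")
+# The top-level sections behave like keys of a nested dictionary
+print(list(config.keys()))  # e.g. ['training', 'pretraining', 'nlp', 'components']
+print(config["nlp"])
+```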
@@ -176,16 +175,16 @@ a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
-hard-code in a config file, or **system-dependent settings** like the GPU ID.
+hard-coded in a config file, or **system-dependent settings**.
For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
-`--training.use_gpu 1` sets the `use_gpu` value in the `[training]` block to
-`1`.
+`--training.batch_size 128` sets the `batch_size` value in the `[training]`
+block to `128`.
```bash
$ python -m spacy train train.spacy dev.spacy config.cfg
---training.use_gpu 1 --nlp.vectors /path/to/vectors
+--training.batch_size 128 --nlp.vectors /path/to/vectors
```
Only existing sections and values in the config can be overwritten. At the end
diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md
index 049462553..13f6e67af 100644
--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@@ -14,4 +14,20 @@ menu:
## Backwards Incompatibilities {#incompat}
+### Removed deprecated methods, attributes and arguments {#incompat-removed}
+
+The following deprecated methods, attributes and arguments were removed in v3.0.
+Most of them had been deprecated for quite a while, many already raised errors,
+and several were only used internally. If you've been working with more recent
+versions of spaCy v2.x, it's unlikely that your code relied on them.
+
+| Class | Removed |
+| --------------------- | ------------------------------------------------------- |
+| [`Doc`](/api/doc) | `Doc.tokens_from_list`, `Doc.merge` |
+| [`Span`](/api/span) | `Span.merge`, `Span.upper`, `Span.lower`, `Span.string` |
+| [`Token`](/api/token) | `Token.string` |
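+
+For the removed `Doc.merge` and `Span.merge`, the `Doc.retokenize` context
+manager covers the same use case. A minimal sketch of the replacement pattern
+(the example text is only an illustration):
+
+```python
+import spacy
+
+nlp = spacy.blank("en")
+doc = nlp("New York is busy")
+with doc.retokenize() as retokenizer:
+    # Merge "New York" into a single token instead of calling doc.merge
+    retokenizer.merge(doc[0:2])
+print([t.text for t in doc])  # ['New York', 'is', 'busy']
+```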
+
+
+
## Migrating from v2.x {#migrating}