Update docs [ci skip]

This commit is contained in:
Ines Montani 2020-07-25 18:51:12 +02:00
parent 1346ee06d4
commit c288dba8e7
6 changed files with 297 additions and 147 deletions

View File

@ -49,11 +49,11 @@ contain arbitrary whitespace. Alignment into the original string is preserved.
> assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
> ```
| Name | Type | Description |
| ----------- | ----------- | --------------------------------------------------------------------------------- |
| `text` | str | The text to be processed. |
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Doc` | A container for accessing the annotations. |
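For instance, individual components can be skipped for a single call using the `disable` argument. A minimal sketch, assuming the small English package is installed:
```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this package is installed
# Skip the parser for this one call only
doc = nlp("This is a sentence.", disable=["parser"])
print([token.text for token in doc])
```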
## Language.pipe {#pipe tag="method"}
@ -112,14 +112,14 @@ Evaluate a model's pipeline components.
> print(scores)
> ```
| Name | Type | Description |
| -------------------------------------------- | ------------------------------- | ------------------------------------------------------------------------------------- |
| `examples`                                   | `Iterable[Example]`             | A batch of [`Example`](/api/example) objects to evaluate.                               |
| `verbose` | bool | Print debugging information. |
| `batch_size` | int | The batch size to use. |
| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
| **RETURNS** | `Dict[str, Union[float, Dict]]` | A dictionary of evaluation scores. |
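As a rough sketch of the workflow (the package name is an assumption, and the import path for `Example` can differ between v3 nightlies):
```python
import spacy
from spacy.training import Example  # in some nightlies: from spacy.gold import Example

nlp = spacy.load("en_core_web_sm")  # assumes this package is installed
texts = ["Berlin is a city.", "I like London."]
annotations = [{"entities": [(0, 6, "GPE")]}, {"entities": [(7, 13, "GPE")]}]
examples = [
    Example.from_dict(nlp.make_doc(text), annots)
    for text, annots in zip(texts, annotations)
]
scores = nlp.evaluate(examples)
print(scores)
```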
## Language.begin_training {#begin_training tag="method"}
@ -418,11 +418,70 @@ available to the loaded object.
## Class attributes {#class-attributes}
| Name | Type | Description |
| ---------- | ----- | ----------------------------------------------------------------------------------------------- |
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
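Both attributes are available on the class itself as well as on instances. A small sketch:
```python
from spacy.lang.de import German

assert German.lang == "de"
nlp = German()
assert nlp.lang == "de"
# German.Defaults bundles the language data used to build the pipeline
print(German.Defaults)
```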
## Defaults {#defaults}
The following attributes can be set on the `Language.Defaults` class to
customize the default language data:
> #### Example
>
> ```python
> from spacy.language import Language
> from spacy.lang.tokenizer_exceptions import URL_MATCH
> from thinc.api import Config
>
> DEFAULT_CONFIG = """
> [nlp.tokenizer]
> @tokenizers = "MyCustomTokenizer.v1"
> """
>
> class Defaults(Language.Defaults):
>     stop_words = set()
>     tokenizer_exceptions = {}
>     prefixes = tuple()
>     suffixes = tuple()
>     infixes = tuple()
>     token_match = None
>     url_match = URL_MATCH
>     lex_attr_getters = {}
>     syntax_iterators = {}
>     writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
>     config = Config().from_str(DEFAULT_CONFIG)
> ```
| Name | Description |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`][stop_words.py] |
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`][de/tokenizer_exceptions.py] |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`][punctuation.py]                                                                                                             |
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`][fr/tokenizer_exceptions.py] |
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`][tokenizer_exceptions.py] |
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`][lex_attrs.py] |
| `syntax_iterators`                | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).<br />**Example:** [`syntax_iterators.py`][syntax_iterators.py]      |
| `writing_system`                  | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to `{"direction": "ltr", "has_case": True, "has_letters": True}`.<br />**Example:** [`zh/__init__.py`][zh/__init__.py]     |
| `config` | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/tokenizer_exceptions.py
[de/tokenizer_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[fr/tokenizer_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/tokenizer_exceptions.py
[punctuation.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
[lex_attrs.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[zh/__init__.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/zh/__init__.py
## Serialization fields {#serialization-fields}

View File

@ -8,12 +8,10 @@ makes the data easy to update and extend.
The **shared language data** in the directory root includes rules that can be
generalized across languages, for example rules for basic punctuation, emoji,
emoticons and single-letter abbreviations. The **individual language data** in a
submodule contains rules that are only relevant to a particular language. It
also takes care of putting together all components and creating the `Language`
subclass, for example `English` or `German`.
> ```python
> from spacy.lang.en import English
@ -23,27 +21,28 @@ together all components and creating the `Language` subclass for example,
> nlp_de = German() # Includes German data
> ```
<!-- TODO: upgrade graphic
![Language data architecture](../../images/language_data.svg)
-->
<!-- TODO: remove this table in favor of more specific Language.Defaults docs in linguistic features? -->
| Name | Description |
| ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Stop words**<br />[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
| **Norm exceptions**<br />[`norm_exceptions.py`][norm_exceptions.py] | Special-case rules for normalizing tokens to improve the model's predictions, for example on American vs. British spelling. |
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
| **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[norm_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py
[punctuation.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
[char_classes.py]:
@ -52,8 +51,4 @@ together all components and creating the `Language` subclass for example,
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[tag_map.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
[morph_rules.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
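As a quick illustration of where some of this data surfaces at runtime, here's a minimal sketch using the English defaults:
```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I can't wait!")
# Tokenizer exceptions split the contraction, e.g. "can't" -> "ca", "n't"
print([token.text for token in doc])
# Stop words feed Token.is_stop, lexical attributes feed e.g. Token.like_num
print([token.is_stop for token in doc])
print([token.like_num for token in nlp("ten apples")])
```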

View File

@ -602,7 +602,95 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
<Tokenization101 />
### Tokenizer data {#101-data}
<Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>
spaCy introduces a novel tokenization algorithm that gives a better balance
between performance, ease of definition, and ease of alignment into the original
string.
After consuming a prefix or suffix, we consult the special cases again. We want
the special cases to handle things like "don't" in English, and we want the same
rule to work for "(don't)!". We do this by splitting off the open bracket, then
the exclamation, then the close bracket, and finally matching the special case.
Here's an implementation of the algorithm in Python, optimized for readability
rather than performance:
```python
def tokenizer_pseudo_code(
    text,
    special_cases,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match,
    url_match
):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            while prefix_search(substring) or suffix_search(substring):
                if token_match(substring):
                    tokens.append(substring)
                    substring = ""
                    break
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ""
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            if token_match(substring):
                tokens.append(substring)
                substring = ""
            elif url_match(substring):
                tokens.append(substring)
                substring = ""
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            elif substring:
                tokens.append(substring)
                substring = ""
        tokens.extend(reversed(suffixes))
    return tokens
```
The algorithm can be summarized as follows:
1. Iterate over whitespace-separated substrings.
2. Look for a token match. If there is a match, stop processing and keep this
token.
3. Check whether we have an explicitly defined special case for this substring.
If we do, use it.
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
so that the token match and special cases always get priority.
5. If we didn't consume a prefix, try to consume a suffix and then go back to
#2.
6. If we can't consume a prefix or a suffix, look for a URL match.
7. If there's no URL match, then look for a special case.
8. Look for "infixes" — stuff like hyphens etc. and split the substring into
tokens on all infixes.
9. Once we can't consume any more of the string, handle it as a single token.
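To see the control flow in action, here's a toy driver for the pseudo-code above (assuming the function is defined in the same session). The one-character prefix/suffix rules, the single special case and the dummy `token_match`/`url_match` below are deliberately minimal and are **not** spaCy's actual defaults:
```python
import re

prefix_search = re.compile(r"""^[\[\("']""").search
suffix_search = re.compile(r"""[\]\)"'!?.,]$""").search
infix_finditer = re.compile(r"(?<=[a-zA-Z])-(?=[a-zA-Z])").finditer
special_cases = {"don't": ["do", "n't"]}

def no_match(substring):
    # stand-in for token_match and url_match: never matches
    return None

tokens = tokenizer_pseudo_code(
    "(don't) re-read!",
    special_cases,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match=no_match,
    url_match=no_match,
)
print(tokens)  # ['(', 'do', "n't", ')', 're', '-', 'read', '!']
```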
</Accordion>
**Global** and **language-specific** tokenizer data is supplied via the language
data in
@ -613,15 +701,6 @@ The prefixes, suffixes and infixes mostly define punctuation rules for
example, when to split off periods (at the end of a sentence), and when to leave
tokens containing periods intact (abbreviations like "U.S.").
![Language data architecture](../images/language_data.svg)
<Infobox title="Language data" emoji="📖">
For more details on the language-specific data, see the usage guide on
[adding languages](/usage/adding-languages).
</Infobox>
<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">
Tokenization rules that are specific to one language, but can be **generalized
@ -637,6 +716,14 @@ subclass.
---
<!--
### Customizing the tokenizer {#tokenizer-custom}
TODO: rewrite the docs on custom tokenization in a more user-friendly order, including details on how to integrate a fully custom tokenizer, representing a tokenizer in the config etc.
-->
### Adding special case tokenization rules {#special-cases}
Most domains have at least some idiosyncrasies that require custom tokenization
@ -677,88 +764,6 @@ nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
assert len(nlp("...gimme...?")) == 1
```
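A special case can also split a substring into several tokens. A minimal sketch (the token texts in `ORTH` must concatenate back to the original string; `"lemme"` is just an illustration):
```python
from spacy.lang.en import English
from spacy.symbols import ORTH

nlp = English()
nlp.tokenizer.add_special_case("lemme", [{ORTH: "lem"}, {ORTH: "me"}])
print([token.text for token in nlp("lemme go")])  # ['lem', 'me', 'go']
```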
#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}
A working implementation of the pseudo-code above is available for debugging as
@ -766,6 +771,17 @@ A working implementation of the pseudo-code above is available for debugging as
tuples showing which tokenizer rule or pattern was matched for each token. The
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:
> #### Expected output
>
> ```
> " PREFIX
> Let SPECIAL-1
> 's SPECIAL-2
> go TOKEN
> ! SUFFIX
> " SUFFIX
> ```
```python
### {executable="true"}
from spacy.lang.en import English
@ -777,13 +793,6 @@ tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\\t", t[0])
# " PREFIX
# Let SPECIAL-1
# 's SPECIAL-2
# go TOKEN
# ! SUFFIX
# " SUFFIX
```
### Customizing spaCy's Tokenizer class {#native-tokenizers}
@ -1437,3 +1446,73 @@ print("After:", [sent.text for sent in doc.sents])
import LanguageData101 from 'usage/101/\_language-data.md'
<LanguageData101 />
### Creating a custom language subclass {#language-subclass}
If you want to customize multiple components of the language data or add support
for a custom language or domain-specific "dialect", you can also implement your
own language subclass. The subclass should define two attributes: the `lang`
(unique language code) and the `Defaults` defining the language data. For an
overview of the available attributes that can be overwritten, see the
[`Language.Defaults`](/api/language#defaults) documentation.
```python
### {executable="true"}
from spacy.lang.en import English
class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults
nlp1 = English()
nlp2 = CustomEnglish()
print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
```
The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you
register a custom language class and assign it a string name. This means that
you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom
language name, and even train models with it and refer to it in your
[training config](/usage/training#config).
> #### Config usage
>
> After registering your custom language class using the `languages` registry,
> you can refer to it in your [training config](/usage/training#config). This
> means spaCy will train your model using the custom subclass.
>
> ```ini
> [nlp]
> lang = "custom_en"
> ```
>
> In order to resolve `"custom_en"` to your subclass, the registered function
> needs to be available during training. You can load a Python file containing
> the code using the `--code` argument:
>
> ```bash
> ### {wrap="true"}
> $ python -m spacy train train.spacy dev.spacy config.cfg --code code.py
> ```
```python
### Registering a custom language {highlight="7,12-13"}
import spacy
from spacy.lang.en import English
class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults
# This now works! 🎉
nlp = spacy.blank("custom_en")
```

View File

@ -618,7 +618,9 @@ mattis pretium.
[FastAPI](https://fastapi.tiangolo.com/) is a modern high-performance framework
for building REST APIs with Python, based on Python
[type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular
library for serving machine learning models and you can use it in your spaCy
projects to quickly serve up a trained model and make it available behind a REST
API.
```python
# TODO: show an example that addresses some of the main concerns for serving ML (workers etc.)
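# Until that example lands, here's a minimal sketch of the idea (illustrative
# only: the model name, request schema and endpoint path are assumptions, not
# part of the final project template):
from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_core_web_sm")  # assumes this package is installed

class TextRequest(BaseModel):
    text: str

@app.post("/ner")
def extract_entities(request: TextRequest):
    doc = nlp(request.text)
    return {"entities": [(ent.text, ent.label_) for ent in doc.ents]}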

View File

@ -74,7 +74,7 @@ When you train a model using the [`spacy train`](/api/cli#train) command, you'll
see a table showing metrics after each pass over the data. Here's what those
metrics mean:
<!-- TODO: update table below and include note about scores in config -->
| Name | Description |
| ---------- | ------------------------------------------------------------------------------------------------- |
@ -116,7 +116,7 @@ integrate custom models and architectures, written in your framework of choice.
Some of the main advantages and features of spaCy's training config are:
- **Structured sections.** The config is grouped into sections, and nested
sections are defined using the `.` notation. For example, `[components.ner]`
defines the settings for the pipeline's named entity recognizer. The config
can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
@ -136,10 +136,8 @@ Some of the main advantages and features of spaCy's training config are:
Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
config which types of data to expect.
<!-- TODO: update this config? -->
```ini
https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg
```
Under the hood, the config is parsed into a dictionary. It's divided into
@ -151,11 +149,12 @@ not just define static settings, but also construct objects like architectures,
schedules, optimizers or any other custom components. The main top-level
sections of a config file are:
| Section | Description |
| ------------- | -------------------------------------------------------------------------------------------------------------------- |
| `training` | Settings and controls for the training and evaluation process. |
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/docs/processing-pipelines) component names. |
| `components` | Definitions of the [pipeline components](/docs/processing-pipelines) and their models. |
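For illustration, here's a minimal sketch of loading a config string and reading it back as a nested dictionary with Thinc's `Config` class (the section contents are just a skeleton, not a complete training config):
```python
from thinc.api import Config

CONFIG_STR = """
[nlp]
lang = "en"

[training]
max_epochs = 10
"""

config = Config().from_str(CONFIG_STR)
print(config["nlp"]["lang"])             # "en"
print(config["training"]["max_epochs"])  # 10
```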
<Infobox title="Config format and settings" emoji="📖">
@ -176,16 +175,16 @@ a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-coded in a config file, or **system-dependent settings**.
For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
`--training.batch_size 128` sets the `batch_size` value in the `[training]`
block to `128`.
```bash
$ python -m spacy train train.spacy dev.spacy config.cfg
--training.batch_size 128 --nlp.vectors /path/to/vectors
```
Only existing sections and values in the config can be overwritten. At the end

View File

@ -14,4 +14,20 @@ menu:
## Backwards Incompatibilities {#incompat}
### Removed deprecated methods, attributes and arguments {#incompat-removed}
The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them had been deprecated for a while and many previously raised errors.
Many of them were also internals. If you've been working with more recent
versions of spaCy v2.x, it's unlikely that your code relied on them.
| Class | Removed |
| --------------------- | ------------------------------------------------------- |
| [`Doc`](/api/doc) | `Doc.tokens_from_list`, `Doc.merge` |
| [`Span`](/api/span) | `Span.merge`, `Span.upper`, `Span.lower`, `Span.string` |
| [`Token`](/api/token) | `Token.string` |
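For example, code that called the removed `Doc.merge` or `Span.merge` can switch to the [retokenizer](/api/doc#retokenize) context manager, available since v2.1. A small sketch:
```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is busy")
# Previously: doc.merge(...) or doc[0:2].merge()
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])
print([token.text for token in doc])  # ['New York', 'is', 'busy']
```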
<!-- TODO: complete (see release notes Dropbox Paper doc) -->
## Migrating from v2.x {#migrating}