From c288dba8e79906e632f6771f6647bfaa84f14bd8 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Sat, 25 Jul 2020 18:51:12 +0200 Subject: [PATCH] Update docs [ci skip] --- website/docs/api/language.md | 95 ++++++-- website/docs/usage/101/_language-data.md | 25 +- website/docs/usage/linguistic-features.md | 277 ++++++++++++++-------- website/docs/usage/projects.md | 4 +- website/docs/usage/training.md | 27 +-- website/docs/usage/v3.md | 16 ++ 6 files changed, 297 insertions(+), 147 deletions(-) diff --git a/website/docs/api/language.md b/website/docs/api/language.md index be402532c..9ab25597d 100644 --- a/website/docs/api/language.md +++ b/website/docs/api/language.md @@ -49,11 +49,11 @@ contain arbitrary whitespace. Alignment into the original string is preserved. > assert (doc[0].text, doc[0].head.tag_) == ("An", "NN") > ``` -| Name | Type | Description | -| ----------- | ----- | --------------------------------------------------------------------------------- | -| `text` | str | The text to be processed. | -| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | -| **RETURNS** | `Doc` | A container for accessing the annotations. | +| Name | Type | Description | +| ----------- | ----------- | --------------------------------------------------------------------------------- | +| `text` | str | The text to be processed. | +| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | +| **RETURNS** | `Doc` | A container for accessing the annotations. | ## Language.pipe {#pipe tag="method"} @@ -112,14 +112,14 @@ Evaluate a model's pipeline components. > print(scores) > ``` -| Name | Type | Description | -| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------- | -| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. 
| -| `verbose` | bool | Print debugging information. | -| `batch_size` | int | The batch size to use. | -| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. | -| `component_cfg` 2.1 | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. | -| **RETURNS** | `Dict[str, Union[float, Dict]]` | A dictionary of evaluation scores. | +| Name | Type | Description | +| -------------------------------------------- | ------------------------------- | ------------------------------------------------------------------------------------- | +| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. | +| `verbose` | bool | Print debugging information. | +| `batch_size` | int | The batch size to use. | +| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. | +| `component_cfg` 2.1 | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. | +| **RETURNS** | `Dict[str, Union[float, Dict]]` | A dictionary of evaluation scores. | ## Language.begin_training {#begin_training tag="method"} @@ -418,11 +418,70 @@ available to the loaded object. ## Class attributes {#class-attributes} -| Name | Type | Description | -| -------------------------------------- | ----- | ----------------------------------------------------------------------------------------------------------------------------------- | -| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. | -| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). | -| `factories` 2 | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. 
| +| Name | Type | Description | +| ---------- | ----- | ----------------------------------------------------------------------------------------------- | +| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. | +| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). | + +## Defaults {#defaults} + +The following attributes can be set on the `Language.Defaults` class to +customize the default language data: + +> #### Example +> +> ```python +> from spacy.language import Language +> from spacy.lang.tokenizer_exceptions import URL_MATCH +> from thinc.api import Config +> +> DEFAULT_CONFIG = """ +> [nlp.tokenizer] +> @tokenizers = "MyCustomTokenizer.v1" +> """ +> +> class Defaults(Language.Defaults): +> stop_words = set() +> tokenizer_exceptions = {} +> prefixes = tuple() +> suffixes = tuple() +> infixes = tuple() +> token_match = None +> url_match = URL_MATCH +> lex_attr_getters = {} +> syntax_iterators = {} +> writing_system = {"direction": "ltr", "has_case": True, "has_letters": True} +> config = Config().from_str(DEFAULT_CONFIG) +> ``` + +| Name | Description | +| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `stop_words` | List of stop words, used for `Token.is_stop`.
**Example:** [`stop_words.py`][stop_words.py] | +| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.
**Example:** [`de/tokenizer_exceptions.py`][de/tokenizer_exceptions.py] | +| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.
**Example:** [`punctuation.py`][punctuation.py] | +| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.
**Example:** [`fr/tokenizer_exceptions.py`][fr/tokenizer_exceptions.py] | +| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.
**Example:** [`tokenizer_exceptions.py`][tokenizer_exceptions.py] | +| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.
**Example:** [`lex_attrs.py`][lex_attrs.py] | +| `syntax_iterators` | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).
**Example:** [`syntax_iterators.py`][syntax_iterators.py] | +| `writing_system` | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to: `{"direction": "ltr", "has_case": True, "has_letters": True}`.
**Example:** [`zh/__init__.py`][zh/__init__.py] | +| `config` | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.
**Example:** [`zh/__init__.py`][zh/__init__.py] | + +[stop_words.py]: + https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py +[tokenizer_exceptions.py]: + https://github.com/explosion/spaCy/tree/master/spacy/lang/tokenizer_exceptions.py +[de/tokenizer_exceptions.py]: + https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py +[fr/tokenizer_exceptions.py]: + https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/tokenizer_exceptions.py +[punctuation.py]: + https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py +[lex_attrs.py]: + https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py +[syntax_iterators.py]: + https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py +[zh/__init__.py]: + https://github.com/explosion/spaCy/tree/master/spacy/lang/zh/__init__.py ## Serialization fields {#serialization-fields} diff --git a/website/docs/usage/101/_language-data.md b/website/docs/usage/101/_language-data.md index 31bfe53ab..2917b19c4 100644 --- a/website/docs/usage/101/_language-data.md +++ b/website/docs/usage/101/_language-data.md @@ -8,12 +8,10 @@ makes the data easy to update and extend. The **shared language data** in the directory root includes rules that can be generalized across languages – for example, rules for basic punctuation, emoji, -emoticons, single-letter abbreviations and norms for equivalent tokens with -different spellings, like `"` and `”`. This helps the models make more accurate -predictions. The **individual language data** in a submodule contains rules that -are only relevant to a particular language. It also takes care of putting -together all components and creating the `Language` subclass – for example, -`English` or `German`. +emoticons and single-letter abbreviations. The **individual language data** in a +submodule contains rules that are only relevant to a particular language. 
It +also takes care of putting together all components and creating the `Language` +subclass – for example, `English` or `German`. > ```python > from spacy.lang.en import English @@ -23,27 +21,28 @@ together all components and creating the `Language` subclass – for example, > nlp_de = German() # Includes German data > ``` + + + + | Name | Description | | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | | **Stop words**
[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. | | **Tokenizer exceptions**
[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". | -| **Norm exceptions**
[`norm_exceptions.py`][norm_exceptions.py] | Special-case rules for normalizing tokens to improve the model's predictions, for example on American vs. British spelling. | | **Punctuation rules**
[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. | | **Character classes**
[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. | | **Lexical attributes**
[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". | | **Syntax iterators**
[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). | -| **Tag map**
[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. | -| **Morph rules**
[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. | | **Lemmatizer**
[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". | [stop_words.py]: https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py [tokenizer_exceptions.py]: https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py -[norm_exceptions.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py [punctuation.py]: https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py [char_classes.py]: @@ -52,8 +51,4 @@ together all components and creating the `Language` subclass – for example, https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py [syntax_iterators.py]: https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py -[tag_map.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py -[morph_rules.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py [spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index 27512f61b..18c33c7bb 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -602,7 +602,95 @@ import Tokenization101 from 'usage/101/\_tokenization.md' -### Tokenizer data {#101-data} + + +spaCy introduces a novel tokenization algorithm, that gives a better balance +between performance, ease of definition, and ease of alignment into the original +string. + +After consuming a prefix or suffix, we consult the special cases again. We want +the special cases to handle things like "don't" in English, and we want the same +rule to work for "(don't)!". We do this by splitting off the open bracket, then +the exclamation, then the close bracket, and finally matching the special case. 
+Here's an implementation of the algorithm in Python, optimized for readability +rather than performance: + +```python +def tokenizer_pseudo_code( + text, + special_cases, + prefix_search, + suffix_search, + infix_finditer, + token_match, + url_match +): + tokens = [] + for substring in text.split(): + suffixes = [] + while substring: + while prefix_search(substring) or suffix_search(substring): + if token_match(substring): + tokens.append(substring) + substring = "" + break + if substring in special_cases: + tokens.extend(special_cases[substring]) + substring = "" + break + if prefix_search(substring): + split = prefix_search(substring).end() + tokens.append(substring[:split]) + substring = substring[split:] + if substring in special_cases: + continue + if suffix_search(substring): + split = suffix_search(substring).start() + suffixes.append(substring[split:]) + substring = substring[:split] + if token_match(substring): + tokens.append(substring) + substring = "" + elif url_match(substring): + tokens.append(substring) + substring = "" + elif substring in special_cases: + tokens.extend(special_cases[substring]) + substring = "" + elif list(infix_finditer(substring)): + infixes = infix_finditer(substring) + offset = 0 + for match in infixes: + tokens.append(substring[offset : match.start()]) + tokens.append(substring[match.start() : match.end()]) + offset = match.end() + if substring[offset:]: + tokens.append(substring[offset:]) + substring = "" + elif substring: + tokens.append(substring) + substring = "" + tokens.extend(reversed(suffixes)) + return tokens +``` + +The algorithm can be summarized as follows: + +1. Iterate over whitespace-separated substrings. +2. Look for a token match. If there is a match, stop processing and keep this + token. +3. Check whether we have an explicitly defined special case for this substring. + If we do, use it. +4. Otherwise, try to consume one prefix.
If we consumed a prefix, go back to #2, + so that the token match and special cases always get priority. +5. If we didn't consume a prefix, try to consume a suffix and then go back to + #2. +6. If we can't consume a prefix or a suffix, look for a URL match. +7. If there's no URL match, then look for a special case. +8. Look for "infixes" — stuff like hyphens etc. and split the substring into + tokens on all infixes. +9. Once we can't consume any more of the string, handle it as a single token. + + **Global** and **language-specific** tokenizer data is supplied via the language data in @@ -613,15 +701,6 @@ The prefixes, suffixes and infixes mostly define punctuation rules – for example, when to split off periods (at the end of a sentence), and when to leave tokens containing periods intact (abbreviations like "U.S."). -![Language data architecture](../images/language_data.svg) - - - -For more details on the language-specific data, see the usage guide on -[adding languages](/usage/adding-languages). - - - Tokenization rules that are specific to one language, but can be **generalized @@ -637,6 +716,14 @@ subclass. --- + + ### Adding special case tokenization rules {#special-cases} Most domains have at least some idiosyncrasies that require custom tokenization @@ -677,88 +764,6 @@ nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}]) assert len(nlp("...gimme...?")) == 1 ``` -### How spaCy's tokenizer works {#how-tokenizer-works} - -spaCy introduces a novel tokenization algorithm, that gives a better balance -between performance, ease of definition, and ease of alignment into the original -string. - -After consuming a prefix or suffix, we consult the special cases again. We want -the special cases to handle things like "don't" in English, and we want the same -rule to work for "(don't)!". We do this by splitting off the open bracket, then -the exclamation, then the close bracket, and finally matching the special case. 
-Here's an implementation of the algorithm in Python, optimized for readability -rather than performance: - -```python -def tokenizer_pseudo_code(self, special_cases, prefix_search, suffix_search, - infix_finditer, token_match, url_match): - tokens = [] - for substring in text.split(): - suffixes = [] - while substring: - while prefix_search(substring) or suffix_search(substring): - if token_match(substring): - tokens.append(substring) - substring = '' - break - if substring in special_cases: - tokens.extend(special_cases[substring]) - substring = '' - break - if prefix_search(substring): - split = prefix_search(substring).end() - tokens.append(substring[:split]) - substring = substring[split:] - if substring in special_cases: - continue - if suffix_search(substring): - split = suffix_search(substring).start() - suffixes.append(substring[split:]) - substring = substring[:split] - if token_match(substring): - tokens.append(substring) - substring = '' - elif url_match(substring): - tokens.append(substring) - substring = '' - elif substring in special_cases: - tokens.extend(special_cases[substring]) - substring = '' - elif list(infix_finditer(substring)): - infixes = infix_finditer(substring) - offset = 0 - for match in infixes: - tokens.append(substring[offset : match.start()]) - tokens.append(substring[match.start() : match.end()]) - offset = match.end() - if substring[offset:]: - tokens.append(substring[offset:]) - substring = '' - elif substring: - tokens.append(substring) - substring = '' - tokens.extend(reversed(suffixes)) - return tokens -``` - -The algorithm can be summarized as follows: - -1. Iterate over whitespace-separated substrings. -2. Look for a token match. If there is a match, stop processing and keep this - token. -3. Check whether we have an explicitly defined special case for this substring. - If we do, use it. -4. Otherwise, try to consume one prefix. 
If we consumed a prefix, go back to #2, - so that the token match and special cases always get priority. -5. If we didn't consume a prefix, try to consume a suffix and then go back to - #2. -6. If we can't consume a prefix or a suffix, look for a URL match. -7. If there's no URL match, then look for a special case. -8. Look for "infixes" — stuff like hyphens etc. and split the substring into - tokens on all infixes. -9. Once we can't consume any more of the string, handle it as a single token. - #### Debugging the tokenizer {#tokenizer-debug new="2.2.3"} A working implementation of the pseudo-code above is available for debugging as @@ -766,6 +771,17 @@ A working implementation of the pseudo-code above is available for debugging as tuples showing which tokenizer rule or pattern was matched for each token. The tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens: +> #### Expected output +> +> ``` +> " PREFIX +> Let SPECIAL-1 +> 's SPECIAL-2 +> go TOKEN +> ! SUFFIX +> " SUFFIX +> ``` + ```python ### {executable="true"} from spacy.lang.en import English @@ -777,13 +793,6 @@ tok_exp = nlp.tokenizer.explain(text) assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp] for t in tok_exp: print(t[1], "\\t", t[0]) - -# " PREFIX -# Let SPECIAL-1 -# 's SPECIAL-2 -# go TOKEN -# ! SUFFIX -# " SUFFIX ``` ### Customizing spaCy's Tokenizer class {#native-tokenizers} @@ -1437,3 +1446,73 @@ print("After:", [sent.text for sent in doc.sents]) import LanguageData101 from 'usage/101/\_language-data.md' + +### Creating a custom language subclass {#language-subclass} + +If you want to customize multiple components of the language data or add support +for a custom language or domain-specific "dialect", you can also implement your +own language subclass. The subclass should define two attributes: the `lang` +(unique language code) and the `Defaults` defining the language data. 
For an +overview of the available attributes that can be overwritten, see the +[`Language.Defaults`](/api/language#defaults) documentation. + +```python +### {executable="true"} +from spacy.lang.en import English + +class CustomEnglishDefaults(English.Defaults): + stop_words = set(["custom", "stop"]) + +class CustomEnglish(English): + lang = "custom_en" + Defaults = CustomEnglishDefaults + +nlp1 = English() +nlp2 = CustomEnglish() + +print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")]) +print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")]) +``` + +The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you +register a custom language class and assign it a string name. This means that +you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom +language name, and even train models with it and refer to it in your +[training config](/usage/training#config). + +> #### Config usage +> +> After registering your custom language class using the `languages` registry, +> you can refer to it in your [training config](/usage/training#config). This +> means spaCy will train your model using the custom subclass. +> +> ```ini +> [nlp] +> lang = "custom_en" +> ``` +> +> In order to resolve `"custom_en"` to your subclass, the registered function +> needs to be available during training. You can load a Python file containing +> the code using the `--code` argument: +> +> ```bash +> ### {wrap="true"} +> $ python -m spacy train train.spacy dev.spacy config.cfg --code code.py +> ``` + +```python +### Registering a custom language {highlight="7,12-13"} +import spacy +from spacy.lang.en import English + +class CustomEnglishDefaults(English.Defaults): + stop_words = set(["custom", "stop"]) + +@spacy.registry.languages("custom_en") +class CustomEnglish(English): + lang = "custom_en" + Defaults = CustomEnglishDefaults + +# This now works! 
🎉 +nlp = spacy.blank("custom_en") +``` diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index c56044be0..ff8f91683 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -618,7 +618,9 @@ mattis pretium. [FastAPI](https://fastapi.tiangolo.com/) is a modern high-performance framework for building REST APIs with Python, based on Python [type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular -library for serving machine learning models and +library for serving machine learning models and you can use it in your spaCy +projects to quickly serve up a trained model and make it available behind a REST +API. ```python # TODO: show an example that addresses some of the main concerns for serving ML (workers etc.) diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index d8290a7a1..b45788e34 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -74,7 +74,7 @@ When you train a model using the [`spacy train`](/api/cli#train) command, you'll see a table showing metrics after each pass over the data. Here's what those metrics means: - + | Name | Description | | ---------- | ------------------------------------------------------------------------------------------------- | @@ -116,7 +116,7 @@ integrate custom models and architectures, written in your framework of choice. Some of the main advantages and features of spaCy's training config are: - **Structured sections.** The config is grouped into sections, and nested - sections are defined using the `.` notation. For example, `[nlp.pipeline.ner]` + sections are defined using the `.` notation. For example, `[components.ner]` defines the settings for the pipeline's named entity recognizer. The config can be loaded as a Python dict. 
- **References to registered functions.** Sections can refer to registered @@ -136,10 +136,8 @@ Some of the main advantages and features of spaCy's training config are: Python [type hints](https://docs.python.org/3/library/typing.html) to tell the config which types of data to expect. - - ```ini -https://github.com/explosion/spaCy/blob/develop/examples/experiments/onto-joint/defaults.cfg +https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg ``` Under the hood, the config is parsed into a dictionary. It's divided into @@ -151,11 +149,12 @@ not just define static settings, but also construct objects like architectures, schedules, optimizers or any other custom components. The main top-level sections of a config file are: -| Section | Description | -| ------------- | ----------------------------------------------------------------------------------------------------- | -| `training` | Settings and controls for the training and evaluation process. | -| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). | -| `nlp` | Definition of the [processing pipeline](/docs/processing-pipelines), its components and their models. | +| Section | Description | +| ------------- | -------------------------------------------------------------------------------------------------------------------- | +| `training` | Settings and controls for the training and evaluation process. | +| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). | +| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/docs/processing-pipelines) component names. | +| `components` | Definitions of the [pipeline components](/docs/processing-pipelines) and their models. | @@ -176,16 +175,16 @@ a consistent format. There are no command-line arguments that need to be set, and no hidden defaults. 
However, there can still be scenarios where you may want to override config settings when you run [`spacy train`](/api/cli#train). This includes **file paths** to vectors or other resources that shouldn't be -hard-code in a config file, or **system-dependent settings** like the GPU ID. +hard-coded in a config file, or **system-dependent settings**. For cases like this, you can set additional command-line options starting with `--` that correspond to the config section and value to override. For example, -`--training.use_gpu 1` sets the `use_gpu` value in the `[training]` block to -`1`. +`--training.batch_size 128` sets the `batch_size` value in the `[training]` +block to `128`. ```bash $ python -m spacy train train.spacy dev.spacy config.cfg ---training.use_gpu 1 --nlp.vectors /path/to/vectors +--training.batch_size 128 --nlp.vectors /path/to/vectors ``` Only existing sections and values in the config can be overwritten. At the end diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index 049462553..13f6e67af 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -14,4 +14,20 @@ menu: ## Backwards Incompatibilities {#incompat} +### Removed deprecated methods, attributes and arguments {#incompat-removed} + +The following deprecated methods, attributes and arguments were removed in v3.0. +Most of them have been deprecated for quite a while now and many would +previously raise errors. Many of them were also mostly internals. If you've been +working with more recent versions of spaCy v2.x, it's unlikely that your code +relied on them. + +| Class | Removed | +| --------------------- | ------------------------------------------------------- | +| [`Doc`](/api/doc) | `Doc.tokens_from_list`, `Doc.merge` | +| [`Span`](/api/span) | `Span.merge`, `Span.upper`, `Span.lower`, `Span.string` | +| [`Token`](/api/token) | `Token.string` | + + + ## Migrating from v2.x {#migrating}