From 52bd3a8b48d354de57c1eb59d78abdb32c3e3a30 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Fri, 21 Aug 2020 13:22:59 +0200 Subject: [PATCH] Update docs [ci skip] --- website/docs/api/morphology.md | 2 +- website/docs/api/token.md | 4 +- website/docs/api/top-level.md | 17 ++ website/docs/usage/embeddings-transformers.md | 4 + website/docs/usage/v3.md | 182 ++++++++++++++++-- 5 files changed, 190 insertions(+), 19 deletions(-) diff --git a/website/docs/api/morphology.md b/website/docs/api/morphology.md index 1b2e159d0..5d5324061 100644 --- a/website/docs/api/morphology.md +++ b/website/docs/api/morphology.md @@ -7,7 +7,7 @@ source: spacy/morphology.pyx Store the possible morphological analyses for a language, and index them by hash. To save space on each token, tokens only know the hash of their morphological analysis, so queries of morphological attributes are delegated to -this class. See [`MorphAnalysis`](/api/morphology#morphansalysis) for the +this class. See [`MorphAnalysis`](/api/morphology#morphanalysis) for the container storing a single morphological analysis. ## Morphology.\_\_init\_\_ {#init tag="method"} diff --git a/website/docs/api/token.md b/website/docs/api/token.md index 4a8e6eba7..0860797aa 100644 --- a/website/docs/api/token.md +++ b/website/docs/api/token.md @@ -450,8 +450,8 @@ The L2 norm of the token's vector representation. | `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ | | `tag` | Fine-grained part-of-speech. ~~int~~ | | `tag_` | Fine-grained part-of-speech. ~~str~~ | -| `morph` | Morphological analysis. ~~MorphAnalysis~~ | -| `morph_` | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ | +| `morph` 3 | Morphological analysis. ~~MorphAnalysis~~ | +| `morph_` 3 | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ | | `dep` | Syntactic dependency relation. ~~int~~ | | `dep_` | Syntactic dependency relation. ~~str~~ | | `lang` | Language of the parent document's vocabulary. ~~int~~ | diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 89c53cce3..9c65b2982 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -632,6 +632,23 @@ validate its contents. | `path` | Path to the model's `meta.json`. ~~Union[str, Path]~~ | | **RETURNS** | The model's meta data. ~~Dict[str, Any]~~ | +### util.get_installed_models {#util.get_installed_models tag="function" new="3"} + +List all model packages installed in the current environment. This will include +any spaCy model that was packaged with [`spacy package`](/api/cli#package). +Under the hood, model packages expose a Python entry point that spaCy can check, +without having to load the model. + +> #### Example +> +> ```python +> model_names = util.get_installed_models() +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------- | +| **RETURNS** | The string names of the models installed in the current environment. ~~List[str]~~ | + ### util.is_package {#util.is_package tag="function"} Check if string maps to a package installed via pip. Mainly used to validate diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index c2727f5b1..70562cf7e 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -11,6 +11,10 @@ next: /usage/training +If you're looking for details on using word vectors and semantic similarity, +check out the +[linguistic features docs](/usage/linguistic-features#vectors-similarity). + The key difference between [word vectors](#word-vectors) and contextual language diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index 837818a83..3111bf38e 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -10,6 +10,32 @@ menu: ## Summary {#summary} + + +
+ +
+ + + +- [Summary](#summary) +- [New features](#features) +- [Training & config system](#features-training) +- [Transformer-based pipelines](#features-transformers) +- [Custom models](#features-custom-models) +- [End-to-end project workflows](#features-projects) +- [New built-in components](#features-pipeline-components) +- [New custom component API](#features-components) +- [Python type hints](#features-types) +- [New methods & attributes](#new-methods) +- [New & updated documentation](#new-docs) +- [Backwards incompatibilities](#incompat) +- [Migrating from spaCy v2.x](#migrating) + + + +
+ ## New Features {#features} ### New training workflow and config system {#features-training} @@ -28,6 +54,8 @@ menu: ### Transformer-based pipelines {#features-transformers} +![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg) + - **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers), @@ -46,8 +74,53 @@ menu: ### Custom models using any framework {#features-custom-models} + + + + +- **Thinc: ** + [Wrapping PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks) +- **API:** [Model architectures](/api/architectures), [`Pipe`](/api/pipe) + + + ### Manage end-to-end workflows with projects {#features-projects} + + +> #### Example +> +> ```cli +> # Clone a project template +> $ python -m spacy project clone example +> $ cd example +> # Download data assets +> $ python -m spacy project assets +> # Run a workflow +> $ python -m spacy project run train +> ``` + +spaCy projects let you manage and share **end-to-end spaCy workflows** for +different **use cases and domains**, and orchestrate training, packaging and +serving your custom models. You can start off by cloning a pre-defined project +template, adjust it to fit your needs, load in your data, train a model, export +it as a Python package and share the project templates with your team. spaCy +projects also make it easy to **integrate with other tools** in the data science +and machine learning ecosystem, including [DVC](/usage/projects#dvc) for data +version control, [Prodigy](/usage/projects#prodigy) for creating labelled data, +[Streamlit](/usage/projects#streamlit) for building interactive apps, +[FastAPI](/usage/projects#fastapi) for serving models in production, +[Ray](/usage/projects#ray) for parallel training, +[Weights & Biases](/usage/projects#wandb) for experiment tracking, and more! + + + - **Usage:** [spaCy projects](/usage/projects), @@ -59,6 +132,16 @@ menu: ### New built-in pipeline components {#features-pipeline-components} +spaCy v3.0 includes several new trainable and rule-based components that you can +add to your pipeline and customize for your use case: + +> #### Example +> +> ```python +> nlp = spacy.blank("en") +> nlp.add_pipe("lemmatizer") +> ``` + | Name | Description | | ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. | @@ -78,15 +161,37 @@ menu: ### New and improved pipeline component APIs {#features-components} -- `Language.factory`, `Language.component` -- `Language.analyze_pipes` -- Adding components from other models +> #### Example +> +> ```python +> @Language.component("my_component") +> def my_component(doc): +> return doc +> +> nlp.add_pipe("my_component") +> nlp.add_pipe("ner", source=other_nlp) +> nlp.analyze_pipes(pretty=True) +> ``` + +Defining, configuring, reusing, training and analyzing pipeline components is +now easier and more convenient. The `@Language.component` and +`@Language.factory` decorators let you register your component, define its +default configuration and meta data, like the attribute values it assigns and +requires. Any custom component can be included during training, and sourcing +components from existing pretrained models lets you **mix and match custom +pipelines**. The `nlp.analyze_pipes` method outputs structured information about +the current pipeline and its components, including the attributes they assign, +the scores they compute during training and whether any required attributes +aren't set. - **Usage:** [Custom components](/usage/processing-pipelines#custom_components), - [Defining components during training](/usage/training#config-components) -- **API:** [`Language`](/api/language) + [Defining components for training](/usage/training#config-components) +- **API:** [`@Language.component`](/api/language#component), + [`@Language.factory`](/api/language#factory), + [`Language.add_pipe`](/api/language#add_pipe), + [`Language.analyze_pipes`](/api/language#analyze_pipes) - **Implementation:** [`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py) @@ -136,13 +241,14 @@ in your config and see validation errors if the argument values don't match. -### New methods, attributes and commands +### New methods, attributes and commands {#new-methods} The following methods, attributes and commands are new in spaCy v3.0. | Name | Description | | ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). | +| [`Token.morph`](/api/token#attributes) [`Token.morph_`](/api/token#attributes) | Access a token's morphological analysis. | | [`Language.select_pipes`](/api/language#select_pipes) | Contextmanager for enabling or disabling specific pipeline components for a block. | | [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. | | [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. | @@ -153,9 +259,52 @@ The following methods, attributes and commands are new in spaCy v3.0. | [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. | | [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). | | [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). | +| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. | | [`init config`](/api/cli#init-config) [`init fill-config`](/api/cli#init-fill-config) [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). | | [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). | +### New and updated documentation {#new-docs} + + + +
+ +To help you get started with spaCy v3.0 and the new features, we've added +several new or rewritten documentation pages, including a new usage guide on +[embeddings, transformers and transfer learning](/usage/embeddings-transformers), +a guide on [training models](/usage/training) rewritten from scratch, a page +explaining the new [spaCy projects](/usage/projects) and updated usage +documentation on +[custom pipeline components](/usage/processing-pipelines#custom-components). +We've also added a bunch of new illustrations and new API reference pages +documenting spaCy's machine learning [model architectures](/api/architectures) +and the expected [data formats](/api/data-formats). API pages about +[pipeline components](/api/#architecture-pipeline) now include more information, +like the default config and implementation, and we've adopted a more detailed +format for documenting argument and return types. + +
+ +[![Library architecture](../images/architecture.svg)](/api) + +
+ + + +- **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers), + [Training models](/usage/training), [Projects](/usage/projects), + [Custom pipeline components](/usage/processing-pipelines#custom-components) +- **API Reference: ** [Library architecture](/api), + [Model architectures](/api/architectures), [Data formats](/api/data-formats) +- **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec), + [`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer), + [`Morphologizer`](/api/morphologizer), + [`AttributeRuler`](/api/attributeruler), + [`SentenceRecognizer`](/api/sentencerecognizer), [`Pipe`](/api/pipe), + [`Corpus`](/api/corpus) + + + ## Backwards Incompatibilities {#incompat} As always, we've tried to keep the breaking changes to a minimum and focus on @@ -212,15 +361,16 @@ Note that spaCy v3.0 now requires **Python 3.6+**. ### Removed or renamed API {#incompat-removed} -| Removed | Replacement | -| ------------------------------------------------------ | ----------------------------------------------------------------------------------------- | -| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) | -| `GoldParse` | [`Example`](/api/example) | -| `GoldCorpus` | [`Corpus`](/api/corpus) | -| `KnowledgeBase.load_bulk` `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk) [`KnowledgeBase.to_disk`](/api/kb#to_disk) | -| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) | -| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) | -| `spacy link` `util.set_data_path` `util.get_data_path` | not needed, model symlinks are deprecated | +| Removed | Replacement | +| -------------------------------------------------------- | ----------------------------------------------------------------------------------------- | +| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) | +| `GoldParse` | [`Example`](/api/example) | +| `GoldCorpus` | [`Corpus`](/api/corpus) | +| `KnowledgeBase.load_bulk` `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk) [`KnowledgeBase.to_disk`](/api/kb#to_disk) | +| `spacy init-model` | [`spacy init model`](/api/cli#init-model) | +| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) | +| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) | +| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated | The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been **deprecated for a while** and many would previously @@ -236,7 +386,7 @@ on them. | `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) | | keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` | | `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` | -| `verbose` argument on [`Language.evaluate`] | logging | +| `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) | | `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) | ## Migrating from v2.x {#migrating}