💫 Update website (#3285)
<!--- Provide a general summary of your changes in the title. -->
## Description
The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.
This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

---
title: Trained Models & Pipelines
teaser: Downloadable trained pipelines and weights for spaCy
menu:
  - ['Quickstart', 'quickstart']
  - ['Conventions', 'conventions']
  - ['Pipeline Design', 'design']
---

<!-- TODO: include interactive demo -->

### Quickstart {hidden="true"}

> #### 📖 Installation and usage
>
> For more details on how to use trained pipelines with spaCy, see the
> [usage guide](/usage/models).

import QuickstartModels from 'widgets/quickstart-models.js'

<QuickstartModels id="quickstart" />

## Package naming conventions {#conventions}

In general, spaCy expects all pipeline packages to follow the naming convention
of `[lang]_[name]`. For spaCy's pipelines, we also chose to divide the name
into three components:

1. **Type:** Capabilities (e.g. `core` for a general-purpose pipeline with
   vocabulary, syntax, entities and word vectors, or `dep` for only vocab and
   syntax).
2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`.
3. **Size:** Package size indicator, `sm`, `md` or `lg`.

For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
pipeline trained on written web text (blogs, news, comments) that includes
vocabulary, vectors, syntax and entities.
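
As a minimal sketch of the convention above, the following splits a package name into its language code and the three name components. The `PackageName` type and `parse_package_name` helper are hypothetical names used for illustration and are not part of spaCy's API. The sketch assumes the four-part scheme used by spaCy's own packages; third-party packages only need to follow `[lang]_[name]`.

```python
from typing import NamedTuple


class PackageName(NamedTuple):
    lang: str   # language code, e.g. "en"
    kind: str   # capabilities, e.g. "core" or "dep"
    genre: str  # type of training text, e.g. "web" or "news"
    size: str   # size indicator, e.g. "sm", "md" or "lg"


def parse_package_name(name: str) -> PackageName:
    """Split a name like 'en_core_web_sm' into its four parts (hypothetical helper)."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"Expected [lang]_[type]_[genre]_[size], got {name!r}")
    return PackageName(*parts)


print(parse_package_name("en_core_web_sm"))
```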

### Package versioning {#model-versioning}

Additionally, the pipeline package version reflects both the compatibility with
spaCy and the package's own major and minor version. A package version `a.b.c`
translates to:

- `a`: **spaCy major version**. For example, `2` for spaCy v2.x.
- `b`: **Package major version**. Pipelines with a different major version can't
  be loaded by the same code. For example, changing the width of the model,
  adding hidden layers or changing the activation changes the major version.
- `c`: **Package minor version**. Same pipeline structure, but different
  parameter values, e.g. from being trained on different data, for different
  numbers of iterations, etc.
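
The scheme above can be sketched in a few lines of Python. The `PackageVersion` type and both helper functions are hypothetical names chosen for illustration, and the sketch assumes plain numeric versions without pre-release suffixes.

```python
from typing import NamedTuple


class PackageVersion(NamedTuple):
    spacy_major: int    # a: spaCy major version
    package_major: int  # b: package major version
    package_minor: int  # c: package minor version


def parse_package_version(version: str) -> PackageVersion:
    """Parse an 'a.b.c' version string (hypothetical helper)."""
    a, b, c = (int(part) for part in version.split("."))
    return PackageVersion(a, b, c)


def same_loading_code(v1: str, v2: str) -> bool:
    """True if both versions share `a` and `b`, i.e. they differ only in parameter values."""
    return parse_package_version(v1)[:2] == parse_package_version(v2)[:2]


print(parse_package_version("2.3.1"))
print(same_loading_code("2.3.0", "2.3.1"))
print(same_loading_code("2.2.5", "2.3.0"))
```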

For a detailed compatibility overview, see the
[`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json).
This is also the source of spaCy's internal compatibility check, performed when
you run the [`download`](/api/cli#download) command.
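
To illustrate what such a compatibility lookup might look like, here is a rough sketch. It assumes the table maps a spaCy version to package names and their compatible package versions under a top-level `spacy` key; that layout, the sample data and the `compatible_versions` helper are all assumptions for illustration, so check the actual file before relying on them.

```python
from typing import Dict, List

# Made-up excerpt for illustration only; the version numbers are placeholders
# and the nesting is an assumption about the file's layout.
SAMPLE_TABLE: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "spacy": {
        "3.0.0": {"en_core_web_sm": ["3.0.0"], "en_core_web_trf": ["3.0.0"]},
    }
}


def compatible_versions(table: dict, spacy_version: str, package: str) -> List[str]:
    """Return the package versions listed for a spaCy version, or an empty list."""
    return table.get("spacy", {}).get(spacy_version, {}).get(package, [])


print(compatible_versions(SAMPLE_TABLE, "3.0.0", "en_core_web_sm"))
print(compatible_versions(SAMPLE_TABLE, "3.0.0", "de_core_news_sm"))
```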

## Trained pipeline design {#design}

The spaCy v3 trained pipelines are designed to be efficient and configurable.
For example, multiple components can share a common "token-to-vector" model and
it's easy to swap out or disable the lemmatizer. The pipelines are designed to
be efficient in terms of speed and size and work well when the pipeline is run
in full.

When modifying a trained pipeline, it's important to understand how the
components **depend on** each other. Unlike spaCy v2, where the `tagger`,
`parser` and `ner` components were all independent, some v3 components depend on
earlier components in the pipeline. As a result, disabling or reordering
components can affect the annotation quality or lead to warnings and errors.

Main changes from spaCy v2 models:

- The [`Tok2Vec`](/api/tok2vec) component may be a separate, shared component. A
  component like a tagger or parser can
  [listen](/api/architectures#Tok2VecListener) to an earlier `tok2vec` or
  `transformer` rather than having its own separate tok2vec layer.
- Rule-based exceptions move from individual components to the
  `attribute_ruler`. Lemma and POS exceptions move from the tokenizer exceptions
  to the attribute ruler and the tag map and morph rules move from the tagger to
  the attribute ruler.
- The lemmatizer tables and processing move from the vocab and tagger to a
  separate `lemmatizer` component.

### CNN/CPU pipeline design

![Components and their dependencies in the CNN pipelines](../images/pipeline-design.svg)

In the `sm`/`md`/`lg` models:

- The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec`
  component.
- The `attribute_ruler` maps `token.tag` to `token.pos` if there is no
  `morphologizer`. The `attribute_ruler` additionally makes sure whitespace is
  tagged consistently and copies `token.pos` to `token.tag` if there is no
  tagger. For English, the attribute ruler can improve its mapping from
  `token.tag` to `token.pos` if dependency parses from a `parser` are present,
  but the parser is not required.
- The rule-based `lemmatizer` (Dutch, English, French, Greek, Macedonian,
  Norwegian and Spanish) requires `token.pos` annotation from either
  `tagger`+`attribute_ruler` or `morphologizer`.
- The `ner` component is independent with its own internal tok2vec layer.
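
The dependencies listed above can be written down as a small lookup table to check whether a planned set of active components leaves anything without its inputs. This is only a sketch: the `REQUIREMENTS` table and the `unmet_requirements` helper are hypothetical, encode just the relationships described in this section, and are not part of spaCy.

```python
# Each component maps to a list of alternatives. A requirement is met when
# every name in at least one alternative is active. `ner` and
# `attribute_ruler` have no hard requirements here and are omitted.
REQUIREMENTS = {
    "tagger": [{"tok2vec"}],
    "morphologizer": [{"tok2vec"}],
    "parser": [{"tok2vec"}],
    "lemmatizer": [{"tagger", "attribute_ruler"}, {"morphologizer"}],
}


def unmet_requirements(active):
    """Return the active components whose requirements are not satisfied."""
    active = set(active)
    return sorted(
        name
        for name, alternatives in REQUIREMENTS.items()
        if name in active and not any(alt <= active for alt in alternatives)
    )


# The lemmatizer is active, but neither tagger+attribute_ruler nor morphologizer is.
print(unmet_requirements(["tok2vec", "parser", "lemmatizer", "ner"]))
# Running only `ner` leaves nothing unmet.
print(unmet_requirements(["ner"]))
```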

### Transformer pipeline design

In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
all listen to the `transformer` component. The `attribute_ruler` and
`lemmatizer` have the same configuration as in the CNN models.

<!-- TODO: pretty diagram -->

### Modifying the default pipeline

For faster processing, you may only want to run a subset of the components in a
trained pipeline. The `disable` and `exclude` arguments to
[`spacy.load`](/api/top-level#spacy.load) let you control which components are
loaded and run. Disabled components are loaded in the background so it's
possible to re-enable them in the same pipeline in the future with
[`nlp.enable_pipe`](/api/language/#enable_pipe). To skip loading a component
completely, use `exclude` instead of `disable`.
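
The difference between a disabled and an active component can be seen in a small sketch. It uses a blank English pipeline with a `sentencizer` so that no trained package has to be downloaded; that setup is an assumption made for illustration, and the same calls apply to a pipeline loaded with `spacy.load`.

```python
import spacy

# A blank pipeline with one rule-based component, used here only for illustration.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Disabling keeps the component loaded but skips it when processing text.
nlp.disable_pipe("sentencizer")
print(nlp.pipe_names)       # active components
print(nlp.disabled)         # loaded but disabled components
print(nlp.component_names)  # all loaded components, active or not

# Re-enable the component in the same pipeline object.
nlp.enable_pipe("sentencizer")
doc = nlp("This is one sentence. This is another.")
print([sent.text for sent in doc.sents])
```

A component passed to `exclude` would not appear in any of these lists, since it is never loaded.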

#### Disable part-of-speech tagging and lemmatization

To disable part-of-speech tagging and lemmatization, disable the `tagger`,
`morphologizer`, `attribute_ruler` and `lemmatizer` components.

```python
import spacy

# Note: English doesn't include a morphologizer
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer"])
nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemmatizer"])
```

<Infobox variant="warning" title="Rule-based lemmatizers require Token.pos">

The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for
Dutch, English, French, Greek, Macedonian, Norwegian and Spanish. If you disable
any of these components, you'll see lemmatizer warnings unless the lemmatizer is
also disabled.

</Infobox>

#### Use senter rather than parser for fast sentence segmentation

If you need fast sentence segmentation without dependency parses, disable the
`parser` and use the `senter` component instead:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe("parser")
nlp.enable_pipe("senter")
```

The `senter` component is ~10× faster than the parser and more accurate
than the rule-based `sentencizer`.

#### Switch from rule-based to lookup lemmatization

For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish
pipelines, you can switch from the default rule-based lemmatizer to a lookup
lemmatizer:

```python
# Requirements: pip install spacy-lookups-data
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
```

#### Disable everything except NER

For the non-transformer models, the `ner` component is independent, so you can
disable everything else:

```python
import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
```

In the transformer models, `ner` listens to the `transformer` component, so you
can disable all components related to tagging, parsing, and lemmatization.

```python
import spacy
nlp = spacy.load("en_core_web_trf", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"])
```

#### Move NER to the end of the pipeline

For access to `POS` and `LEMMA` features in an `entity_ruler`, move `ner` to the
end of the pipeline after `attribute_ruler` and `lemmatizer`:

```python
import spacy

# load without NER
nlp = spacy.load("en_core_web_sm", exclude=["ner"])

# source NER from the same pipeline package as the last component
nlp.add_pipe("ner", source=spacy.load("en_core_web_sm"))

# insert the entity ruler
nlp.add_pipe("entity_ruler", before="ner")
```