mirror of https://github.com/explosion/spaCy.git
Add details about pretrained pipeline design
parent 61472e7cb3
commit a5ffe8dfed
@@ -4,6 +4,7 @@ teaser: Downloadable trained pipelines and weights for spaCy
menu:
  - ['Quickstart', 'quickstart']
  - ['Conventions', 'conventions']
  - ['Pipeline Design', 'design']
---

<!-- TODO: include interactive demo -->

@@ -53,3 +54,146 @@ For a detailed compatibility overview, see the
[`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json).
This is also the source of spaCy's internal compatibility check, performed when
you run the [`download`](/api/cli#download) command.

## Pretrained pipeline design {#design}

The spaCy v3 pretrained pipelines are designed to be efficient and configurable.
For example, multiple components can share a common "token-to-vector" model and
it's easy to swap out or disable the lemmatizer. The pipelines are optimized for
speed and size and work well when run in full.

When modifying a pretrained v3 pipeline, it's important to understand how the
components **depend on** each other. Unlike spaCy v2, where the `tagger`,
`parser` and `ner` components were all independent, some v3 components depend on
earlier components in the pipeline. As a result, disabling or reordering
components can affect the annotation quality or lead to warnings and errors.

Main changes from spaCy v2 models:

- The [`Tok2Vec`](/api/tok2vec) component may be a separate, shared component. A
  component like a tagger or parser can
  [listen](/api/architectures#Tok2VecListener) to an earlier `tok2vec` or
  `transformer` rather than having its own separate tok2vec layer.
- Rule-based exceptions move from individual components to the
  `attribute_ruler`. Lemma and POS exceptions move from the tokenizer exceptions
  to the attribute ruler and the tag map and morph rules move from the tagger to
  the attribute ruler.
- The lemmatizer tables and processing move from the vocab and tagger to a
  separate `lemmatizer` component.

### CNN/CPU pipeline design

In the `sm`/`md`/`lg` models:

- The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec`
  component.
- The `attribute_ruler` maps `token.tag` to `token.pos` if there is no
  `morphologizer`. The `attribute_ruler` additionally makes sure whitespace is
  tagged consistently and copies `token.pos` to `token.tag` if there is no
  tagger. For English, the attribute ruler can improve its mapping from
  `token.tag` to `token.pos` if dependency parses from a `parser` are present,
  but the parser is not required.
- The rule-based `lemmatizer` (Dutch, English, French, Greek, Macedonian,
  Norwegian and Spanish) requires `token.pos` annotation from either
  `tagger`+`attribute_ruler` or `morphologizer`.
- The `ner` component is independent with its own internal tok2vec layer.

<!-- TODO: pretty diagram -->

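The dependencies in the list above can be sketched as a small lookup table. This
is a simplified, hand-written reading of that list for an English CNN pipeline,
not data or an API provided by spaCy: the `REQUIRES` table and the
`affected_components` helper are hypothetical names introduced only for
illustration.

```python
# Simplified reading of the dependency list above for an English CNN
# pipeline. This table is an illustration, not something shipped with spaCy.
REQUIRES = {
    "tok2vec": set(),
    "tagger": {"tok2vec"},
    "parser": {"tok2vec"},
    "attribute_ruler": {"tagger"},
    "lemmatizer": {"tagger", "attribute_ruler"},
    "ner": set(),
}


def affected_components(disabled):
    """Return enabled components that lose an input they rely on."""
    unavailable = set(disabled)
    affected = set()
    changed = True
    # Propagate until stable, so indirect dependencies are caught too.
    while changed:
        changed = False
        for name, deps in REQUIRES.items():
            if name in unavailable:
                continue
            if deps & unavailable:
                unavailable.add(name)
                affected.add(name)
                changed = True
    return sorted(affected)


print(affected_components({"tok2vec"}))
print(affected_components({"tagger"}))
```

Under these assumptions, disabling only `tok2vec` affects every listener and
everything downstream of the tagger, while `ner` is never affected.
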
### Transformer pipeline design

In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
all listen to the `transformer` component. The `attribute_ruler` and
`lemmatizer` have the same configuration as in the CNN models.

<!-- TODO: pretty diagram -->

### Modifying the default pipeline

For faster processing, you may only want to run a subset of the components in a
pretrained pipeline. The `disable` and `exclude` arguments to
[`spacy.load`](/api/top-level#spacy.load) let you control which components are
loaded and run. Disabled components are loaded in the background so it's
possible to re-enable them in the same pipeline in the future with
[`nlp.enable_pipe`](/api/language/#enable_pipe). To skip loading a component
completely, use `exclude` instead of `disable`.

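As a minimal sketch of how a disabled component stays loaded, the snippet below
uses a blank English pipeline so that no pipeline package has to be installed.
The `sentencizer` is only a stand-in for a trained component, and the variable
names are illustrative.

```python
import spacy

# A blank pipeline avoids downloading a package; the sentencizer stands in
# for a trained component here.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Disable the component: it stays loaded but is no longer run.
nlp.disable_pipe("sentencizer")
active_while_disabled = list(nlp.pipe_names)  # components that will run
disabled_names = list(nlp.disabled)           # loaded but skipped
print(active_while_disabled, disabled_names, nlp.component_names)

# Re-enable it later in the same pipeline object.
nlp.enable_pipe("sentencizer")
print(nlp.pipe_names)
```

A component left out with `exclude` would not appear in `nlp.component_names`
at all, so it could not be re-enabled this way.
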
#### Disable part-of-speech tagging and lemmatization

To disable part-of-speech tagging and lemmatization, disable the `tagger`,
`morphologizer`, `attribute_ruler` and `lemmatizer` components.

```python
import spacy

# Note: English doesn't include a morphologizer
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer"])
nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemmatizer"])
```

<Infobox variant="warning" title="Rule-based lemmatizers require Token.pos">

The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for
Dutch, English, French, Greek, Macedonian, Norwegian and Spanish. If you disable
any of these components, you'll see lemmatizer warnings unless the lemmatizer is
also disabled.

</Infobox>

#### Use senter rather than parser for fast sentence segmentation

If you need fast sentence segmentation without dependency parses, disable the
`parser` and use the `senter` component instead:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe("parser")
nlp.enable_pipe("senter")
```

The `senter` component is ~10× faster than the parser and more accurate
than the rule-based `sentencizer`.

#### Switch from rule-based to lookup lemmatization

For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish
pipelines, you can switch from the default rule-based lemmatizer to a lookup
lemmatizer:

```python
# Requirements: pip install spacy-lookups-data
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
```

#### Disable everything except NER

For the non-transformer models, the `ner` component is independent, so you can
disable everything else:

```python
import spacy

nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
```

In the transformer models, `ner` listens to the `transformer` layer, so you can
disable all components related to tagging, parsing, and lemmatization.

```python
import spacy

nlp = spacy.load("en_core_web_trf", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"])
```

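The two `disable` lists above differ only in what has to stay enabled for `ner`
to work. A rough sketch of that reasoning, using a hypothetical helper
`disable_all_except` and component orders that are assumed here for
illustration (check `nlp.pipe_names` on the installed package for the actual
order):

```python
def disable_all_except(pipe_names, keep):
    """Return the names to pass as `disable`, preserving pipeline order."""
    keep = set(keep)
    return [name for name in pipe_names if name not in keep]


# Assumed component orders, for illustration only.
CNN_PIPES = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
TRF_PIPES = ["transformer", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]

# CNN: ner has its own tok2vec layer, so only ner is kept.
print(disable_all_except(CNN_PIPES, keep=["ner"]))

# Transformer: ner listens to the transformer, so both are kept.
print(disable_all_except(TRF_PIPES, keep=["transformer", "ner"]))
```
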
#### Move NER to the end of the pipeline

For access to `POS` and `LEMMA` features in an `entity_ruler`, move `ner` to the
end of the pipeline after `attribute_ruler` and `lemmatizer`:

```python
import spacy

# load without NER
nlp = spacy.load("en_core_web_sm", exclude=["ner"])

# source NER from the same pipeline package as the last component
nlp.add_pipe("ner", source=spacy.load("en_core_web_sm"))

# insert the entity ruler
nlp.add_pipe("entity_ruler", before="ner")
```