spaCy/website/docs/usage/_spacy-101/_pipelines.jade

//- 💫 DOCS > USAGE > SPACY 101 > PIPELINES
p
| When you call #[code nlp] on a text, spaCy first tokenizes the text to
| produce a #[code Doc] object. The #[code Doc] is then processed in several
| different steps; this is also referred to as the
| #[strong processing pipeline]. The pipeline used by the
| #[+a("/docs/usage/models") default models] consists of a
| vectorizer, a tagger, a parser and an entity recognizer. Each pipeline
| component returns the processed #[code Doc], which is then passed on to
| the next component.
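p
| As a rough illustration of this hand-off, the sketch below models a
| pipeline as a list of functions that each receive a document, modify it
| and return it. The names #[code ToyDoc], #[code toy_tokenizer],
| #[code toy_tagger], #[code toy_parser] and #[code run_pipeline] are
| hypothetical and are not part of spaCy's API. The tagging rule is a
| placeholder, not a statistical model.

```python
# Toy illustration only: every class and function here is hypothetical
# and does not reflect spaCy's actual implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyDoc:
    words: List[str]
    tags: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)


def toy_tokenizer(text: str) -> ToyDoc:
    # The tokenizer is the only step that turns raw text into a document.
    return ToyDoc(words=text.split())


def toy_tagger(doc: ToyDoc) -> ToyDoc:
    # Placeholder rule: mark digit-only tokens as NUM, everything else as X.
    doc.tags = ["NUM" if word.isdigit() else "X" for word in doc.words]
    doc.history.append("tagger")
    return doc


def toy_parser(doc: ToyDoc) -> ToyDoc:
    # Does no real parsing; it only records that it ran.
    doc.history.append("parser")
    return doc


def run_pipeline(text: str, pipeline: List[Callable[[ToyDoc], ToyDoc]]) -> ToyDoc:
    doc = toy_tokenizer(text)
    for component in pipeline:
        # Each component returns the document, which feeds the next one.
        doc = component(doc)
    return doc


doc = run_pipeline("spaCy has 4 components", [toy_tagger, toy_parser])
print(doc.words, doc.tags, doc.history)
```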
+image
include ../../../assets/img/docs/pipeline.svg
.u-text-right
+button("/assets/img/docs/pipeline.svg", false, "secondary").u-text-tag View large graphic
+aside
| #[strong Name:] ID of the pipeline component.#[br]
| #[strong Component:] spaCy's implementation of the component.#[br]
| #[strong Creates:] Objects, attributes and properties modified and set by
| the component.
+table(["Name", "Component", "Creates"])
+row
+cell tokenizer
+cell #[+api("tokenizer") #[code Tokenizer]]
+cell #[code Doc]
+row("divider")
+cell vectorizer
+cell #[code Vectorizer]
+cell #[code Doc.tensor]
+row
+cell tagger
+cell #[+api("tagger") #[code Tagger]]
+cell #[code Doc[i].tag]
+row
+cell parser
+cell #[+api("dependencyparser") #[code DependencyParser]]
+cell
| #[code Doc[i].head], #[code Doc[i].dep], #[code Doc.sents],
| #[code Doc.noun_chunks]
+row
+cell ner
+cell #[+api("entityrecognizer") #[code EntityRecognizer]]
+cell #[code Doc.ents], #[code Doc[i].ent_iob], #[code Doc[i].ent_type]
p
| The processing pipeline always #[strong depends on the statistical model]
| and its capabilities. For example, a pipeline can only include an entity
| recognizer component if the model includes data to make predictions of
| entity labels. This is why each model will specify the pipeline to use
| in its metadata, as a simple list containing the component names:
+code(false, "json").
"pipeline": ["vectorizer", "tagger", "parser", "ner"]