spaCy/website/docs/usage/101/_architecture.md

The central data structures in spaCy are the [`Language`](/api/language) class,
the [`Vocab`](/api/vocab) and the [`Doc`](/api/doc) object. The `Language` class
is used to process a text and turn it into a `Doc` object. It's typically stored
as a variable called `nlp`. The `Doc` object owns the **sequence of tokens** and
all their annotations. By centralizing strings, word vectors and lexical
attributes in the `Vocab`, we avoid storing multiple copies of this data. This
saves memory, and ensures there's a **single source of truth**.
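
As a minimal sketch of this sharing, the example below uses a blank English pipeline (an assumption made so that no trained package needs to be installed) and an arbitrary sample sentence. It checks that the `Doc` points to the same `Vocab` as the `nlp` object and that a string is stored once and referenced by its hash.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("I love coffee")

# The Doc shares the vocabulary of the Language object that created it.
assert doc.vocab is nlp.vocab

# Strings are stored once in the vocab's StringStore and referenced by hash.
coffee_hash = nlp.vocab.strings["coffee"]
assert nlp.vocab.strings[coffee_hash] == "coffee"

# The token stores only the hash, not its own copy of the string.
assert doc[2].orth == coffee_hash
```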

Text annotations are also designed to allow a single source of truth: the `Doc`
object owns the data, and [`Span`](/api/span) and [`Token`](/api/token) are
**views that point into it**. The `Doc` object is constructed by the
[`Tokenizer`](/api/tokenizer), and then **modified in place** by the components
of the pipeline. The `Language` object coordinates these components. It takes
raw text and sends it through the pipeline, returning an **annotated document**.
It also orchestrates training and serialization.
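
The following sketch illustrates the "views" idea under simple assumptions: a blank English pipeline, an arbitrary sentence, and a made-up lemma value that is written by hand where a pipeline component would normally set it.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world, this is spaCy")

token = doc[1]   # Token view of the second token
span = doc[0:2]  # Span view of the first two tokens

# Both views keep a reference to the Doc that owns the data.
assert token.doc is doc
assert span.doc is doc

# Writing an annotation through one view changes the Doc in place,
# so every other view of the same token sees the new value.
# (The lemma value is made up for illustration.)
doc[1].lemma_ = "planet"
assert token.lemma_ == "planet"
assert span[1].lemma_ == "planet"
```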

![Library architecture](../../images/architecture.svg)

### Container objects {#architecture-containers}

| Name                        | Description                                                                                                                                             |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`Language`](/api/language) | Processing class that turns text into `Doc` objects. Different languages implement their own subclasses of it. The variable is typically called `nlp`.  |
| [`Doc`](/api/doc)           | A container for accessing linguistic annotations.                                                                                                       |
| [`Span`](/api/span)         | A slice from a `Doc` object.                                                                                                                            |
| [`Token`](/api/token)       | An individual token, e.g. a word, punctuation symbol or whitespace.                                                                                     |
| [`Lexeme`](/api/lexeme)     | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
| [`Example`](/api/example)   | A collection of training annotations, containing two `Doc` objects: the reference data and the predictions.                                             |
| [`DocBin`](/api/docbin)     | A collection of `Doc` objects for efficient binary serialization. Also used for [training data](/api/data-formats#binary-training).                     |
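
To make two of these containers more concrete, here is a short sketch that assumes a blank English pipeline and an arbitrary sample sentence. It looks up a `Lexeme`, which carries only context-independent attributes, and round-trips a `Doc` through a `DocBin`.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup")

# A Lexeme is a context-independent vocabulary entry for a word type.
lexeme = nlp.vocab["Apple"]
assert lexeme.text == "Apple"
assert lexeme.is_alpha
# The token in the Doc refers to the same word type via its hash.
assert lexeme.orth == doc[0].orth

# A DocBin packs Doc objects into a single binary blob.
doc_bin = DocBin()
doc_bin.add(doc)
data = doc_bin.to_bytes()

# Restoring needs a vocab, because strings live in the Vocab, not the Doc.
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
assert restored[0].text == doc.text
```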

### Processing pipeline {#architecture-pipeline}

The processing pipeline consists of one or more **pipeline components** that are
called on the `Doc` in order. The tokenizer runs before the components. Pipeline
components can be added using [`Language.add_pipe`](/api/language#add_pipe).
They can contain a statistical model and trained weights, or only make
rule-based modifications to the `Doc`. spaCy provides a range of built-in
components for different language processing tasks and also allows adding
[custom components](/usage/processing-pipelines#custom-components).
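
A minimal sketch of both kinds of component is shown below. It assumes a blank English pipeline, adds the built-in rule-based `sentencizer`, and registers a hypothetical custom component called `token_counter` (the name and its behavior are made up for this example).

```python
import spacy
from spacy.language import Language


# Hypothetical custom component: it receives the Doc, modifies it in
# place and returns it so the next component can continue.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")              # built-in, rule-based
nlp.add_pipe("token_counter", last=True)  # custom, added by its registered name

# The tokenizer runs first, then the components in pipeline order.
doc = nlp("This is a sentence. This is another one.")
assert nlp.pipe_names == ["sentencizer", "token_counter"]
assert len(list(doc.sents)) == 2
assert doc.user_data["n_tokens"] == len(doc)
```

Components are added by their string name, so the custom function has to be registered with the `@Language.component` decorator before `add_pipe` is called.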

![The processing pipeline](../../images/pipeline.svg)

| Name                                            | Description                                                                                 |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------- |
| [`Tokenizer`](/api/tokenizer)                   | Segment raw text and create `Doc` objects from the words.                                   |
| [`Tok2Vec`](/api/tok2vec)                       | Apply a "token-to-vector" model and set its outputs.                                        |
| [`Transformer`](/api/transformer)               | Use a transformer model and set its outputs.                                                |
| [`Lemmatizer`](/api/lemmatizer)                 | Determine the base forms of words.                                                          |
| [`Morphologizer`](/api/morphologizer)           | Predict morphological features and coarse-grained part-of-speech tags.                      |
| [`Tagger`](/api/tagger)                         | Predict part-of-speech tags.                                                                |
| [`AttributeRuler`](/api/attributeruler)         | Set token attributes using matcher rules.                                                   |
| [`DependencyParser`](/api/dependencyparser)     | Predict syntactic dependencies.                                                             |
| [`EntityRecognizer`](/api/entityrecognizer)     | Predict named entities, e.g. persons or products.                                           |
| [`EntityRuler`](/api/entityruler)               | Add entity spans to the `Doc` using token-based rules or exact phrase matches.              |
| [`EntityLinker`](/api/entitylinker)             | Disambiguate named entities to nodes in a knowledge base.                                   |
| [`TextCategorizer`](/api/textcategorizer)       | Predict categories or labels over the whole document.                                       |
| [`Sentencizer`](/api/sentencizer)               | Implement rule-based sentence boundary detection that doesn't require the dependency parse. |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Predict sentence boundaries.                                                                |
| [Other functions](/api/pipeline-functions)      | Automatically apply something to the `Doc`, e.g. to merge spans of tokens.                  |
| [`Pipe`](/api/pipe)                             | Base class that all trainable pipeline components inherit from.                             |

### Matchers {#architecture-matchers}

Matchers help you find and extract information from [`Doc`](/api/doc) objects
based on match patterns describing the sequences you're looking for. A matcher
operates on a `Doc` and gives you access to the matched tokens **in context**.

| Name                                          | Description                                                                                                                                                                         |
| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`Matcher`](/api/matcher)                     | Match sequences of tokens, based on pattern rules, similar to regular expressions.                                                                                                  |
| [`PhraseMatcher`](/api/phrasematcher)         | Match sequences of tokens based on phrases.                                                                                                                                         |
| [`DependencyMatcher`](/api/dependencymatcher) | Match sequences of tokens based on dependency trees using the [Semgrex syntax](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). |
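
The sketch below shows the `Matcher` and `PhraseMatcher` on a blank English pipeline. The sample text, the pattern and the keys `HELLO_WORLD` and `GREETING` are assumptions chosen for illustration. Matches are returned as token offsets, so slicing the `Doc` gives you the matched `Span` in context.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")
doc = nlp("Hello, world! Hello world!")

# Token-based pattern: "hello", optional punctuation, "world".
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])

# Each match is a (match_id, start, end) tuple of token offsets.
matches = sorted(matcher(doc), key=lambda m: m[1])
token_spans = [doc[start:end].text for _, start, end in matches]
assert token_spans == ["Hello, world", "Hello world"]

# Phrase-based matching: patterns are Doc objects, compared token by token.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("GREETING", [nlp.make_doc("Hello world")])
phrase_spans = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
assert phrase_spans == ["Hello world"]
```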

### Other classes {#architecture-other}

| Name                                             | Description                                                                                                      |
| ------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------- |
| [`Vocab`](/api/vocab)                            | The shared vocabulary that stores strings and gives you access to [`Lexeme`](/api/lexeme) objects.               |
| [`StringStore`](/api/stringstore)                | Map strings to and from hash values.                                                                             |
| [`Vectors`](/api/vectors)                        | Container class for vector data keyed by string.                                                                 |
| [`Lookups`](/api/lookups)                        | Container for convenient access to large lookup tables and dictionaries.                                         |
| [`Morphology`](/api/morphology)                  | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag. |
| [`MorphAnalysis`](/api/morphology#morphanalysis) | A morphological analysis.                                                                                        |
| [`KnowledgeBase`](/api/kb)                       | Storage for entities and aliases of a knowledge base for entity linking.                                         |
| [`Scorer`](/api/scorer)                          | Compute evaluation scores.                                                                                       |
| [`Corpus`](/api/corpus)                          | Class for managing annotated corpora for training and evaluation data.                                           |
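
As a small illustration of the `StringStore`, the sketch below creates a standalone store with a few arbitrary strings and maps them to hash values and back. In a real pipeline you would usually reach the store through `nlp.vocab.strings` instead of creating your own.

```python
from spacy.strings import StringStore

# A standalone store, initialized with two example strings.
store = StringStore(["apple", "orange"])

# Looking up a string returns its hash; looking up the hash returns the string.
apple_hash = store["apple"]
assert isinstance(apple_hash, int)
assert store[apple_hash] == "apple"

# Membership checks work on strings that have been added.
assert "apple" in store
assert "banana" not in store

# Adding a new string returns its hash and makes it resolvable.
banana_hash = store.add("banana")
assert store[banana_hash] == "banana"
```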