spaCy/website/docs/api/architectures.md

title: Model Architectures
teaser: Pre-defined model architectures included with the core library
source: spacy/ml/models
menu:
  - Tok2Vec (tok2vec)
  - Transformers (transformers)
  - Parser & NER (parser)
  - Tagging (tagger)
  - Text Classification (textcat)
  - Entity Linking (entitylinker)

TODO: intro and how architectures work, link to registry, custom models usage etc.

Tok2Vec architectures

spacy.HashEmbedCNN.v1

Example Config

[model]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

Build spaCy's 'standard' tok2vec layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout.

Name Type Description
width int The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are 96, 128 or 300.
depth int The number of convolutional layers to use. Recommended values are between 2 and 8.
embed_size int The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between 2000 and 10000.
window_size int The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be depth * window_size * 2 + 1, so a 4-layer network with a window size of 2 will be sensitive to 17 words at a time. Recommended value is 1.
maxout_pieces int The number of pieces to use in the maxout non-linearity. If 1, the Mish non-linearity is used instead. Recommended values are 1-3.
subword_features bool Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese.
pretrained_vectors bool Whether to also use static vectors.
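
The receptive-field arithmetic for window_size can be checked with a few lines of Python. This is only a sketch: the helper name receptive_field is hypothetical and not part of spaCy, and it assumes each convolutional layer widens the context by window_size tokens on either side of the current token.

```python
def receptive_field(depth: int, window_size: int) -> int:
    """Number of tokens that can influence a single output vector."""
    # The token itself, plus window_size tokens per side for every layer.
    return depth * window_size * 2 + 1


print(receptive_field(depth=4, window_size=2))  # 17
print(receptive_field(depth=4, window_size=1))  # 9
```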

spacy.Tok2Vec.v1

Example config

[model]
@architectures = "spacy.Tok2Vec.v1"

[model.embed]

[model.encode]

Construct a tok2vec model out of embedding and encoding subnetworks. See the "Embed, Encode, Attend, Predict" blog post for background.

Name Type Description
embed Model Input: List[Doc]. Output: List[Floats2d]. Embed tokens into context-independent word vector representations.
encode Model Input: List[Floats2d]. Output: List[Floats2d]. Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer.

spacy.Tok2VecListener.v1

Example config

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
width = 342

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model:width}

A listener is used as a sublayer within a component such as a DependencyParser, EntityRecognizer or TextCategorizer. Usually you'll have multiple listeners connecting to a single upstream Tok2Vec component that's earlier in the pipeline. The listener layers act as proxies, passing the predictions from the Tok2Vec component into downstream components, and communicating gradients back upstream.

Instead of defining its own Tok2Vec instance, a model architecture like Tagger can define a listener as its tok2vec argument that connects to the shared tok2vec component in the pipeline.

Name Type Description
width int The width of the vectors produced by the "upstream" Tok2Vec component.
upstream str A string to identify the "upstream" Tok2Vec component to communicate with. The upstream name should either be the wildcard string "*", or the name of the Tok2Vec component. You'll almost never have multiple upstream Tok2Vec components, so the wildcard string will almost always be fine.
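
The proxy behaviour described above can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not spaCy's implementation: the classes ToyTok2Vec and ToyListener are hypothetical, the "embedding" is a stand-in, and gradients from the listeners are assumed to be summed upstream.

```python
from typing import Callable, List

Vectors = List[List[float]]


class ToyListener:
    """Proxy that hands upstream vectors to one downstream component."""

    def __init__(self) -> None:
        self.outputs: Vectors = []
        self.backprop: Callable[[Vectors], None] = lambda grads: None

    def receive(self, outputs: Vectors, backprop: Callable[[Vectors], None]) -> None:
        self.outputs = outputs
        self.backprop = backprop


class ToyTok2Vec:
    """Upstream component: runs once, then broadcasts to its listeners."""

    def __init__(self, width: int) -> None:
        self.width = width
        self.listeners: List[ToyListener] = []
        self.grads: Vectors = []

    def predict(self, words: List[str]) -> None:
        # Stand-in "embedding": the word length repeated across the width.
        outputs = [[float(len(word))] * self.width for word in words]
        self.grads = [[0.0] * self.width for _ in words]
        for listener in self.listeners:
            listener.receive(outputs, self._accumulate)

    def _accumulate(self, grads: Vectors) -> None:
        # Gradients from every downstream component are summed upstream.
        for total, grad in zip(self.grads, grads):
            for i, value in enumerate(grad):
                total[i] += value


tok2vec = ToyTok2Vec(width=2)
tagger_listener, parser_listener = ToyListener(), ToyListener()
tok2vec.listeners += [tagger_listener, parser_listener]
tok2vec.predict(["hello", "hi"])
tagger_listener.backprop([[1.0, 1.0], [0.0, 0.0]])
parser_listener.backprop([[0.5, 0.0], [2.0, 0.0]])
print(tok2vec.grads)  # [[1.5, 1.0], [2.0, 0.0]]
```

Both listeners see the same output vectors, so the upstream model only needs to run once per batch.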

spacy.MultiHashEmbed.v1

Example config

[model]
@architectures = "spacy.MultiHashEmbed.v1"
width = 64
rows = 2000
also_embed_subwords = false
also_use_static_vectors = false

Construct an embedding layer that separately embeds a number of lexical attributes using hash embedding, concatenates the results, and passes them through a feed-forward subnetwork to build a mixed representation. The features used are NORM, PREFIX, SUFFIX and SHAPE, which can have varying definitions depending on the Vocab of the Doc object passed in. Vectors from pretrained static vectors can also be incorporated into the concatenated representation.

Name Type Description
width int The output width. Also used as the width of the embedding tables. Recommended values are between 64 and 300.
rows int The number of rows for the embedding tables. Can be low, due to the hashing trick. Embeddings for prefix, suffix and word shape use half as many rows. Recommended values are between 2000 and 10000.
also_embed_subwords bool Whether to use the PREFIX, SUFFIX and SHAPE features in the embeddings. If not using these, you may need more rows in your hash embeddings, as there will be increased chance of collisions.
also_use_static_vectors bool Whether to also use static word vectors. Requires a vectors table to be loaded in the Doc objects' vocab.
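
To make the role of rows more concrete, here is a minimal sketch of the hashing trick. It is not the library's code: the helper names are hypothetical, crc32 is a stand-in hash, and the feature definitions (lowercased text as NORM, a 1-character prefix, a 3-character suffix and a simplified word shape) are assumptions for illustration.

```python
import zlib
from typing import Dict


def word_shape(text: str) -> str:
    """Simplified shape: uppercase -> X, lowercase -> x, digit -> d."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)


def hash_row(value: str, n_rows: int) -> int:
    # crc32 is only a deterministic stand-in for the real hash function.
    return zlib.crc32(value.encode("utf8")) % n_rows


def feature_rows(word: str, rows: int, also_embed_subwords: bool = True) -> Dict[str, int]:
    """Map each lexical attribute of a word to a row in its embedding table."""
    result = {"NORM": hash_row(word.lower(), rows)}
    if also_embed_subwords:
        half = rows // 2  # subword tables use half as many rows
        result["PREFIX"] = hash_row(word[:1], half)
        result["SUFFIX"] = hash_row(word[-3:], half)
        result["SHAPE"] = hash_row(word_shape(word), half)
    return result


print(word_shape("Apple2020"))  # Xxxxxdddd
print(feature_rows("Apple", rows=2000))
```

Because unrelated values can hash to the same row, a smaller table means more collisions, which is why dropping the subword features may call for more rows.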

spacy.CharacterEmbed.v1

Example config

[model]
@architectures = "spacy.CharacterEmbed.v1"
width = 64
rows = 2000
nM = 16
nC = 4

Construct an embedded representation based on character embeddings, using a feed-forward network. A fixed number of UTF-8 byte characters are used for each word, taken from the beginning and end of the word equally. Padding is used in the center for words that are too short.

For instance, let's say nC=4, and the word is "jumping". The characters used will be "jung" (two from the start, two from the end). If we had nC=8, the characters would be "jumpping": 4 from the start, 4 from the end. This ensures that the final character is always in the last position, instead of being in an arbitrary position depending on the word length.
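
The selection rule in this example can be sketched as follows. The function select_bytes and the padding value are hypothetical; the sketch assumes nC is at least 2 and that, for odd nC, the extra byte comes from the end of the word.

```python
PAD = 0  # hypothetical padding value, used only in this sketch


def select_bytes(word: str, nC: int) -> bytes:
    """Pick nC UTF-8 bytes: half from the start, the rest from the end."""
    data = word.encode("utf8")
    n_front = nC // 2
    n_back = nC - n_front
    # Short words are padded towards the center, so the first byte stays
    # in the first position and the last byte stays in the last position.
    front = data[:n_front].ljust(n_front, bytes([PAD]))
    back = data[-n_back:].rjust(n_back, bytes([PAD]))
    return front + back


print(select_bytes("jumping", 4))  # b'jung'
print(select_bytes("jumping", 8))  # b'jumpping'
print(select_bytes("hi", 8))       # b'hi\x00\x00\x00\x00hi'
```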

The characters are embedded in an embedding table with 256 rows, and the vectors are concatenated. A hash-embedded vector of the NORM of the word is also concatenated on, and the result is then passed through a feed-forward network to construct a single vector to represent the information.

Name Type Description
width int The width of the output vector and the NORM hash embedding.
rows int The number of rows in the NORM hash embedding table.
nM int The dimensionality of the character embeddings. Recommended values are between 16 and 64.
nC int The number of UTF-8 bytes to embed per word. Recommended values are between 3 and 8, although it may depend on the length of words in the language.

spacy.MaxoutWindowEncoder.v1

Example config

[model]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 64
window_size = 1
maxout_pieces = 2
depth = 4

Encode context using convolutions with maxout activation, layer normalization and residual connections.

Name Type Description
width int The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between 64 and 300.
window_size int The number of words to concatenate around each token to construct the convolution. Recommended value is 1.
maxout_pieces int The number of maxout pieces to use. Recommended values are 2 or 3.
depth int The number of convolutional layers. Recommended value is 4.
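
As an illustration of what window_size controls, the following sketch concatenates each token vector with its neighbours. The function concat_window is hypothetical and zero padding at the sequence edges is an assumption; in the encoder, a layer such as maxout would then map the wider vector back down to width.

```python
from typing import List


def concat_window(vectors: List[List[float]], window_size: int) -> List[List[float]]:
    """Concatenate each vector with window_size neighbours on either side."""
    if not vectors:
        return []
    pad = [0.0] * len(vectors[0])  # assumed zero padding at the edges
    out = []
    for i in range(len(vectors)):
        row: List[float] = []
        for j in range(i - window_size, i + window_size + 1):
            row.extend(vectors[j] if 0 <= j < len(vectors) else pad)
        out.append(row)
    return out


tokens = [[1.0], [2.0], [3.0]]
print(concat_window(tokens, window_size=1))
# [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 0.0]]
```

Each output row has width * (2 * window_size + 1) values.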

spacy.MishWindowEncoder.v1

Example config

[model]
@architectures = "spacy.MishWindowEncoder.v1"
width = 64
window_size = 1
depth = 4

Encode context using convolutions with Mish activation, layer normalization and residual connections.

Name Type Description
width int The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between 64 and 300.
window_size int The number of words to concatenate around each token to construct the convolution. Recommended value is 1.
depth int The number of convolutional layers. Recommended value is 4.

spacy.TorchBiLSTMEncoder.v1

Example config

[model]
@architectures = "spacy.TorchBiLSTMEncoder.v1"
width = 64
window_size = 1
depth = 4

Encode context using bidirectional LSTM layers. Requires PyTorch.

Name Type Description
width int The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between 64 and 300.
window_size int The number of words to concatenate around each token to construct the convolution. Recommended value is 1.
depth int The number of recurrent layers. Recommended value is 4.

Transformer architectures

The following architectures are provided by the package spacy-transformers. See the usage documentation for how to integrate the architectures into your training config.

spacy-transformers.TransformerModel.v1

Example Config

[model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base"
tokenizer_config = {"use_fast": true}

[model.get_spans]
@span_getters = "strided_spans.v1"
window = 128
stride = 96

Name Type Description
name str Any model name that can be loaded by transformers.AutoModel.
get_spans Callable Function that takes a batch of Doc objects and returns lists of Span objects for the transformer to process. See here for built-in options and examples.
tokenizer_config Dict[str, Any] Tokenizer settings passed to transformers.AutoTokenizer.
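
To show how window and stride interact in the example config, here is a minimal sketch that works on token offsets only. The function strided_spans is hypothetical and is not the spacy-transformers implementation; it assumes spans start every stride tokens and cover at most window tokens.

```python
from typing import List, Tuple


def strided_spans(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets, with end exclusive."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the document is fully covered
        start += stride
    return spans


print(strided_spans(300, window=128, stride=96))
# [(0, 128), (96, 224), (192, 300)]
```

With window = 128 and stride = 96, consecutive spans overlap by 32 tokens, so most tokens near a span boundary are seen with context on both sides.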

spacy-transformers.Tok2VecListener.v1

Example Config

[model]
@architectures = "spacy-transformers.Tok2VecListener.v1"
grad_factor = 1.0

[model.pooling]
@layers = "reduce_mean.v1"

Name Type Description
grad_factor float Factor for weighting the gradient if multiple components listen to the same transformer model.
pooling Model[Ragged, Floats2d] Pooling layer to determine how the vector for each spaCy token will be computed.

Parser & NER architectures

spacy.TransitionBasedParser.v1

Example Config

[model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 6
hidden_width = 64
maxout_pieces = 2

[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

Build a transition-based parser model. It can be applied to NER or dependency parsing. Transition-based parsing is an approach to structured prediction where the task of predicting the structure is mapped to a series of state transitions. You might find this tutorial helpful for background information. The neural network state prediction model consists of either two or three subnetworks:

  • tok2vec: Map each token into a vector representation. This subnetwork is run once for each batch.
  • lower: Construct a feature-specific vector for each (token, feature) pair. This is also run once for each batch. Constructing the state representation is then simply a matter of summing the component features and applying the non-linearity.
  • upper (optional): A feed-forward network that predicts scores from the state representation. If not present, the output from the lower model is used as action scores directly.

Name Type Description
tok2vec Model Input: List[Doc]. Output: List[Floats2d]. Subnetwork to map tokens into vector representations.
nr_feature_tokens int The number of tokens in the context to use to construct the state vector. Valid choices are 1, 2, 3, 6, 8 and 13. The 2, 8 and 13 feature sets are designed for the parser, while the 3 and 6 feature sets are designed for the entity recognizer. The recommended feature sets are 3 for NER, and 8 for the dependency parser.
hidden_width int The width of the hidden layer.
maxout_pieces int How many pieces to use in the state prediction layer. Recommended values are 1, 2 or 3. If 1, the maxout non-linearity is replaced with a ReLU non-linearity if use_upper is True, and no non-linearity if False.
use_upper bool Whether to use an additional hidden layer after the state vector in order to predict the action scores. It is recommended to set this to False for large pretrained models such as transformers, and True for smaller networks. The upper layer is computed on CPU, which becomes a bottleneck on larger GPU-based models, where it's also less necessary.
nO int The number of actions the model will predict between. Usually inferred from data at the beginning of training, or loaded from disk.
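
The idea behind the lower subnetwork, precomputing one vector per (token, feature) pair and then summing, can be sketched in plain Python. Everything here is a simplified assumption rather than the library's code: the function names are hypothetical, missing context tokens (index -1) simply contribute nothing, and ReLU stands in for the non-linearity.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def precompute_lower(token_vectors: List[Vector], feature_weights: List[Matrix]) -> List[List[Vector]]:
    """table[token][slot] = hidden vector; computed once per batch."""
    return [[matvec(weights, vec) for weights in feature_weights] for vec in token_vectors]


def state_representation(table: List[List[Vector]], feature_tokens: List[int]) -> Vector:
    """Sum the precomputed vectors for the state's context tokens, then ReLU."""
    total = [0.0] * len(table[0][0])
    for slot, token in enumerate(feature_tokens):
        if token < 0:
            continue  # assumption: a missing context token adds nothing
        for i, value in enumerate(table[token][slot]):
            total[i] += value
    return [max(0.0, value) for value in total]


token_vectors = [[1.0, 0.0], [0.0, 1.0]]
feature_weights = [
    [[1.0, 0.0], [0.0, 1.0]],   # weights for feature slot 0
    [[2.0, 0.0], [0.0, -3.0]],  # weights for feature slot 1
]
table = precompute_lower(token_vectors, feature_weights)
print(state_representation(table, [0, 1]))  # [1.0, 0.0]
```

Building a state vector then costs only a few lookups and additions, which matters because the parser visits many states per document.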

spacy.BILUOTagger.v1

Example Config

[model]
@architectures = "spacy.BILUOTagger.v1"

[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
# etc.

Construct a simple NER tagger that predicts BILUO tag scores for each token and uses greedy decoding with transition constraints to return a valid BILUO tag sequence. A BILUO tag sequence encodes a sequence of non-overlapping labelled spans into tags assigned to each token. The first token of a span is given the tag B-LABEL, the last token of the span is given the tag L-LABEL, and tokens within the span are given the tag I-LABEL. Single-token spans are given the tag U-LABEL. All other tokens are assigned the tag O. The BILUO tag scheme generally results in better linear separation between classes, especially for non-CRF models, because there are more distinct classes for the different situations (Ratinov et al., 2009).

Name Type Description
tok2vec Model Input: List[Doc]. Output: List[Floats2d]. Subnetwork to map tokens into vector representations.
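
The BILUO encoding can be illustrated with a short sketch. The helper spans_to_biluo is hypothetical (it is not the library's own conversion utility) and assumes spans are given as non-overlapping (start, end, label) token offsets with an exclusive end.

```python
from typing import List, Tuple


def spans_to_biluo(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Encode labelled spans as one BILUO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token span
        else:
            tags[start] = f"B-{label}"  # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # tokens inside the span
            tags[end - 1] = f"L-{label}"  # last token
    return tags


# Tokens: Bank of New York opened in London
print(spans_to_biluo(7, [(0, 4, "ORG"), (6, 7, "GPE")]))
# ['B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-GPE']
```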

spacy.IOBTagger.v1

Example Config

[model]
@architectures = "spacy.IOBTagger.v1"

[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
# etc.

Construct a simple NER tagger that predicts IOB tag scores for each token and uses greedy decoding with transition constraints to return a valid IOB tag sequence. An IOB tag sequence encodes a sequence of non-overlapping labeled spans into tags assigned to each token. The first token of a span is given the tag B-LABEL, and subsequent tokens are given the tag I-LABEL. All other tokens are assigned the tag O.

Name Type Description
tok2vec Model Input: List[Doc]. Output: List[Floats2d]. Subnetwork to map tokens into vector representations.
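
Greedy decoding with transition constraints can be sketched as below. This is an illustrative assumption about how such a decoder might work, not the library's implementation: the function names are hypothetical, and the only constraint modelled is that I-LABEL may follow only B-LABEL or I-LABEL with the same label.

```python
from typing import List


def is_valid(prev_tag: str, tag: str) -> bool:
    """An I- tag must continue a span with the same label."""
    if tag.startswith("I-"):
        label = tag[2:]
        return prev_tag in (f"B-{label}", f"I-{label}")
    return True


def greedy_decode(scores: List[List[float]], tags: List[str]) -> List[str]:
    """Pick the best-scoring tag per token among the valid transitions."""
    decoded = []
    prev_tag = "O"  # the start of the sequence behaves like "outside"
    for row in scores:
        valid = [tag for tag in tags if is_valid(prev_tag, tag)]
        best = max(valid, key=lambda tag: row[tags.index(tag)])
        decoded.append(best)
        prev_tag = best
    return decoded


tags = ["O", "B-PER", "I-PER"]
scores = [
    [0.1, 0.2, 0.7],  # I-PER scores highest but cannot start a span
    [0.1, 0.2, 0.7],
    [0.6, 0.1, 0.3],
]
print(greedy_decode(scores, tags))  # ['B-PER', 'I-PER', 'O']
```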

Tagging architectures

spacy.Tagger.v1

Example Config

[model]
@architectures = "spacy.Tagger.v1"
nO = null

[model.tok2vec]
# ...

Build a tagger model, using a provided token-to-vector component. The tagger model simply adds a linear layer with softmax activation to predict scores given the token vectors.

Name Type Description
tok2vec Model Input: List[Doc]. Output: List[Floats2d]. Subnetwork to map tokens into vector representations.
nO int The number of tags to output. Inferred from the data if None.
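
The output layer described here, a linear layer followed by softmax, can be sketched for a single token vector. The function names, weights and bias values are made up for illustration; nO corresponds to the number of rows in the weight matrix.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tag_scores(token_vector: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Linear layer (one row of weights per tag) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, token_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # nO = 3 tags, width = 2
probs = tag_scores([1.0, 2.0], weights, bias=[0.0, 0.0, 0.0])
print(probs.index(max(probs)))  # 1
```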

Text classification architectures

A text classification architecture needs to take a Doc as input, and produce a score for each potential label class. Textcat challenges can be binary (e.g. sentiment analysis) or involve multiple possible labels. Multi-label challenges can either have mutually exclusive labels (each example has exactly one label), or multiple labels may be applicable at the same time.

As the properties of text classification problems can vary widely, we provide several different built-in architectures. It is recommended to experiment with different architectures and settings to determine what works best on your specific data and challenge.

spacy.TextCatEnsemble.v1

Stacked ensemble of a bag-of-words model and a neural network model. The neural network has an internal CNN Tok2Vec layer and uses attention.

Example Config

[model]
@architectures = "spacy.TextCatEnsemble.v1"
exclusive_classes = false
pretrained_vectors = null
width = 64
embed_size = 2000
conv_depth = 2
window_size = 1
ngram_size = 1
dropout = null
nO = null

Name Type Description
exclusive_classes bool Whether or not categories are mutually exclusive.
pretrained_vectors bool Whether or not pretrained vectors will be used in addition to the feature vectors.
width int Output dimension of the feature encoding step.
embed_size int Input dimension of the feature encoding step.
conv_depth int Depth of the Tok2Vec layer.
window_size int The number of contextual vectors to concatenate from the left and from the right.
ngram_size int Determines the maximum length of the n-grams in the BOW model. For instance, ngram_size=3 would give unigram, bigram and trigram features.
dropout float The dropout rate.
nO int Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when begin_training is called.
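
As a quick illustration of the ngram_size setting, this sketch lists the n-gram features extracted for a short token sequence. The helper ngram_features is hypothetical; the bag-of-words model itself works with hashed features rather than tuples of strings.

```python
from typing import List, Tuple


def ngram_features(tokens: List[str], ngram_size: int) -> List[Tuple[str, ...]]:
    """All n-grams of length 1 up to ngram_size, in order of length."""
    features = []
    for n in range(1, ngram_size + 1):
        for i in range(len(tokens) - n + 1):
            features.append(tuple(tokens[i:i + n]))
    return features


print(ngram_features(["not", "very", "good"], ngram_size=3))
# [('not',), ('very',), ('good',), ('not', 'very'), ('very', 'good'),
#  ('not', 'very', 'good')]
```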

spacy.TextCatCNN.v1

Example Config

[model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false
nO = null

[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster.

Name Type Description
exclusive_classes bool Whether or not categories are mutually exclusive.
tok2vec Model The tok2vec layer of the model.
nO int Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when begin_training is called.
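
Two ideas from this architecture, mean pooling and the effect of exclusive_classes on the output, can be sketched in plain Python. The function names are hypothetical, and the sketch assumes that softmax is used for mutually exclusive classes and a logistic activation otherwise, as described for the no_output_layer setting of spacy.TextCatBOW.v1.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average the token vectors into a single document vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def output_activation(logits: List[float], exclusive_classes: bool) -> List[float]:
    if exclusive_classes:
        # Softmax: the scores compete and sum to 1.
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]
    # Logistic: every label gets an independent score between 0 and 1.
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
print(output_activation([0.0, 0.0], exclusive_classes=False))  # [0.5, 0.5]
```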

spacy.TextCatBOW.v1

Example Config

[model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short.

Name Type Description
exclusive_classes bool Whether or not categories are mutually exclusive.
ngram_size int Determines the maximum length of the n-grams in the BOW model. For instance, ngram_size=3 would give unigram, bigram and trigram features.
no_output_layer bool Whether or not to add an output layer to the model (Softmax activation if exclusive_classes=True, else Logistic).
nO int Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when begin_training is called.

Entity linking architectures

An EntityLinker component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities into the "real world". This requires 3 main components:

  • A KnowledgeBase (KB) holding the unique identifiers, potential synonyms and prior probabilities.
  • A candidate generation step to produce a set of likely identifiers, given a certain textual mention.
  • A machine learning Model that picks the most plausible ID from the set of candidates.

spacy.EntityLinker.v1

The EntityLinker model architecture is a Thinc Model with a Linear output layer.

Example Config

[model]
@architectures = "spacy.EntityLinker.v1"
nO = null

[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 2
embed_size = 300
window_size = 1
maxout_pieces = 3
subword_features = true

[kb_loader]
@assets = "spacy.EmptyKB.v1"
entity_vector_length = 64

[get_candidates]
@assets = "spacy.CandidateGenerator.v1"

Name Type Description
tok2vec Model The tok2vec layer of the model.
nO int Output dimension, determined by the length of the vectors encoding each entity in the KB.

If the nO dimension is not set, the Entity Linking component will set it when begin_training is called.

spacy.EmptyKB.v1

A function that creates a default, empty KnowledgeBase from a Vocab instance.

Name Type Description
entity_vector_length int The length of the vectors encoding each entity in the KB (64 by default).

spacy.CandidateGenerator.v1

A function that takes as input a KnowledgeBase and a Span object denoting a named entity, and returns a list of plausible Candidate objects.

The default CandidateGenerator simply uses the text of a mention to find its potential aliases in the KnowledgeBase. Note that this function is case-sensitive.
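
The lookup described above can be illustrated with a toy knowledge base. This is a sketch only: ToyKB, ToyCandidate and get_candidates are hypothetical stand-ins rather than the library's classes, and the entity IDs and prior probabilities are invented.

```python
from typing import Dict, List, NamedTuple


class ToyCandidate(NamedTuple):
    entity_id: str
    prior_prob: float


class ToyKB:
    """Maps alias strings to candidate entities with prior probabilities."""

    def __init__(self) -> None:
        self._aliases: Dict[str, List[ToyCandidate]] = {}

    def add_alias(self, alias: str, entity_id: str, prior_prob: float) -> None:
        self._aliases.setdefault(alias, []).append(ToyCandidate(entity_id, prior_prob))

    def lookup(self, alias: str) -> List[ToyCandidate]:
        return list(self._aliases.get(alias, []))


def get_candidates(kb: ToyKB, mention_text: str) -> List[ToyCandidate]:
    # Exact string match on the mention text, so casing matters.
    return kb.lookup(mention_text)


kb = ToyKB()
kb.add_alias("Douglas", "E1", 0.8)
kb.add_alias("Douglas", "E2", 0.2)
print([c.entity_id for c in get_candidates(kb, "Douglas")])  # ['E1', 'E2']
print(get_candidates(kb, "douglas"))  # []
```

The second call returns no candidates because the lowercased mention does not match any stored alias.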