From 7ed8f4504bc5a64f3c8afb8c2003b0ba92254d3c Mon Sep 17 00:00:00 2001 From: Matthew Honnibal Date: Fri, 21 Aug 2020 15:22:19 +0200 Subject: [PATCH] Update API docs for architectures --- website/docs/api/architectures.md | 51 +++++++++++++++---------------- 1 file changed, 25 insertions(+), 26 deletions(-) diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 835815496..993f25bbc 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -110,13 +110,11 @@ Instead of defining its own `Tok2Vec` instance, a model architecture like [Tagger](/api/architectures#tagger) can define a listener as its `tok2vec` argument that connects to the shared `tok2vec` component in the pipeline. - - | Name | Description | | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ | | `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed} @@ -139,15 +137,13 @@ definitions depending on the `Vocab` of the `Doc` object passed in. Vectors from pretrained static vectors can also be incorporated into the concatenated representation. - - | Name | Description | | ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `width` | The output width. Also used as the width of the embedding tables. Recommended values are between `64` and `300`. ~~int~~ | | `rows` | The number of rows for the embedding tables. Can be low, due to the hashing trick. Embeddings for prefix, suffix and word shape use half as many rows. Recommended values are between `2000` and `10000`. ~~int~~ | | `also_embed_subwords` | Whether to use the `PREFIX`, `SUFFIX` and `SHAPE` features in the embeddings. If not using these, you may need more rows in your hash embeddings, as there will be increased chance of collisions. ~~bool~~ | | `also_use_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [Doc](/api/doc) objects' vocab. ~~bool~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy.CharacterEmbed.v1 {#CharacterEmbed} @@ -178,15 +174,13 @@ concatenated. A hash-embedded vector of the `NORM` of the word is also concatenated on, and the result is then passed through a feed-forward network to construct a single vector to represent the information. - - | Name | Description | | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `width` | The width of the output vector and the `NORM` hash embedding. ~~int~~ | | `rows` | The number of rows in the `NORM` hash embedding table. ~~int~~ | | `nM` | The dimensionality of the character embeddings. Recommended values are between `16` and `64`. ~~int~~ | | `nC` | The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. ~~int~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder} @@ -277,12 +271,10 @@ Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a learned linear projection to control the dimensionality. See the documentation on [static vectors](/usage/embeddings-transformers#static-vectors) for details. - - | Name |  Description | | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `nO` | Defaults to `None`. ~~Optional[int]~~ | -| `nM` | Defaults to `None`. ~~Optional[int]~~ | +| `nO` | The output width of the layer, after the linear projection. ~~Optional[int]~~ | +| `nM` | The width of the static vectors. ~~Optional[int]~~ | | `dropout` | Optional dropout rate. If set, it's applied per dimension over the whole batch. Defaults to `None`. ~~Optional[float]~~ | | `init_W` | The [initialization function](https://thinc.ai/docs/api-initializers). Defaults to [`glorot_uniform_init`](https://thinc.ai/docs/api-initializers#glorot_uniform_init). ~~Callable[[Ops, Tuple[int, ...]]], FloatsXd]~~ | | `key_attr` | Defaults to `"ORTH"`. ~~str~~ | @@ -311,7 +303,22 @@ architectures into your training config. > stride = 96 > ``` - +Load and wrap a transformer model from the Huggingface transformers library. +You can any transformer that has pretrained weights and a PyTorch +implementation. The `name` variable is passed through to the underlying +library, so it can be either a string or a path. If it's a string, the +pretrained weights will be downloaded via the transformers library if they are +not already available locally. + +In order to support longer documents, the `TransformerModel` layer allows you +to pass in a `get_spans` function that will divide up the `Doc` objects before +passing them through the transformer. Your spans are allowed to overlap or +exclude tokens. + +This layer outputs a `FullTransformerBatch` dataclass. In order to plug the +layer into most architectures, you'll probably need to map the raw transformer +output to token-aligned vectors using a layer such as `trfs2arrays`. + | Name | Description | | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -541,8 +548,6 @@ specific data and challenge. Stacked ensemble of a bag-of-words model and a neural network model. The neural network has an internal CNN Tok2Vec layer and uses attention. - - | Name | Description | | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | @@ -554,7 +559,7 @@ network has an internal CNN Tok2Vec layer and uses attention. | `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | | `dropout` | The dropout rate. ~~float~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | ### spacy.TextCatCNN.v1 {#TextCatCNN} @@ -581,14 +586,12 @@ A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster. - - | Name | Description | | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | | `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | ### spacy.TextCatBOW.v1 {#TextCatBOW} @@ -606,15 +609,13 @@ architecture is usually less accurate than the ensemble, but runs faster. An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. - - | Name | Description | | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | | `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | | `no_output_layer` | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`. ~~bool~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"} @@ -659,13 +660,11 @@ into the "real world". This requires 3 main components: The `EntityLinker` model architecture is a Thinc `Model` with a [`Linear`](https://thinc.ai/api-layers#linear) output layer. - - | Name | Description | | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ | | `nO` | Output dimension, determined by the length of the vectors encoding each entity in the KB. If the `nO` dimension is not set, the entity linking component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | ### spacy.EmptyKB.v1 {#EmptyKB}