diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md
index 8d28a78c3..0050b53a5 100644
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@@ -417,20 +417,18 @@ network has an internal CNN Tok2Vec layer and uses attention.
> nO = null
> ```

-| Name                 | Type  | Description |
-| -------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------- |
-| `exclusive_classes`  | bool  | Whether or not categories are mutually exclusive. |
-| `pretrained_vectors` | bool  | Whether or not pretrained vectors will be used in addition to the feature vectors. |
-| `width`              | int   | Output dimension of the feature encoding step. |
-| `embed_size`         | int   | Input dimension of the feature encoding step. |
-| `conv_depth`         | int   | Depth of the Tok2Vec layer. |
-| `window_size`        | int   | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
-| `ngram_size`         | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. |
-| `dropout`            | float | The dropout rate. |
-| `nO`                 | int   | Output dimension, determined by the number of different labels. |
-
-If the `nO` dimension is not set, the TextCategorizer component will set it when
-`begin_training` is called.
+| Name                 | Type  | Description |
+| -------------------- | ----- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `exclusive_classes`  | bool  | Whether or not categories are mutually exclusive. |
+| `pretrained_vectors` | bool  | Whether or not pretrained vectors will be used in addition to the feature vectors. |
+| `width`              | int   | Output dimension of the feature encoding step. |
+| `embed_size`         | int   | Input dimension of the feature encoding step. |
+| `conv_depth`         | int   | Depth of the Tok2Vec layer. |
+| `window_size`        | int   | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
+| `ngram_size`         | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
+| `dropout`            | float | The dropout rate. |
+| `nO`                 | int   | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
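+For illustration, a minimal sketch of plugging this architecture into the
+`textcat` component (label names and hyperparameter values below are
+hypothetical, not tuned defaults):
+
+```python
+import spacy
+
+nlp = spacy.blank("en")
+textcat = nlp.add_pipe(
+    "textcat",
+    config={
+        "model": {
+            "@architectures": "spacy.TextCatEnsemble.v1",
+            "exclusive_classes": True,
+            "pretrained_vectors": False,
+            "width": 64,
+            "embed_size": 2000,
+            "conv_depth": 2,
+            "window_size": 1,
+            "ngram_size": 1,
+            "dropout": 0.0,
+        }
+    },
+)
+# Illustrative labels. `nO` is left unset here, so the component infers the
+# output dimension from the added labels when `begin_training` is called.
+textcat.add_label("POSITIVE")
+textcat.add_label("NEGATIVE")
+```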
### spacy.TextCatCNN.v1 {#TextCatCNN}

@@ -457,14 +455,12 @@ A neural network model where token vectors are calculated using a CNN. The
vectors are mean pooled and used as features in a feed-forward network. This
architecture is usually less accurate than the ensemble, but runs faster.

-| Name                | Type                                       | Description |
-| ------------------- | ------------------------------------------ | ---------------------------------------------------------------- |
-| `exclusive_classes` | bool                                       | Whether or not categories are mutually exclusive. |
-| `tok2vec`           | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
-| `nO`                | int                                        | Output dimension, determined by the number of different labels. |
-
-If the `nO` dimension is not set, the TextCategorizer component will set it when
-`begin_training` is called.
+| Name                | Type                                       | Description |
+| ------------------- | ------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `exclusive_classes` | bool                                       | Whether or not categories are mutually exclusive. |
+| `tok2vec`           | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
+| `nO`                | int                                        | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |

### spacy.TextCatBOW.v1 {#TextCatBOW}

@@ -482,17 +478,17 @@ If the `nO` dimension is not set, the TextCategorizer component will set it when
An ngram "bag-of-words" model. This architecture should run much faster than
the others, but may not be as accurate, especially if texts are short.

-| Name                | Type  | Description |
-| ------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------- |
-| `exclusive_classes` | bool  | Whether or not categories are mutually exclusive. |
-| `ngram_size`        | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. |
-| `no_output_layer`   | float | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`. |
-| `nO`                | int   | Output dimension, determined by the number of different labels. |
-
-If the `nO` dimension is not set, the TextCategorizer component will set it when
-`begin_training` is called.
+| Name                | Type | Description |
+| ------------------- | ---- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
+| `ngram_size`        | int  | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
+| `no_output_layer`   | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
+| `nO`                | int  | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. |
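+To make `ngram_size` concrete, a plain-Python sketch (illustrative only, not
+the registered implementation) of the features a bag-of-words model would
+extract with `ngram_size=3`:
+
+```python
+def ngram_features(tokens, ngram_size=3):
+    """Collect all n-grams up to `ngram_size` (unigrams, bigrams, trigrams)."""
+    features = []
+    for n in range(1, ngram_size + 1):
+        for i in range(len(tokens) - n + 1):
+            features.append(tuple(tokens[i : i + n]))
+    return features
+
+print(ngram_features(["very", "good", "movie"]))
+# [('very',), ('good',), ('movie',), ('very', 'good'), ('good', 'movie'),
+#  ('very', 'good', 'movie')]
+```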
+
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}

diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index b63a4adba..0b3167901 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -340,7 +340,7 @@ See the [`Transformer`](/api/transformer) API reference and

## Batchers {#batchers source="spacy/gold/batchers.py" new="3"}

-
+

#### batch_by_words.v1 {#batch_by_words tag="registered function"}

@@ -361,19 +361,16 @@ themselves, or be discarded if `discard_oversize` is set to `True`. The argument
> get_length = null
> ```

-
-
-| Name               | Type                   | Description |
-| ------------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
-| `size`             | `Iterable[int]` / int  | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
-| `tolerance`        | float                  | |
-| `discard_oversize` | bool                   | Discard items that are longer than the specified batch length. |
-| `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
+| Name               | Type                   | Description |
+| ------------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `seqs`             | `Iterable[Any]`        | The sequences to minibatch. |
+| `size`             | `Iterable[int]` / int  | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
+| `tolerance`        | float                  | What percentage of the size to allow batches to exceed. |
+| `discard_oversize` | bool                   | Whether to discard sequences that by themselves exceed the tolerated size. |
+| `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |

#### batch_by_sequence.v1 {#batch_by_sequence tag="registered function"}

-
-
> #### Example config
>
> ```ini
> [training.batcher]
> get_length = null
> ```

-
+Create a batcher that produces batches of the specified size.

-| Name         | Type                   | Description |
-| ------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
-| `size`       | `Iterable[int]` / int  | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
-| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
+| Name         | Type                   | Description |
+| ------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `size`       | `Iterable[int]` / int  | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
+| `get_length` | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |

#### batch_by_padded.v1 {#batch_by_padded tag="registered function"}

-
-
> #### Example config
>
> ```ini
> [training.batcher]
-> @batchers = "batch_by_words.v1"
+> @batchers = "batch_by_padded.v1"
> size = 100
-> buffer = TODO:
+> buffer = 256
> discard_oversize = false
> get_length = null
> ```

-| Name               | Type                   | Description |
-| ------------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
-| `size`             | `Iterable[int]` / int  | The batch size. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
-| `buffer`           | int                    | |
-| `discard_oversize` | bool                   | Discard items that are longer than the specified batch length. |
-| `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence and returns its length. Defaults to the built-in `len()` if not set. |
+Minibatch a sequence by the size of padded batches that would result, with
+sequences binned by length within a window. The padded size is defined as the
+maximum length of sequences within the batch multiplied by the number of
+sequences in the batch.
+
+| Name               | Type                   | Description |
+| ------------------ | ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `size`             | `Iterable[int]` / int  | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). |
+| `buffer`           | int                    | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. |
+| `discard_oversize` | bool                   | Whether to discard sequences that are by themselves longer than the largest padded batch size. |
+| `get_length`       | `Callable[[Any], int]` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. |
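+As an illustrative sketch of the padded-size logic (simplified: it omits
+`discard_oversize`, `get_length` and schedule handling, and is not the
+registered implementation):
+
+```python
+def padded_size(batch):
+    # Padded size = length of the longest sequence * number of sequences.
+    return max(len(seq) for seq in batch) * len(batch)
+
+def batch_by_padded(seqs, size, buffer=256):
+    # Accumulate `buffer` sequences, sort them by length so similar lengths
+    # are binned together, then pack greedily up to the padded `size`.
+    seqs = list(seqs)
+    for start in range(0, len(seqs), buffer):
+        window = sorted(seqs[start : start + buffer], key=len)
+        batch = []
+        for seq in window:
+            if batch and padded_size(batch + [seq]) > size:
+                yield batch
+                batch = []
+            batch.append(seq)
+        if batch:
+            yield batch
+```
+
+Sorting within the buffered window keeps sequences of similar length together,
+so less computation is wasted on padding, while the bounded buffer keeps the
+iteration order reasonably random.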
## Training data and alignment {#gold source="spacy/gold"}