diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 0d13e4749..6d05f0b86 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -285,8 +285,8 @@ Encode context using bidirectional LSTM layers. Requires Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a learned linear projection to control the dimensionality. Unknown tokens are -mapped to a zero vector. See the documentation on [static -vectors](/usage/embeddings-transformers#static-vectors) for details. +mapped to a zero vector. See the documentation on +[static vectors](/usage/embeddings-transformers#static-vectors) for details. | Name |  Description | | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -449,7 +449,7 @@ For more information, see the section on > ```ini > [pretraining] > component = "tok2vec" -> +> > [initialize] > vectors = "en_core_web_lg" > ... @@ -462,8 +462,8 @@ For more information, see the section on > ``` Predict the word's vector from a static embeddings table as pretraining -objective for a Tok2Vec layer. To use this objective, make sure that the -`initialize.vectors` section in the config refers to a model with static +objective for a Tok2Vec layer. To use this objective, make sure that the +`initialize.vectors` section in the config refers to a model with static vectors. | Name | Description | @@ -649,8 +649,8 @@ from the linear model, where it is stored in `model.attrs["multi_label"]`. -[TextCatEnsemble.v1](/api/legacy#TextCatEnsemble_v1) was functionally similar, but used an internal `tok2vec` instead of -taking it as argument: +[TextCatEnsemble.v1](/api/legacy#TextCatEnsemble_v1) was functionally similar, +but used an internal `tok2vec` instead of taking it as argument: | Name | Description | | -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -701,8 +701,9 @@ architecture is usually less accurate than the ensemble, but runs faster. -[TextCatCNN.v1](/api/legacy#TextCatCNN_v1) had the exact same signature, but was not yet resizable. -Since v2, new labels can be added to this component, even after training. +[TextCatCNN.v1](/api/legacy#TextCatCNN_v1) had the exact same signature, but was +not yet resizable. Since v2, new labels can be added to this component, even +after training. @@ -732,8 +733,9 @@ the others, but may not be as accurate, especially if texts are short. -[TextCatBOW.v1](/api/legacy#TextCatBOW_v1) had the exact same signature, but was not yet resizable. -Since v2, new labels can be added to this component, even after training. +[TextCatBOW.v1](/api/legacy#TextCatBOW_v1) had the exact same signature, but was +not yet resizable. Since v2, new labels can be added to this component, even +after training. @@ -747,7 +749,7 @@ Since v2, new labels can be added to this component, even after training. > [model] > @architectures = "spacy.SpanCategorizer.v1" > scorer = {"@layers": "spacy.LinearLogistic.v1"} -> +> > [model.reducer] > @layers = spacy.mean_max_reducer.v1" > hidden_size = 128 diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md index f2d67b95c..fa02a6f99 100644 --- a/website/docs/api/dependencyparser.md +++ b/website/docs/api/dependencyparser.md @@ -231,14 +231,14 @@ model. Delegates to [`predict`](/api/dependencyparser#predict) and > losses = parser.update(examples, sgd=optimizer) > ``` -| Name | Description | -| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | -| _keyword-only_ | | -| `drop` | The dropout rate. ~~float~~ | -| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | -| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | -| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## DependencyParser.get_loss {#get_loss tag="method"} diff --git a/website/docs/api/entityruler.md b/website/docs/api/entityruler.md index 76a4b3604..66cb6d4e4 100644 --- a/website/docs/api/entityruler.md +++ b/website/docs/api/entityruler.md @@ -35,11 +35,11 @@ how the component should be configured. You can override its settings via the > ``` | Setting | Description | -| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- | | `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ | | `validate` | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. ~~bool~~ | | `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ | -| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"||"`. ~~str~~ | +| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ | ```python %%GITHUB_SPACY/spacy/pipeline/entityruler.py @@ -64,14 +64,14 @@ be a token pattern (list) or a phrase pattern (string). For example: > ``` | Name | Description | -| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ----------- | | `nlp` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. ~~Language~~ | | `name` 3 | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. ~~str~~ | | _keyword-only_ | | | `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ | | `validate` | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. ~~bool~~ | | `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ | -| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"||"`. ~~str~~ | +| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `" | | "`. ~~str~~ | | `patterns` | Optional patterns to load in on initialization. ~~Optional[List[Dict[str, Union[str, List[dict]]]]]~~ | ## EntityRuler.initialize {#initialize tag="method" new="3"} diff --git a/website/docs/api/kb.md b/website/docs/api/kb.md index 3cbc5dbd8..e7a8fcd6f 100644 --- a/website/docs/api/kb.md +++ b/website/docs/api/kb.md @@ -245,8 +245,8 @@ certain prior probability. ### Candidate.\_\_init\_\_ {#candidate-init tag="method"} Construct a `Candidate` object. Usually this constructor is not called directly, -but instead these objects are returned by the -`get_candidates` method of the [`entity_linker`](/api/entitylinker) pipe. +but instead these objects are returned by the `get_candidates` method of the +[`entity_linker`](/api/entitylinker) pipe. > #### Example > diff --git a/website/docs/api/legacy.md b/website/docs/api/legacy.md index 563d5aea8..ff8a0f3c1 100644 --- a/website/docs/api/legacy.md +++ b/website/docs/api/legacy.md @@ -178,8 +178,9 @@ added to an existing vectors table. See more details in ### spacy.TextCatCNN.v1 {#TextCatCNN_v1} -Since `spacy.TextCatCNN.v2`, this architecture has become resizable, which means that you can add -labels to a previously trained textcat. `TextCatCNN` v1 did not yet support that. +Since `spacy.TextCatCNN.v2`, this architecture has become resizable, which means +that you can add labels to a previously trained textcat. `TextCatCNN` v1 did not +yet support that. > #### Example Config > @@ -213,8 +214,9 @@ architecture is usually less accurate than the ensemble, but runs faster. ### spacy.TextCatBOW.v1 {#TextCatBOW_v1} -Since `spacy.TextCatBOW.v2`, this architecture has become resizable, which means that you can add -labels to a previously trained textcat. `TextCatBOW` v1 did not yet support that. +Since `spacy.TextCatBOW.v2`, this architecture has become resizable, which means +that you can add labels to a previously trained textcat. `TextCatBOW` v1 did not +yet support that. > #### Example Config > diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index c0e169799..9c15f8797 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -120,14 +120,14 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`. > matches = matcher(doc) > ``` -| Name | Description | -| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | -| _keyword-only_ | | -| `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | -| `allow_missing` 3 | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ | -| `with_alignments` 3.0.6 | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the corresponding index of the token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ | -| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | +| Name | Description | +| ------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | +| _keyword-only_ | | +| `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | +| `allow_missing` 3 | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ | +| `with_alignments` 3.0.6 | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the corresponding index of the token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ | +| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | ## Matcher.\_\_len\_\_ {#len tag="method" new="2"} diff --git a/website/docs/api/morphologizer.md b/website/docs/api/morphologizer.md index 059040a19..d2dd28ac2 100644 --- a/website/docs/api/morphologizer.md +++ b/website/docs/api/morphologizer.md @@ -61,11 +61,11 @@ shortcut for this and instantiate the component using its string name and > morphologizer = Morphologizer(nlp.vocab, model) > ``` -| Name | Description | -| -------------- | -------------------------------------------------------------------------------------------------------------------- | -| `vocab` | The shared vocabulary. ~~Vocab~~ | -| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ | -| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| Name | Description | +| ------- | -------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | ## Morphologizer.\_\_call\_\_ {#call tag="method"} @@ -200,14 +200,14 @@ Delegates to [`predict`](/api/morphologizer#predict) and > losses = morphologizer.update(examples, sgd=optimizer) > ``` -| Name | Description | -| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | -| _keyword-only_ | | -| `drop` | The dropout rate. ~~float~~ | -| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | -| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | -| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## Morphologizer.get_loss {#get_loss tag="method"} diff --git a/website/docs/api/morphology.md b/website/docs/api/morphology.md index e64f26bdd..565e520b5 100644 --- a/website/docs/api/morphology.md +++ b/website/docs/api/morphology.md @@ -98,18 +98,18 @@ representation. > assert f == "Feat1=Val1|Feat2=Val2" > ``` -| Name | Description | -| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `feats_dict` | The morphological features as a dictionary. ~~Dict[str, str]~~ | +| Name | Description | +| ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------- | +| `feats_dict` | The morphological features as a dictionary. ~~Dict[str, str]~~ | | **RETURNS** | The morphological features in Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ | ## Attributes {#attributes} -| Name | Description | -| ------------- | ------------------------------------------------------------------------------------------------------------------------------ | -| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is `|`. ~~str~~ | -| `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ | -| `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ | +| Name | Description | +| ------------- | ---------------------------------------------------------------------------------------------------------------------------- | ---------- | +| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is ` | `. ~~str~~ | +| `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ | +| `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ | ## MorphAnalysis {#morphanalysis tag="class" source="spacy/tokens/morphanalysis.pyx"} diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md index 540476949..4a5fb6042 100644 --- a/website/docs/api/phrasematcher.md +++ b/website/docs/api/phrasematcher.md @@ -59,7 +59,7 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`. | Name | Description | | ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | +| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | | _keyword-only_ | | | `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | | **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | @@ -149,8 +149,8 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")] | Name | Description | -| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `match_id` | An ID for the thing you're matching. ~~str~~ | | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | +| `match_id` | An ID for the thing you're matching. ~~str~~ | | | `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ | | _keyword-only_ | | | `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ | diff --git a/website/docs/api/sentencerecognizer.md b/website/docs/api/sentencerecognizer.md index ce66ecaa4..e82a4bef6 100644 --- a/website/docs/api/sentencerecognizer.md +++ b/website/docs/api/sentencerecognizer.md @@ -187,14 +187,14 @@ Delegates to [`predict`](/api/sentencerecognizer#predict) and > losses = senter.update(examples, sgd=optimizer) > ``` -| Name | Description | -| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | -| _keyword-only_ | | -| `drop` | The dropout rate. ~~float~~ | -| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | -| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | -| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## SentenceRecognizer.rehearse {#rehearse tag="method,experimental" new="3"} diff --git a/website/docs/api/sentencizer.md b/website/docs/api/sentencizer.md index a377fcf65..75a253fc0 100644 --- a/website/docs/api/sentencizer.md +++ b/website/docs/api/sentencizer.md @@ -28,7 +28,7 @@ how the component should be configured. You can override its settings via the > ``` | Setting | Description | -| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ | | `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`. ~~Optional[List[str]]~~ | `None` | ```python diff --git a/website/docs/api/span.md b/website/docs/api/span.md index 333344b31..9212f957d 100644 --- a/website/docs/api/span.md +++ b/website/docs/api/span.md @@ -491,8 +491,8 @@ document by the `parser`, `senter`, `sentencizer` or some custom function. It will raise an error otherwise. If the span happens to cross sentence boundaries, only the first sentence will -be returned. If it is required that the sentence always includes the -full span, the result can be adjusted as such: +be returned. If it is required that the sentence always includes the full span, +the result can be adjusted as such: ```python sent = span.sent diff --git a/website/docs/api/spancategorizer.md b/website/docs/api/spancategorizer.md index ccab6535e..bc15ea472 100644 --- a/website/docs/api/spancategorizer.md +++ b/website/docs/api/spancategorizer.md @@ -213,14 +213,14 @@ Delegates to [`predict`](/api/spancategorizer#predict) and > losses = spancat.update(examples, sgd=optimizer) > ``` -| Name | Description | -| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | -| _keyword-only_ | | -| `drop` | The dropout rate. ~~float~~ | -| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | -| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | -| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## SpanCategorizer.get_loss {#get_loss tag="method"} diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md index 1a4c70522..3002aff7b 100644 --- a/website/docs/api/tagger.md +++ b/website/docs/api/tagger.md @@ -25,9 +25,9 @@ architectures and their arguments and hyperparameters. > nlp.add_pipe("tagger", config=config) > ``` -| Setting | Description | -| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `model` | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). Defaults to [Tagger](/api/architectures#Tagger). ~~Model[List[Doc], List[Floats2d]]~~ | +| Setting | Description | +| ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `model` | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). Defaults to [Tagger](/api/architectures#Tagger). ~~Model[List[Doc], List[Floats2d]]~~ | ```python %%GITHUB_SPACY/spacy/pipeline/tagger.pyx @@ -54,11 +54,11 @@ Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and [`nlp.add_pipe`](/api/language#add_pipe). -| Name | Description | -| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | The shared vocabulary. ~~Vocab~~ | -| `model` | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). ~~Model[List[Doc], List[Floats2d]]~~ | -| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| Name | Description | +| ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). ~~Model[List[Doc], List[Floats2d]]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | ## Tagger.\_\_call\_\_ {#call tag="method"} @@ -198,14 +198,14 @@ Delegates to [`predict`](/api/tagger#predict) and > losses = tagger.update(examples, sgd=optimizer) > ``` -| Name | Description | -| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | -| _keyword-only_ | | -| `drop` | The dropout rate. ~~float~~ | -| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | -| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | -| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## Tagger.rehearse {#rehearse tag="method,experimental" new="3"} diff --git a/website/docs/api/tok2vec.md b/website/docs/api/tok2vec.md index 90278e8cc..70c352b4d 100644 --- a/website/docs/api/tok2vec.md +++ b/website/docs/api/tok2vec.md @@ -196,14 +196,14 @@ Delegates to [`predict`](/api/tok2vec#predict). > losses = tok2vec.update(examples, sgd=optimizer) > ``` -| Name | Description | -| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | -| _keyword-only_ | | -| `drop` | The dropout rate. ~~float~~ | -| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | -| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | -| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## Tok2Vec.create_optimizer {#create_optimizer tag="method"} diff --git a/website/docs/api/token.md b/website/docs/api/token.md index c272b1fce..44c92d1ee 100644 --- a/website/docs/api/token.md +++ b/website/docs/api/token.md @@ -362,8 +362,8 @@ unknown. Defaults to `True` for the first token in the `Doc`. > assert not doc[5].is_sent_start > ``` -| Name | Description | -| ----------- | --------------------------------------------- | +| Name | Description | +| ----------- | ------------------------------------------------------- | | **RETURNS** | Whether the token starts a sentence. ~~Optional[bool]~~ | ## Token.has_vector {#has_vector tag="property" model="vectors"} @@ -420,73 +420,73 @@ The L2 norm of the token's vector representation. ## Attributes {#attributes} -| Name | Description | -| -------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doc` | The parent document. ~~Doc~~ | -| `lex` 3 | The underlying lexeme. ~~Lexeme~~ | -| `sent` 2.0.12 | The sentence span that this token is a part of. ~~Span~~ | -| `text` | Verbatim text content. ~~str~~ | -| `text_with_ws` | Text content, with trailing space character if present. ~~str~~ | -| `whitespace_` | Trailing space character if present. ~~str~~ | -| `orth` | ID of the verbatim text content. ~~int~~ | -| `orth_` | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. ~~str~~ | -| `vocab` | The vocab object of the parent `Doc`. ~~vocab~~ | -| `tensor` 2.1.7 | The token's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ | -| `head` | The syntactic parent, or "governor", of this token. ~~Token~~ | -| `left_edge` | The leftmost token of this token's syntactic descendants. ~~Token~~ | -| `right_edge` | The rightmost token of this token's syntactic descendants. ~~Token~~ | -| `i` | The index of the token within the parent document. ~~int~~ | -| `ent_type` | Named entity type. ~~int~~ | -| `ent_type_` | Named entity type. ~~str~~ | -| `ent_iob` | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. ~~int~~ | -| `ent_iob_` | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. ~~str~~ | -| `ent_kb_id` 2.2 | Knowledge base ID that refers to the named entity this token is a part of, if any. ~~int~~ | -| `ent_kb_id_` 2.2 | Knowledge base ID that refers to the named entity this token is a part of, if any. ~~str~~ | -| `ent_id` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~int~~ | -| `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ | -| `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ | -| `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ | -| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~int~~ | -| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~str~~ | -| `lower` | Lowercase form of the token. ~~int~~ | -| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ | -| `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ | -| `shape_` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ | -| `prefix` | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. ~~int~~ | -| `prefix_` | A length-N substring from the start of the token. Defaults to `N=1`. ~~str~~ | -| `suffix` | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. ~~int~~ | -| `suffix_` | Length-N substring from the end of the token. Defaults to `N=3`. ~~str~~ | -| `is_alpha` | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. ~~bool~~ | -| `is_ascii` | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. ~~bool~~ | -| `is_digit` | Does the token consist of digits? Equivalent to `token.text.isdigit()`. ~~bool~~ | -| `is_lower` | Is the token in lowercase? Equivalent to `token.text.islower()`. ~~bool~~ | -| `is_upper` | Is the token in uppercase? Equivalent to `token.text.isupper()`. ~~bool~~ | -| `is_title` | Is the token in titlecase? Equivalent to `token.text.istitle()`. ~~bool~~ | -| `is_punct` | Is the token punctuation? ~~bool~~ | -| `is_left_punct` | Is the token a left punctuation mark, e.g. `"("` ? ~~bool~~ | -| `is_right_punct` | Is the token a right punctuation mark, e.g. `")"` ? ~~bool~~ | -| `is_space` | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. ~~bool~~ | -| `is_bracket` | Is the token a bracket? ~~bool~~ | -| `is_quote` | Is the token a quotation mark? ~~bool~~ | -| `is_currency` 2.0.8 | Is the token a currency symbol? ~~bool~~ | -| `like_url` | Does the token resemble a URL? ~~bool~~ | -| `like_num` | Does the token represent a number? e.g. "10.9", "10", "ten", etc. ~~bool~~ | -| `like_email` | Does the token resemble an email address? ~~bool~~ | -| `is_oov` | Is the token out-of-vocabulary (i.e. does it not have a word vector)? ~~bool~~ | -| `is_stop` | Is the token part of a "stop list"? ~~bool~~ | -| `pos` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~int~~ | -| `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ | -| `tag` | Fine-grained part-of-speech. ~~int~~ | -| `tag_` | Fine-grained part-of-speech. ~~str~~ | -| `morph` 3 | Morphological analysis. ~~MorphAnalysis~~ | -| `dep` | Syntactic dependency relation. ~~int~~ | -| `dep_` | Syntactic dependency relation. ~~str~~ | -| `lang` | Language of the parent document's vocabulary. ~~int~~ | -| `lang_` | Language of the parent document's vocabulary. ~~str~~ | -| `prob` | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). ~~float~~ | -| `idx` | The character offset of the token within the parent document. ~~int~~ | -| `sentiment` | A scalar value indicating the positivity or negativity of the token. ~~float~~ | -| `lex_id` | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ | -| `rank` | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ | -| `cluster` | Brown cluster ID. ~~int~~ | -| `_` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). ~~Underscore~~ | +| Name | Description | +| -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doc` | The parent document. ~~Doc~~ | +| `lex` 3 | The underlying lexeme. ~~Lexeme~~ | +| `sent` 2.0.12 | The sentence span that this token is a part of. ~~Span~~ | +| `text` | Verbatim text content. ~~str~~ | +| `text_with_ws` | Text content, with trailing space character if present. ~~str~~ | +| `whitespace_` | Trailing space character if present. ~~str~~ | +| `orth` | ID of the verbatim text content. ~~int~~ | +| `orth_` | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. ~~str~~ | +| `vocab` | The vocab object of the parent `Doc`. ~~vocab~~ | +| `tensor` 2.1.7 | The token's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ | +| `head` | The syntactic parent, or "governor", of this token. ~~Token~~ | +| `left_edge` | The leftmost token of this token's syntactic descendants. ~~Token~~ | +| `right_edge` | The rightmost token of this token's syntactic descendants. ~~Token~~ | +| `i` | The index of the token within the parent document. ~~int~~ | +| `ent_type` | Named entity type. ~~int~~ | +| `ent_type_` | Named entity type. ~~str~~ | +| `ent_iob` | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. ~~int~~ | +| `ent_iob_` | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. ~~str~~ | +| `ent_kb_id` 2.2 | Knowledge base ID that refers to the named entity this token is a part of, if any. ~~int~~ | +| `ent_kb_id_` 2.2 | Knowledge base ID that refers to the named entity this token is a part of, if any. ~~str~~ | +| `ent_id` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~int~~ | +| `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ | +| `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ | +| `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ | +| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~int~~ | +| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~str~~ | +| `lower` | Lowercase form of the token. ~~int~~ | +| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ | +| `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ | +| `shape_` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ | +| `prefix` | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. ~~int~~ | +| `prefix_` | A length-N substring from the start of the token. Defaults to `N=1`. ~~str~~ | +| `suffix` | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. ~~int~~ | +| `suffix_` | Length-N substring from the end of the token. Defaults to `N=3`. ~~str~~ | +| `is_alpha` | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. ~~bool~~ | +| `is_ascii` | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. ~~bool~~ | +| `is_digit` | Does the token consist of digits? Equivalent to `token.text.isdigit()`. ~~bool~~ | +| `is_lower` | Is the token in lowercase? Equivalent to `token.text.islower()`. ~~bool~~ | +| `is_upper` | Is the token in uppercase? Equivalent to `token.text.isupper()`. ~~bool~~ | +| `is_title` | Is the token in titlecase? Equivalent to `token.text.istitle()`. ~~bool~~ | +| `is_punct` | Is the token punctuation? ~~bool~~ | +| `is_left_punct` | Is the token a left punctuation mark, e.g. `"("` ? ~~bool~~ | +| `is_right_punct` | Is the token a right punctuation mark, e.g. `")"` ? ~~bool~~ | +| `is_space` | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. ~~bool~~ | +| `is_bracket` | Is the token a bracket? ~~bool~~ | +| `is_quote` | Is the token a quotation mark? ~~bool~~ | +| `is_currency` 2.0.8 | Is the token a currency symbol? ~~bool~~ | +| `like_url` | Does the token resemble a URL? ~~bool~~ | +| `like_num` | Does the token represent a number? e.g. "10.9", "10", "ten", etc. ~~bool~~ | +| `like_email` | Does the token resemble an email address? ~~bool~~ | +| `is_oov` | Is the token out-of-vocabulary (i.e. does it not have a word vector)? ~~bool~~ | +| `is_stop` | Is the token part of a "stop list"? ~~bool~~ | +| `pos` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~int~~ | +| `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ | +| `tag` | Fine-grained part-of-speech. ~~int~~ | +| `tag_` | Fine-grained part-of-speech. ~~str~~ | +| `morph` 3 | Morphological analysis. ~~MorphAnalysis~~ | +| `dep` | Syntactic dependency relation. ~~int~~ | +| `dep_` | Syntactic dependency relation. ~~str~~ | +| `lang` | Language of the parent document's vocabulary. ~~int~~ | +| `lang_` | Language of the parent document's vocabulary. ~~str~~ | +| `prob` | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). ~~float~~ | +| `idx` | The character offset of the token within the parent document. ~~int~~ | +| `sentiment` | A scalar value indicating the positivity or negativity of the token. ~~float~~ | +| `lex_id` | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ | +| `rank` | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ | +| `cluster` | Brown cluster ID. ~~int~~ | +| `_` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). ~~Underscore~~ | diff --git a/website/docs/api/tokenizer.md b/website/docs/api/tokenizer.md index 5958f2e57..8809c10bc 100644 --- a/website/docs/api/tokenizer.md +++ b/website/docs/api/tokenizer.md @@ -239,6 +239,7 @@ it. | `infix_finditer` | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of `re.MatchObject` objects. ~~Optional[Callable[[str], Iterator[Match]]]~~ | | `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ | | `rules` | A dictionary of tokenizer exceptions and special cases. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ | + ## Serialization fields {#serialization-fields} During serialization, spaCy will export several data fields used to restore diff --git a/website/docs/api/vectors.md b/website/docs/api/vectors.md index ba2d5ab42..598abe681 100644 --- a/website/docs/api/vectors.md +++ b/website/docs/api/vectors.md @@ -290,8 +290,8 @@ If a table is full, it can be resized using ## Vectors.n_keys {#n_keys tag="property"} Get the number of keys in the table. Note that this is the number of _all_ keys, -not just unique vectors. If several keys are mapped to the same -vectors, they will be counted individually. +not just unique vectors. If several keys are mapped to the same vectors, they +will be counted individually. > #### Example > @@ -321,7 +321,7 @@ performed in chunks to avoid consuming too much memory. You can set the > ``` | Name | Description | -| -------------- | --------------------------------------------------------------------------- | +| -------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | | `queries` | An array with one or more vectors. ~~numpy.ndarray~~ | | _keyword-only_ | | | `batch_size` | The batch size to use. Default to `1024`. ~~int~~ |