diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md
index b237729be..601b644c1 100644
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -82,7 +82,7 @@ shortcut for this and instantiate the component using its string name and
 | `moves` | A list of transition names. Inferred from the data if set to `None`, which is the default. ~~Optional[List[str]]~~ |
 | _keyword-only_ | |
 | `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
-| `incorrect_spans_key` | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group, under this key. Defaults to `None`. ~~Optional[str]~~ |
+| `incorrect_spans_key` | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group in [`Doc.spans`](/api/doc#spans), under this key. Defaults to `None`. ~~Optional[str]~~ |

 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

diff --git a/website/docs/usage/v3-1.md b/website/docs/usage/v3-1.md
new file mode 100644
index 000000000..fb04c8e46
--- /dev/null
+++ b/website/docs/usage/v3-1.md
@@ -0,0 +1,114 @@
+---
+title: What's New in v3.1
+teaser: New features and how to upgrade
+menu:
+  - ['New Features', 'features']
+  - ['Upgrading Notes', 'upgrading']
+---
+
+## New Features {#features hidden="true"}
+
+
+
+### Using predicted annotations during training {#predicted-annotations-training}
+
+
+
+
+
+This project shows how to use the `token.dep` attribute predicted by the parser
+as a feature for a subsequent tagger component in the pipeline.
+
+
+
+### SpanCategorizer for predicting arbitrary and overlapping spans {#spancategorizer tag="experimental"}
+
+A common task in applied NLP is extracting spans of text from documents,
+including longer phrases or nested expressions. Named entity recognition isn't
+the right tool for this problem, since an entity recognizer typically predicts
+single token-based tags that are very sensitive to boundaries. This is effective
+for proper nouns and self-contained expressions, but less useful for other types
+of phrases or overlapping spans. The new
+[`SpanCategorizer`](/api/spancategorizer) component and
+[SpanCategorizer](/api/architectures#spancategorizer) architecture let you label
+arbitrary and potentially overlapping spans of text. A span categorizer
+consists of two parts: a [suggester function](/api/spancategorizer#suggesters)
+that proposes candidate spans, which may or may not overlap, and a labeler model
+that predicts zero or more labels for each candidate. The predicted spans are
+available via the [`Doc.spans`](/api/doc#spans) container.
+
+
+
+
+
+
+
+The upcoming version of our annotation tool [Prodigy](https://prodi.gy)
+(currently available as a [pre-release](https://support.prodi.gy/t/3861) for all
+users) features a [new workflow and UI](https://support.prodi.gy/t/3861) for
+annotating overlapping and nested spans. You can use it to create training data
+for spaCy's `SpanCategorizer` component.
+
+
+
+### Update the entity recognizer with partial incorrect annotations {#negative-samples}
+
+> #### config.cfg (excerpt)
+>
+> ```ini
+> [components.ner]
+> factory = "ner"
+> incorrect_spans_key = "incorrect_spans"
+> moves = null
+> update_with_oracle_cut_size = 100
+> ```
+
+The [`EntityRecognizer`](/api/entityrecognizer) can now be updated with known
+incorrect annotations, which lets you take advantage of partial and sparse data.
+For example, you'll be able to use the information that certain spans of text
+are definitely **not** `PERSON` entities, without having to provide the
+complete gold-standard annotations for the given example. The incorrect span
+annotations can be added via [`Doc.spans`](/api/doc#spans) in the training
+data under the key defined as
+[`incorrect_spans_key`](/api/entityrecognizer#init) in the component config.
+
+
+
+### New pipeline packages for Catalan and Danish {#pipeline-packages}
+
+
+
+| Package                                           | Language | Tagger | Parser |  NER |
+| ------------------------------------------------- | -------- | -----: | -----: | ---: |
+| [`ca_core_news_sm`](/models/ca#ca_core_news_sm)   | Catalan  |        |        |      |
+| [`ca_core_news_md`](/models/ca#ca_core_news_md)   | Catalan  |        |        |      |
+| [`ca_core_news_lg`](/models/ca#ca_core_news_lg)   | Catalan  |        |        |      |
+| [`ca_core_news_trf`](/models/ca#ca_core_news_trf) | Catalan  |        |        |      |
+| [`da_core_news_trf`](/models/da#da_core_news_trf) | Danish   |        |        |      |
+
+### Resizable text classification architectures {#resizable-textcat}
+
+
+
+### CLI command to assemble pipeline from config {#assemble}
+
+The [`spacy assemble`](/api/cli#assemble) command lets you assemble a pipeline
+from a config file without additional training. It can be especially useful for
+creating a blank pipeline with a custom tokenizer, rule-based components or word
+vectors.
+
+```cli
+$ python -m spacy assemble config.cfg ./output
+```
+
+### Support for streaming large or infinite corpora {#streaming-corpora}
+
+
+
+### New lemmatizers for Catalan and Italian {#pos-lemmatizers}
+
+
+
+## Notes about upgrading from v3.0 {#upgrading}
+
+
diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json
index a7e87ff72..afa8c7e2d 100644
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -9,7 +9,8 @@
                 { "text": "Models & Languages", "url": "/usage/models" },
                 { "text": "Facts & Figures", "url": "/usage/facts-figures" },
                 { "text": "spaCy 101", "url": "/usage/spacy-101" },
-                { "text": "New in v3.0", "url": "/usage/v3" }
+                { "text": "New in v3.0", "url": "/usage/v3" },
+                { "text": "New in v3.1", "url": "/usage/v3-1" }
             ]
         },
         {
@@ -135,9 +136,7 @@
             },
             {
                 "label": "Legacy",
-                "items": [
-                    { "text": "Legacy functions", "url": "/api/legacy" }
-                ]
+                "items": [{ "text": "Legacy functions", "url": "/api/legacy" }]
             }
         ]
     }
diff --git a/website/src/templates/index.js b/website/src/templates/index.js
index a5adc6e50..2c68ff056 100644
--- a/website/src/templates/index.js
+++ b/website/src/templates/index.js
@@ -119,8 +119,8 @@ const AlertSpace = ({ nightly, legacy }) => {
 }
 const navAlert = (
-
-    💥 Out now: spaCy v3.0
+
+    💥 Out now: spaCy v3.1
 )