What's new in v3.2 (#9633)

* What's new in v3.2
* Fix formatting
* Fix typo
* Redo thanks
* Formatting
* Fix typo
* Fix project links
* Fix typo
* Minimal intro, floret python module
* Rephrase
* Rephrase, extend
* Rephrase
* Update links and formatting [ci skip]
* Minor correction
* Fix typo

Co-authored-by: Ines Montani <ines@ines.io>
2021-11-05 16:31:14 +01:00 · 216ed231a9
parent 0fc3dee772
commit 216ed231a9
3 changed files with 248 additions and 3 deletions
--- a/website/docs/usage/v3-2.md
+++ b/website/docs/usage/v3-2.md
@@ -0,0 +1,244 @@
+---
+title: What's New in v3.2
+teaser: New features and how to upgrade
+menu:
+  - ['New Features', 'features']
+  - ['Upgrading Notes', 'upgrading']
+---
+
+## New Features {#features hidden="true"}
+
+spaCy v3.2 adds support for [`floret`](https://github.com/explosion/floret)
+vectors, makes custom `Doc` creation and scoring easier, and includes many bug
+fixes and improvements. For the trained pipelines, there's a new transformer
+pipeline for Japanese and the Universal Dependencies training data has been
+updated across the board to the most recent release.
+
+<Infobox title="Improve performance for spaCy on Apple M1 with AppleOps" variant="warning" emoji="📣">
+
+spaCy is now up to **8 &times; faster on M1 Macs** by calling into Apple's
+native Accelerate library for matrix multiplication. For more details, see
+[`thinc-apple-ops`](https://github.com/explosion/thinc-apple-ops).
+
+```bash
+$ pip install spacy[apple]
+```
+
+</Infobox>
+
+### Registered scoring functions {#registered-scoring-functions}
+
+To customize the scoring, you can specify a scoring function for each component
+in your config from the new [`scorers` registry](/api/top-level#registry):
+
+```ini
+### config.cfg (excerpt) {highlight="3"}
+[components.tagger]
+factory = "tagger"
+scorer = {"@scorers":"spacy.tagger_scorer.v1"}
+```
+
+### Overwrite settings {#overwrite}
+
+Most pipeline components now include an `overwrite` setting in the config that
+determines whether existing annotation in the `Doc` is preserved or overwritten:
+
+```ini
+### config.cfg (excerpt) {highlight="3"}
+[components.tagger]
+factory = "tagger"
+overwrite = false
+```
+
+### Doc input for pipelines {#doc-input}
+
+[`nlp`](/api/language#call) and [`nlp.pipe`](/api/language#pipe) accept
+[`Doc`](/api/doc) input, skipping the tokenizer if a `Doc` is provided instead
+of a string. This makes it easier to create a `Doc` with custom tokenization or
+to set custom extensions before processing:
+
+```python
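+from spacy.tokens import Doc
+
+Doc.set_extension("text_id", default=None)  # register the custom attribute first
+doc = nlp.make_doc("This is text 500.")
+doc._.text_id = 500
+doc = nlp(doc)  # the tokenizer is skipped because the input is already a Doc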
+```
+
+### Support for floret vectors {#vectors}
+
+We recently published [`floret`](https://github.com/explosion/floret), an
+extended version of [fastText](https://fasttext.cc) that combines fastText's
+subwords with Bloom embeddings for compact, full-coverage vectors. The use of
+subwords means that there are no OOV words and due to Bloom embeddings, the
+vector table can be kept very small at <100K entries. Bloom embeddings are
+already used by [HashEmbed](https://thinc.ai/docs/api-layers#hashembed) in
+[tok2vec](/api/architectures#tok2vec-arch) for compact spaCy models.
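To make the combination of subwords and Bloom embeddings more concrete, here is a minimal pure-Python sketch. It is an illustration only and not floret's implementation: the MD5-based hashing, the helper names (`char_ngrams`, `bloom_rows`, `embed`), the summing of rows, and the toy 1000 × 4 table are all assumptions made for this example. The `minn=4, maxn=5` defaults mirror the settings quoted for the Finnish demo.

```python
import hashlib
import random


def char_ngrams(word, minn, maxn):
    """Character n-grams of the word wrapped in boundary markers, fastText-style."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def bloom_rows(key, n_rows, n_hashes):
    """Map one key to `n_hashes` rows of a table with `n_rows` entries.

    Hypothetical hashing scheme: one MD5 digest per seed. floret uses its own hashes.
    """
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.md5(f"{seed}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % n_rows)
    return rows


def embed(word, table, minn=4, maxn=5, n_hashes=2):
    """Sum the table rows selected by the whole word and by each of its subwords."""
    dim = len(table[0])
    vector = [0.0] * dim
    for key in [f"<{word}>"] + char_ngrams(word, minn, maxn):
        for row in bloom_rows(key, len(table), n_hashes):
            for d in range(dim):
                vector[d] += table[row][d]
    return vector


# Toy table: 1000 rows of 4 dimensions, filled with reproducible random values.
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(1000)]

# Any string gets a vector, including words never seen during training.
vec_seen = embed("talossa", table)
vec_unseen = embed("talossammekin", table)
print(len(vec_seen), len(vec_unseen))
```

Because every key is hashed into a fixed number of rows, the table size is chosen independently of the vocabulary size, which is why a small table can still cover every possible word.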
+
+For easy integration, floret includes a
+[Python wrapper](https://github.com/explosion/floret/blob/main/python/README.md):
+
+```bash
+$ pip install floret
+```
+
+A demo project shows how to train and import floret vectors:
+
+<Project id="pipelines/floret_vectors_demo">
+
+Train toy English floret vectors and import them into a spaCy pipeline.
+
+</Project>
+
+Two additional demo projects compare standard fastText vectors with floret
+vectors for full spaCy pipelines. For agglutinative languages like Finnish or
+Korean, there are large improvements in performance due to the use of subwords
+(no OOV words!), with a vector table containing merely 50K entries.
+
+<Project id="pipelines/floret_fi_core_demo">
+
+Finnish UD+NER vector and pipeline training, comparing standard fasttext vs.
+floret vectors.
+
+For the default project settings with 1M (2.6G) tokenized training texts and 50K
+300-dim vectors, ~300K keys for the standard vectors:
+
+| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |    NER F |
+| -------------------------------------------- | -------: | -------: | -------: | -------: | -------: |
+| none                                         |     93.3 |     92.3 |     79.7 |     72.8 |     61.0 |
+| standard (pruned: 50K vectors for 300K keys) |     95.9 |     94.7 |     83.3 |     77.9 |     68.5 |
+| standard (unpruned: 300K vectors/keys)       |     96.0 |     95.0 | **83.8** |     78.4 |     69.1 |
+| floret (minn 4, maxn 5; 50K vectors, no OOV) | **96.6** | **95.5** |     83.5 | **78.5** | **70.9** |
+
+</Project>
+
+<Project id="pipelines/floret_ko_ud_demo">
+
+Korean UD vector and pipeline training, comparing standard fasttext vs. floret
+vectors.
+
+For the default project settings with 1M (3.3G) tokenized training texts and 50K
+300-dim vectors, ~800K keys for the standard vectors:
+
+| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |
+| -------------------------------------------- | -------: | -------: | -------: | -------: |
+| none                                         |     72.5 |     85.0 |     73.2 |     64.3 |
+| standard (pruned: 50K vectors for 800K keys) |     77.9 |     89.4 |     78.8 |     72.8 |
+| standard (unpruned: 800K vectors/keys)       |     79.0 |     90.2 |     79.2 |     73.9 |
+| floret (minn 2, maxn 3; 50K vectors, no OOV) | **82.5** | **93.8** | **83.0** | **80.1** |
+
+</Project>
+
+### Updates for spacy-transformers v1.1 {#spacy-transformers}
+
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.1 has
+been refactored to improve serialization and to better support inline
+transformer components and replacing listeners. In addition, the transformer
+model output is provided as
+[`ModelOutput`](https://huggingface.co/transformers/main_classes/output.html?highlight=modeloutput#transformers.file_utils.ModelOutput)
+instead of tuples in `TransformerData.model_output` and
+`FullTransformerBatch.model_output`. For
+backwards compatibility, the tuple format remains available under
+`TransformerData.tensors` and `FullTransformerBatch.tensors`. See more details
+in the [transformer API docs](/api/architectures#TransformerModel).
+
+`spacy-transformers` v1.1 also adds support for `transformer_config` settings
+such as `output_attentions`. Additional output is stored under
+`TransformerData.model_output`. More details are in the
+[TransformerModel docs](/api/architectures#TransformerModel). The training speed
+has been improved by streamlining allocations for tokenizer output and there is
+new support for [mixed-precision training](/api/architectures#TransformerModel).
+
+### New transformer package for Japanese {#pipeline-packages}
+
+spaCy v3.2 adds a new transformer pipeline package for Japanese
+[`ja_core_news_trf`](/models/ja#ja_core_news_trf), which uses the `basic`
+pretokenizer instead of `mecab` to limit the number of dependencies required for
+the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for
+their contributions!
+
+### Pipeline and language updates {#pipeline-updates}
+
+- All Universal Dependencies training data has been updated to v2.8.
+- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos
+  Rodriguez and the Barcelona Supercomputing Center!
+- The transformer pipelines are trained using spacy-transformers v1.1, with
+  improved IO and more options for
+  [model config and output](/api/architectures#TransformerModel).
+- Trailing whitespace has been added as a `tok2vec` feature, improving the
+  performance for many components, especially fine-grained tagging and sentence
+  segmentation.
+- The English attribute ruler patterns have been overhauled to improve
+  `Token.pos` and `Token.morph`.
+
+spaCy v3.2 also features a new Irish lemmatizer, support for `noun_chunks` in
+Portuguese, improved `noun_chunks` for Spanish and additional updates for
+Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
+
+## Notes about upgrading from v3.1 {#upgrading}
+
+### Pipeline package version compatibility {#version-compat}
+
+> #### Using legacy implementations
+>
+> In spaCy v3, you'll still be able to load and reference legacy implementations
+> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
+> components or architectures change and newer versions are available in the
+> core library.
+
+When you're loading a pipeline package trained with spaCy v3.0 or v3.1, you will
+see a warning telling you that the pipeline may be incompatible. This doesn't
+necessarily have to be true, but we recommend running your pipelines against
+your test suite or evaluation data to make sure there are no unexpected results.
+If you're using one of the [trained pipelines](/models) we provide, you should
+run [`spacy download`](/api/cli#download) to update to the latest version. To
+see an overview of all installed packages and their compatibility, you can run
+[`spacy validate`](/api/cli#validate).
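The warning follows from the version range stored in the pipeline's `meta.json`: a range such as `>=3.1.0,<3.2.0` does not include v3.2.0. The sketch below shows that check in a simplified form. It is an assumption-laden illustration, not spaCy's own compatibility check: `parse_version` and `satisfies` are hypothetical helpers that only handle plain `X.Y.Z` versions and the operators listed in `OPS`.

```python
import operator

# Comparison operators, two-character symbols first so ">=" is matched before ">".
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(version):
    """Turn "3.2.0" into (3, 2, 0); pre-release tags are not handled."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, spec):
    """Check a version against a comma-separated range like ">=3.1.0,<3.2.0"."""
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in OPS.items():
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not compare(parse_version(version), bound):
                    return False
                break
        else:
            raise ValueError(f"Unsupported clause: {clause!r}")
    return True


print(satisfies("3.2.0", ">=3.1.0,<3.2.0"))  # False: a v3.1 pipeline triggers the warning
print(satisfies("3.2.0", ">=3.2.0,<3.3.0"))  # True: an updated range matches v3.2
```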
+
+If you've trained your own custom pipeline and you've confirmed that it's still
+working as expected, you can update the spaCy version requirements in the
+[`meta.json`](/api/data-formats#meta):
+
+```diff
+- "spacy_version": ">=3.1.0,<3.2.0",
++ "spacy_version": ">=3.2.0,<3.3.0",
+```
+
+### Updating v3.1 configs
+
+To update a config from spaCy v3.1 with the new v3.2 settings, run
+[`init fill-config`](/api/cli#init-fill-config):
+
+```cli
+$ python -m spacy init fill-config config-v3.1.cfg config-v3.2.cfg
+```
+
+In many cases ([`spacy train`](/api/cli#train),
+[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
+automatically, but you'll need to fill in the new settings to run
+[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
+
+## Notes about upgrading from spacy-transformers v1.0 {#upgrading-transformers}
+
+When you're loading a transformer pipeline package trained with
+[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.0
+after upgrading to `spacy-transformers` v1.1, you'll see a warning telling you
+that the pipeline may be incompatible. `spacy-transformers` v1.1 should be able
+to import v1.0 `transformer` components into the new internal format with no
+change in performance, but here we'd also recommend running your test suite to
+verify that the pipeline still performs as expected.
+
+If you save your pipeline with [`nlp.to_disk`](/api/language#to_disk), it will
+be saved in the new v1.1 format and should be fully compatible with
+`spacy-transformers` v1.1. Once you've confirmed the performance, you can update
+the requirements in [`meta.json`](/api/data-formats#meta):
+
+```diff
+  "requirements": [
+-    "spacy-transformers>=1.0.3,<1.1.0"
++    "spacy-transformers>=1.1.2,<1.2.0"
+  ]
+```
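If several pipeline packages need the same change, the edit can be scripted. The following is a minimal sketch using only the standard library. `bump_requirement` is a hypothetical helper and `my_pipeline` is a placeholder name; the sketch works on an in-memory dict, so reading and writing the actual `meta.json` file is left out.

```python
import json
import re


def bump_requirement(meta, package, new_spec):
    """Return a copy of `meta` with the version pin for `package` replaced."""
    requirements = []
    for req in meta.get("requirements", []):
        # The package name is everything before the first version operator.
        name = re.split(r"[<>=!~]", req, maxsplit=1)[0].strip()
        requirements.append(f"{package}{new_spec}" if name == package else req)
    updated = dict(meta)
    updated["requirements"] = requirements
    return updated


meta = json.loads(
    '{"name": "my_pipeline", "requirements": ["spacy-transformers>=1.0.3,<1.1.0"]}'
)
new_meta = bump_requirement(meta, "spacy-transformers", ">=1.1.2,<1.2.0")
print(json.dumps(new_meta, indent=2))
```

Only run a script like this after confirming that the pipeline still performs as expected, since the pin is what tells users which `spacy-transformers` versions the package was verified against.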
+
+If you're using one of the [trained pipelines](/models) we provide, you should
+run [`spacy download`](/api/cli#download) to update to the latest version. To
+see an overview of all installed packages and their compatibility, you can run
+[`spacy validate`](/api/cli#validate).
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -10,7 +10,8 @@
                    { "text": "Facts & Figures", "url": "/usage/facts-figures" },
                    { "text": "spaCy 101", "url": "/usage/spacy-101" },
                    { "text": "New in v3.0", "url": "/usage/v3" },
-                    { "text": "New in v3.1", "url": "/usage/v3-1" }
+                    { "text": "New in v3.1", "url": "/usage/v3-1" },
+                    { "text": "New in v3.2", "url": "/usage/v3-2" }
                ]
            },
            {
--- a/website/src/templates/index.js
+++ b/website/src/templates/index.js
@@ -119,8 +119,8 @@ const AlertSpace = ({ nightly, legacy }) => {
 }

 const navAlert = (
-    <Link to="/usage/v3-1" hidden>
-        <strong>💥 Out now:</strong> spaCy v3.1
+    <Link to="/usage/v3-2" hidden>
+        <strong>💥 Out now:</strong> spaCy v3.2
    </Link>
 )