---
title: What's New in v3.2
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New Features {#features hidden="true"}

spaCy v3.2 adds support for [`floret`](https://github.com/explosion/floret)
vectors, makes custom `Doc` creation and scoring easier, and includes many bug
fixes and improvements. For the trained pipelines, there's a new transformer
pipeline for Japanese and the Universal Dependencies training data has been
updated across the board to the most recent release.

<Infobox title="Improve performance for spaCy on Apple M1 with AppleOps" variant="warning" emoji="📣">

spaCy is now up to **8 × faster on M1 Macs** by calling into Apple's
native Accelerate library for matrix multiplication. For more details, see
[`thinc-apple-ops`](https://github.com/explosion/thinc-apple-ops).

```bash
$ pip install spacy[apple]
```

</Infobox>

### Registered scoring functions {#registered-scoring-functions}

To customize the scoring, you can specify a scoring function for each component
in your config from the new [`scorers` registry](/api/top-level#registry):

```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
```
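
As a minimal sketch of how a custom entry could be added to the `scorers`
registry, the example below registers a scorer and attaches it to a tagger. The
registry name `custom_tag_scorer.v1` and both function names are hypothetical,
the scorer only wraps spaCy's `Scorer.score_token_attr`, and spaCy v3.2 or later
is assumed to be installed:

```python
import spacy
from spacy import registry
from spacy.scorer import Scorer
from spacy.tokens import Doc
from spacy.training import Example


def score_tags(examples, **kwargs):
    # Reuse spaCy's built-in token attribute scoring for the "tag" attribute.
    return Scorer.score_token_attr(examples, "tag", **kwargs)


# "custom_tag_scorer.v1" is a made-up name used only for this illustration.
@registry.scorers("custom_tag_scorer.v1")
def make_custom_tag_scorer():
    # The registered function returns the callable that does the scoring.
    return score_tags


nlp = spacy.blank("en")
tagger = nlp.add_pipe(
    "tagger", config={"scorer": {"@scorers": "custom_tag_scorer.v1"}}
)

# Build one predicted/reference pair by hand so no training is required.
words = ["I", "like", "cats"]
predicted = Doc(nlp.vocab, words=words, tags=["PRP", "VBP", "NN"])
reference = Doc(nlp.vocab, words=words, tags=["PRP", "VBP", "NNS"])
scores = tagger.score([Example(predicted, reference)])
print(scores)
```

In a config file, the same function would be referenced as
`scorer = {"@scorers":"custom_tag_scorer.v1"}`, provided the module that
registers it is imported before the config is resolved.
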

### Overwrite settings {#overwrite}

Most pipeline components now include an `overwrite` setting in the config that
determines whether existing annotation in the `Doc` is preserved or overwritten:

```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
overwrite = false
```

### Doc input for pipelines {#doc-input}

[`nlp`](/api/language#call) and [`nlp.pipe`](/api/language#pipe) accept
[`Doc`](/api/doc) input, skipping the tokenizer if a `Doc` is provided instead
of a string. This makes it easier to create a `Doc` with custom tokenization or
to set custom extensions before processing:

```python
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)
```
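
The custom tokenization case can be sketched in the same way. This example
assumes a blank English pipeline with a `sentencizer`, and it registers the
`text_id` extension explicitly, which has to happen before the attribute is set:

```python
import spacy
from spacy.tokens import Doc

# Register the custom attribute; force=True keeps the snippet re-runnable.
Doc.set_extension("text_id", default=None, force=True)

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Build the Doc from pre-tokenized words instead of calling the tokenizer.
words = ["This", "is", "text", "500", "."]
spaces = [True, True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc._.text_id = 500

# The tokenizer is skipped; only the pipeline components run on the Doc.
doc = nlp(doc)
print(doc._.text_id, [sent.text for sent in doc.sents])
```
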

### Support for floret vectors {#vectors}

We recently published [`floret`](https://github.com/explosion/floret), an
extended version of [fastText](https://fasttext.cc) that combines fastText's
subwords with Bloom embeddings for compact, full-coverage vectors. The use of
subwords means that there are no OOV words, and thanks to Bloom embeddings the
vector table can be kept very small at <100K entries. Bloom embeddings are
already used by [HashEmbed](https://thinc.ai/docs/api-layers#hashembed) in
[tok2vec](/api/architectures#tok2vec-arch) for compact spaCy models.

For easy integration, floret includes a
[Python wrapper](https://github.com/explosion/floret/blob/main/python/README.md):

```bash
$ pip install floret
```
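
To illustrate why subwords combined with a Bloom-style table give full coverage
with few rows, here is a library-free sketch. It is not floret's implementation:
the helper names are hypothetical, `hashlib.blake2b` stands in for floret's
actual hash function, and the table size and n-gram lengths are only example
values. Every word, seen or unseen, is split into character n-grams, and each
n-gram is hashed several times into a fixed number of rows:

```python
import hashlib


def subwords(word, minn, maxn):
    """Return the boundary-marked word followed by its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(minn, maxn + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start : start + n])
    return grams


def bloom_rows(ngram, n_rows, n_hashes=2):
    """Map one n-gram to n_hashes row indices in a table with n_rows rows."""
    rows = []
    for seed in range(n_hashes):
        # Stand-in hash: a salted blake2b digest, one salt per hash function.
        digest = hashlib.blake2b(
            ngram.encode("utf8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "little") % n_rows)
    return rows


def rows_for_word(word, n_rows=50_000, minn=4, maxn=5, n_hashes=2):
    """Collect the table rows whose vectors would be combined for a word."""
    rows = []
    for gram in subwords(word, minn, maxn):
        rows.extend(bloom_rows(gram, n_rows, n_hashes))
    return rows


# Any string maps to valid rows, so there is no out-of-vocabulary case.
rows = rows_for_word("epäjärjestelmällisyys")
print(len(rows), max(rows) < 50_000)
```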

A demo project shows how to train and import floret vectors:

<Project id="pipelines/floret_vectors_demo">

Train toy English floret vectors and import them into a spaCy pipeline.

</Project>

Two additional demo projects compare standard fastText vectors with floret
vectors for full spaCy pipelines. For agglutinative languages like Finnish or
Korean, there are large improvements in performance due to the use of subwords
(no OOV words!), with a vector table containing merely 50K entries.

<Project id="pipelines/floret_fi_core_demo">

Finnish UD+NER vector and pipeline training, comparing standard fastText vs.
floret vectors.

For the default project settings with 1M (2.6G) tokenized training texts and 50K
300-dim vectors, ~300K keys for the standard vectors:

| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |    NER F |
| -------------------------------------------- | -------: | -------: | -------: | -------: | -------: |
| none                                         |     93.3 |     92.3 |     79.7 |     72.8 |     61.0 |
| standard (pruned: 50K vectors for 300K keys) |     95.9 |     94.7 |     83.3 |     77.9 |     68.5 |
| standard (unpruned: 300K vectors/keys)       |     96.0 |     95.0 | **83.8** |     78.4 |     69.1 |
| floret (minn 4, maxn 5; 50K vectors, no OOV) | **96.6** | **95.5** |     83.5 | **78.5** | **70.9** |

</Project>

<Project id="pipelines/floret_ko_ud_demo">

Korean UD vector and pipeline training, comparing standard fastText vs. floret
vectors.

For the default project settings with 1M (3.3G) tokenized training texts and 50K
300-dim vectors, ~800K keys for the standard vectors:

| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |
| -------------------------------------------- | -------: | -------: | -------: | -------: |
| none                                         |     72.5 |     85.0 |     73.2 |     64.3 |
| standard (pruned: 50K vectors for 800K keys) |     77.9 |     89.4 |     78.8 |     72.8 |
| standard (unpruned: 800K vectors/keys)       |     79.0 |     90.2 |     79.2 |     73.9 |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | **82.5** | **93.8** | **83.0** | **80.1** |

</Project>

### Updates for spacy-transformers v1.1 {#spacy-transformers}

[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.1 has
been refactored to improve serialization and to better support inline
transformer components and replacing listeners. In addition, the transformer
model output is provided as
[`ModelOutput`](https://huggingface.co/transformers/main_classes/output.html?highlight=modeloutput#transformers.file_utils.ModelOutput)
instead of tuples in `TransformerData.model_output` and
`FullTransformerBatch.model_output`. For backwards compatibility, the tuple
format remains available under `TransformerData.tensors` and
`FullTransformerBatch.tensors`. See more details in the
[transformer API docs](/api/architectures#TransformerModel).

`spacy-transformers` v1.1 also adds support for `transformer_config` settings
such as `output_attentions`. Additional output is stored under
`TransformerData.model_output`. More details are in the
[TransformerModel docs](/api/architectures#TransformerModel). The training speed
has been improved by streamlining allocations for tokenizer output and there is
new support for [mixed-precision training](/api/architectures#TransformerModel).

### New transformer package for Japanese {#pipeline-packages}

spaCy v3.2 adds a new transformer pipeline package for Japanese
[`ja_core_news_trf`](/models/ja#ja_core_news_trf), which uses the `basic`
pretokenizer instead of `mecab` to limit the number of dependencies required for
the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for
their contributions!

### Pipeline and language updates {#pipeline-updates}

- All Universal Dependencies training data has been updated to v2.8.
- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos
  Rodriguez, Carme Armentano and the Barcelona Supercomputing Center!
- The transformer pipelines are trained using spacy-transformers v1.1, with
  improved IO and more options for
  [model config and output](/api/architectures#TransformerModel).
- Trailing whitespace has been added as a `tok2vec` feature, improving the
  performance for many components, especially fine-grained tagging and sentence
  segmentation.
- The English attribute ruler patterns have been overhauled to improve
  `Token.pos` and `Token.morph`.

spaCy v3.2 also features a new Irish lemmatizer, support for `noun_chunks` in
Portuguese, improved `noun_chunks` for Spanish and additional updates for
Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.

## Notes about upgrading from v3.1 {#upgrading}

### Pipeline package version compatibility {#version-compat}

> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you're loading a pipeline package trained with spaCy v3.0 or v3.1, you will
see a warning telling you that the pipeline may be incompatible. This doesn't
necessarily have to be true, but we recommend running your pipelines against
your test suite or evaluation data to make sure there are no unexpected results.
If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).

If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):

```diff
- "spacy_version": ">=3.1.0,<3.2.0",
+ "spacy_version": ">=3.2.0,<3.3.0",
```
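
If several pipeline packages need the same change, the edit can be scripted. The
following is only a sketch using the standard library: the helper name
`set_spacy_version` is hypothetical, and the demo runs against a throwaway
`meta.json` in a temporary directory instead of a real pipeline package:

```python
import json
import tempfile
from pathlib import Path


def set_spacy_version(meta_path, version_range):
    """Rewrite "spacy_version" in a meta.json and return the previous value."""
    path = Path(meta_path)
    meta = json.loads(path.read_text(encoding="utf8"))
    old_range = meta.get("spacy_version")
    meta["spacy_version"] = version_range
    path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return old_range


# Demo on a temporary file so that no real package is modified.
with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = Path(tmp_dir) / "meta.json"
    example_meta = {"name": "my_pipeline", "spacy_version": ">=3.1.0,<3.2.0"}
    meta_path.write_text(json.dumps(example_meta), encoding="utf8")

    old_range = set_spacy_version(meta_path, ">=3.2.0,<3.3.0")
    updated = json.loads(meta_path.read_text(encoding="utf8"))

print(old_range, "->", updated["spacy_version"])
```

Only run such a script after the pipeline has been checked against your test
suite or evaluation data, since changing the range silences the compatibility
warning.
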

### Updating v3.1 configs

To update a config from spaCy v3.1 with the new v3.2 settings, run
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config-v3.1.cfg config-v3.2.cfg
```

In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).

## Notes about upgrading from spacy-transformers v1.0 {#upgrading-transformers}

When you're loading a transformer pipeline package trained with
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.0
after upgrading to `spacy-transformers` v1.1, you'll see a warning telling you
that the pipeline may be incompatible. `spacy-transformers` v1.1 should be able
to import v1.0 `transformer` components into the new internal format with no
change in performance, but here we'd also recommend running your test suite to
verify that the pipeline still performs as expected.

If you save your pipeline with [`nlp.to_disk`](/api/language#to_disk), it will
be saved in the new v1.1 format and should be fully compatible with
`spacy-transformers` v1.1. Once you've confirmed the performance, you can update
the requirements in [`meta.json`](/api/data-formats#meta):

```diff
  "requirements": [
-    "spacy-transformers>=1.0.3,<1.1.0"
+    "spacy-transformers>=1.1.2,<1.2.0"
  ]
```

If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).