spaCy/v3-3.md at 0fa004c4cd718319d750abad896447c114f39106

title: What's New in v3.3
teaser: New features and how to upgrade
menu: New Features (features), Upgrading Notes (upgrading)

New features

spaCy v3.3 improves the speed of core pipeline components, adds a new trainable lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.

Speed improvements

v3.3 includes a slew of speed improvements:

- Speed up parser and NER by using constant-time head lookups.
- Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up inference for tagger, morphologizer, senter and trainable lemmatizer.
- Speed up parser projectivization functions.
- Replace `Ragged` with faster `AlignmentArray` in `Example` for training.
- Improve `Matcher` speed.
- Improve serialization speed for empty `Doc.spans`.
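
To illustrate why unnormalized scores can be enough at inference time, here is a minimal plain-Python sketch. It is not spaCy's or Thinc's actual code, and the scores are made up for illustration. Softmax preserves the ordering of the scores, so when only the best label is needed, the exponentiation and normalization steps can be skipped without changing the prediction:

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: exponentiate, then normalize to sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical per-label scores for a single token (not real model output).
scores = [1.2, -0.3, 3.4, 0.8]
probs = softmax(scores)

# The winning label is the same with or without normalization.
best_from_scores = argmax(scores)
best_from_probs = argmax(probs)
print(best_from_scores, best_from_probs)
```

Normalized probabilities are still needed whenever the values themselves are consumed, for example during training or when thresholding on confidence.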

For longer texts, trained pipelines are 15% or more faster at prediction. We benchmarked `en_core_web_md` (same components as in v3.2) and `de_core_news_md` (with the new trainable lemmatizer) across a range of text sizes on Linux (Intel Xeon W-2265) and macOS (Apple M1) to compare spaCy v3.2 with v3.3:

Intel Xeon W-2265

| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
| --- | ---: | ---: | ---: | ---: |
| `en_core_web_md` | 100 | 17292 | 17441 | 0.86% |
| (=same components) | 1000 | 15408 | 16024 | 4.00% |
| | 10000 | 12798 | 15346 | 19.91% |
| `de_core_news_md` | 100 | 20221 | 19321 | -4.45% |
| (+v3.3 trainable lemmatizer) | 1000 | 17480 | 17345 | -0.77% |
| | 10000 | 14513 | 17036 | 17.38% |

Apple M1

| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
| --- | ---: | ---: | ---: | ---: |
| `en_core_web_md` | 100 | 18272 | 18408 | 0.74% |
| (=same components) | 1000 | 18794 | 19248 | 2.42% |
| | 10000 | 15144 | 17513 | 15.64% |
| `de_core_news_md` | 100 | 19227 | 19591 | 1.89% |
| (+v3.3 trainable lemmatizer) | 1000 | 20047 | 20628 | 2.90% |
| | 10000 | 15921 | 18546 | 16.49% |
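
The Diff column in the tables above appears to be the relative change in words per second from v3.2 to v3.3. A small sketch of that arithmetic, assuming this reading of the column (the helper name `percent_diff` is ours, and the two rows are copied from the Intel Xeon table):

```python
def percent_diff(old_wps: float, new_wps: float) -> float:
    """Relative change in throughput, in percent (positive means faster)."""
    return (new_wps - old_wps) / old_wps * 100.0


# (model, avg. words/doc, v3.2 words/sec, v3.3 words/sec) from the Intel Xeon table
rows = [
    ("en_core_web_md", 10000, 12798, 15346),
    ("de_core_news_md", 100, 20221, 19321),
]

for model, words_per_doc, v32, v33 in rows:
    print(f"{model} @ {words_per_doc} words/doc: {percent_diff(v32, v33):+.2f}%")
```

A negative value, as in the short-document `de_core_news_md` row, means v3.3 processed fewer words per second than v3.2 in that setting.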

Trainable lemmatizer

The new trainable lemmatizer component uses edit trees to transform tokens into lemmas. Try out the trainable lemmatizer with the training quickstart!
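
To give an intuition for edit trees, here is a simplified, self-contained sketch. It is not spaCy's implementation, and all class and function names are hypothetical. It assumes the common formulation in which a tree is built by recursively splitting a form/lemma pair around their longest common substring, leaving substitution nodes for the parts that differ. A tree learned from one pair can then be applied to other tokens with the same pattern:

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Subst:
    """Leaf node: replace exactly `orig` with `repl`."""
    orig: str
    repl: str


@dataclass(frozen=True)
class Match:
    """Interior node: keep the middle, recurse into prefix and suffix."""
    prefix_len: int
    suffix_len: int
    left: Optional["Node"]
    right: Optional["Node"]


Node = Union[Subst, Match]


def longest_common_substring(a: str, b: str) -> Tuple[int, int, int]:
    """Return (start_in_a, start_in_b, length) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def build_tree(form: str, lemma: str) -> Optional[Node]:
    """Build an edit tree that rewrites `form` into `lemma`."""
    if not form and not lemma:
        return None  # nothing to change on this side
    start_f, start_l, length = longest_common_substring(form, lemma)
    if length == 0:
        return Subst(form, lemma)  # no shared material: plain substitution
    end_f, end_l = start_f + length, start_l + length
    return Match(
        prefix_len=start_f,
        suffix_len=len(form) - end_f,
        left=build_tree(form[:start_f], lemma[:start_l]),
        right=build_tree(form[end_f:], lemma[end_l:]),
    )


def apply_tree(tree: Optional[Node], form: str) -> Optional[str]:
    """Apply a tree to a token; return None if the tree does not fit."""
    if tree is None:
        return "" if form == "" else None
    if isinstance(tree, Subst):
        return tree.repl if form == tree.orig else None
    if tree.prefix_len + tree.suffix_len > len(form):
        return None
    split = len(form) - tree.suffix_len
    left = apply_tree(tree.left, form[: tree.prefix_len])
    right = apply_tree(tree.right, form[split:])
    if left is None or right is None:
        return None
    return left + form[tree.prefix_len : split] + right


tree = build_tree("walked", "walk")   # learned from one example pair
print(apply_tree(tree, "talked"))     # same "-ed" pattern, different stem
print(apply_tree(tree, "running"))    # tree does not fit this token
```

In this framing, lemmatization becomes a classification problem: collect the distinct trees seen in the training data and train a model to pick the tree that applies to each token.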

displaCy support for overlapping spans and arcs

displaCy now supports overlapping spans with a new span style and multiple arcs with different labels between the same tokens for dep visualizations.

Overlapping spans can be visualized for any spans key in doc.spans:

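import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
text = "Welcome to the Bank of China."
doc = nlp(text)
# Two overlapping spans: "Bank of China" (tokens 3-5) and "China" (token 5)
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
# Start a local web server that renders the spans stored under the "custom" key
displacy.serve(doc, style="span", options={"spans_key": "custom"})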
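
For the `dep` style, multiple arcs between the same tokens can be tried without a trained pipeline by using displaCy's manual input format, a dict with `words` and `arcs` entries. The sketch below only builds that dict in plain Python. The helper name `to_displacy_dep` and the example sentence and labels are hypothetical, chosen to show two differently labeled arcs over the same token pair:

```python
from typing import Dict, List, Tuple


def to_displacy_dep(
    words: List[Tuple[str, str]],
    arcs: List[Tuple[int, int, str]],
) -> Dict[str, list]:
    """Build displaCy's manual `dep` input from (text, tag) and (head, child, label)."""
    parsed_arcs = []
    for head, child, label in arcs:
        start, end = sorted((head, child))
        parsed_arcs.append(
            {
                "start": start,
                "end": end,
                "label": label,
                # The arrow points at the child: "left" if it precedes its head.
                "dir": "left" if child < head else "right",
            }
        )
    return {
        "words": [{"text": text, "tag": tag} for text, tag in words],
        "arcs": parsed_arcs,
    }


parsed = to_displacy_dep(
    words=[("Sam", "PROPN"), ("sings", "VERB"), ("loudly", "ADV")],
    arcs=[
        (1, 0, "nsubj"),   # sings -> Sam
        (1, 0, "agent"),   # hypothetical second label over the same token pair
        (1, 2, "advmod"),  # sings -> loudly
    ],
)
print(parsed["arcs"])
```

Assuming spaCy v3.3 or later is installed, the resulting dict can be passed to `displacy.render(parsed, style="dep", manual=True)` to draw both arcs between "Sam" and "sings".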

