title | teaser | menu |
---|---|---|
What's New in v3.3 | New features and how to upgrade | |
New features
spaCy v3.3 improves the speed of core pipeline components, adds a new trainable lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.
Speed improvements
v3.3 includes a slew of speed improvements:
- Speed up parser and NER by using constant-time head lookups.
- Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up inference for tagger, morphologizer, senter and trainable lemmatizer.
- Speed up parser projectivization functions.
- Replace `Ragged` with faster `AlignmentArray` in `Example` for training.
- Improve `Matcher` speed.
- Improve serialization speed for empty `Doc.spans`.
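To illustrate why skipping softmax normalization can be safe at inference time, here is a minimal sketch in plain Python. It is not spaCy's implementation, and the helper names (`softmax`, `argmax`) and the example scores are made up for illustration. It only shows that the highest-scoring class is the same before and after normalization, so a component that just needs the best label can skip the extra work.

```python
import math


def softmax(scores):
    """Normalized probabilities: exponentiate, then divide by the sum."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical raw scores for four candidate tags of one token
scores = [1.2, -0.3, 3.1, 0.4]
probs = softmax(scores)

# Softmax is monotonic, so the winning index does not change
same_label = argmax(scores) == argmax(probs)
print(argmax(scores), argmax(probs), same_label)
```

The normalized values are still needed whenever calibrated probabilities matter, for example during training.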
For longer texts, the trained pipeline speeds improve 15% or more in prediction. We benchmarked `en_core_web_md` (same components as in v3.2) and `de_core_news_md` (with the new trainable lemmatizer) across a range of text sizes on Linux (Intel Xeon W-2265) and OS X (M1) to compare spaCy v3.2 vs. v3.3:
Intel Xeon W-2265
| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
|---|---|---|---|---|
| `en_core_web_md` | 100 | 17292 | 17441 | 0.86% |
| (=same components) | 1000 | 15408 | 16024 | 4.00% |
| | 10000 | 12798 | 15346 | 19.91% |
| `de_core_news_md` | 100 | 20221 | 19321 | -4.45% |
| (+v3.3 trainable lemmatizer) | 1000 | 17480 | 17345 | -0.77% |
| | 10000 | 14513 | 17036 | 17.38% |
Apple M1
| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
|---|---|---|---|---|
| `en_core_web_md` | 100 | 18272 | 18408 | 0.74% |
| (=same components) | 1000 | 18794 | 19248 | 2.42% |
| | 10000 | 15144 | 17513 | 15.64% |
| `de_core_news_md` | 100 | 19227 | 19591 | 1.89% |
| (+v3.3 trainable lemmatizer) | 1000 | 20047 | 20628 | 2.90% |
| | 10000 | 15921 | 18546 | 16.49% |
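The Diff column is consistent with the relative change in words per second from v3.2 to v3.3. The sketch below recomputes a few rows under that assumption; `percent_diff` is a hypothetical helper, and the numbers are copied from the Intel Xeon table above.

```python
def percent_diff(old_wps: float, new_wps: float) -> float:
    """Relative change in throughput from old to new, in percent."""
    return (new_wps - old_wps) / old_wps * 100


# (model, avg. words/doc, v3.2 words/sec, v3.3 words/sec)
rows = [
    ("en_core_web_md", 100, 17292, 17441),
    ("en_core_web_md", 10000, 12798, 15346),
    ("de_core_news_md", 100, 20221, 19321),
]

for model, words_per_doc, v32, v33 in rows:
    print(f"{model:16} {words_per_doc:>6} {percent_diff(v32, v33):+.2f}%")
```

A negative value, as for `de_core_news_md` at 100 words per document, means v3.3 processed fewer words per second than v3.2 in that setting.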
Trainable lemmatizer
The new trainable lemmatizer component uses edit trees to transform tokens into lemmas. Try out the trainable lemmatizer with the training quickstart!
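To give a rough feel for the edit-tree idea, here is a toy sketch in plain Python. It is an illustration under assumptions, not spaCy's implementation: the names `Subst`, `Match`, `build` and `apply_tree` are hypothetical, and the sketch covers only building and applying a tree, not how a trained component chooses one. A tree records which part of a form is kept and how the prefix and suffix around it are rewritten, so a tree learned from one word can be reused on another with the same pattern.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Union


@dataclass
class Subst:
    """Leaf node: replace the exact string `orig` with `repl`."""
    orig: str
    repl: str


@dataclass
class Match:
    """Interior node: keep the middle, rewrite the prefix and suffix."""
    prefix_len: int
    suffix_len: int
    left: "Tree"
    right: "Tree"


Tree = Union[Subst, Match]


def build(form: str, lemma: str) -> Tree:
    """Build an edit tree that rewrites `form` into `lemma`."""
    m = SequenceMatcher(None, form, lemma, autojunk=False).find_longest_match(
        0, len(form), 0, len(lemma)
    )
    if m.size == 0:  # nothing in common: substitute the whole string
        return Subst(form, lemma)
    return Match(
        prefix_len=m.a,
        suffix_len=len(form) - m.a - m.size,
        left=build(form[: m.a], lemma[: m.b]),
        right=build(form[m.a + m.size :], lemma[m.b + m.size :]),
    )


def apply_tree(tree: Tree, form: str) -> Optional[str]:
    """Apply `tree` to `form`; return None if the tree does not fit."""
    if isinstance(tree, Subst):
        return tree.repl if form == tree.orig else None
    if tree.prefix_len + tree.suffix_len > len(form):
        return None
    split = len(form) - tree.suffix_len
    left = apply_tree(tree.left, form[: tree.prefix_len])
    right = apply_tree(tree.right, form[split:])
    if left is None or right is None:
        return None
    return left + form[tree.prefix_len : split] + right


# A tree built from one form/lemma pair generalizes to similar forms
english = build("walked", "walk")
german = build("gesagt", "sagen")
print(apply_tree(english, "talked"))   # strips the "ed" suffix
print(apply_tree(german, "gemacht"))   # strips "ge", rewrites "t" to "en"
print(apply_tree(english, "running"))  # tree does not fit this form
```

In this sketch a tree that does not fit a form returns `None`; a real lemmatizer needs a fallback for that case.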
displaCy support for overlapping spans and arcs
displaCy now supports overlapping spans with a new `span` style and multiple arcs with different labels between the same tokens for `dep` visualizations.
Overlapping spans can be visualized for any spans key in `doc.spans`:
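import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
text = "Welcome to the Bank of China."
doc = nlp(text)
# Two overlapping spans: "Bank of China" (tokens 3-5) and "China" (token 5)
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
# Start a local web server that renders the spans stored under the "custom" key
displacy.serve(doc, style="span", options={"spans_key": "custom"})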