spaCy/facts-figures.md at bc85b12e6d0a1e7def18105c8c359e7f6f968e2c

title: Facts & Figures

teaser: The hard numbers for spaCy and how it compares to other tools


Comparison

When should I use spaCy?

✅ I'm a beginner and just getting started with NLP. – spaCy makes it easy to get started and comes with extensive documentation, including a beginner-friendly 101 guide, a free interactive online course and a range of video tutorials.
✅ I want to build an end-to-end production application. – spaCy is specifically designed for production use and lets you build and train powerful NLP pipelines and package them for easy deployment.
✅ I want my application to be efficient on GPU and CPU. – While spaCy lets you train modern NLP models that are best run on GPU, it also offers CPU-optimized pipelines, which are less accurate but much cheaper to run.
✅ I want to try out different neural network architectures for NLP. – spaCy lets you customize and swap out the model architectures powering its components, and implement your own using a framework like PyTorch or TensorFlow. The declarative configuration system makes it easy to mix and match functions and keep track of your hyperparameters to make sure your experiments are reproducible.
❌ I want to build a language generation application. – spaCy's focus is natural language processing and extracting information from large volumes of text. While you can use it to help you rewrite existing text, it doesn't include any specific functionality for language generation tasks.
❌ I want to research machine learning algorithms. – spaCy is built on the latest research, but it's not a research library. If your goal is to write papers and run benchmarks, spaCy is probably not a good choice. However, you can use it to make the results of your research easily available for others to use, e.g. via a custom spaCy component.

Benchmarks

spaCy v3.0 introduces transformer-based pipelines that bring spaCy's accuracy right up to the current state of the art. You can also use a CPU-optimized pipeline, which is less accurate but much cheaper to run.

Evaluation details

OntoNotes 5.0: spaCy's English models are trained on this corpus, as it's several times larger than other English treebanks. However, most systems do not report accuracies on it.

Penn Treebank: The "classic" parsing evaluation for research. However, it's quite far removed from actual usage: it uses sentences with gold-standard segmentation and tokenization, from a pretty specific type of text (articles from a single newspaper, 1984-1989).

import Benchmarks from 'usage/_benchmarks-models.md'

| Dependency Parsing System | UAS | LAS |
| --- | --: | --: |
| spaCy RoBERTa (2020)¹ | 95.5 | 94.3 |
| spaCy CNN (2020)¹ | | |
| Mrini et al. (2019) | 97.4 | 96.3 |
| Zhou and Zhao (2019) | 97.2 | 95.7 |

Dependency parsing accuracy on the Penn Treebank. See NLP-progress for more results. **1.** Project template: benchmarks/parsing_penn_treebank.
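For readers unfamiliar with the two columns: UAS (unlabeled attachment score) is the percentage of tokens assigned the correct head, and LAS (labeled attachment score) additionally requires the correct dependency label. The following is a minimal sketch of that calculation, not spaCy's own scorer. The function name `attachment_scores`, the `(head, label)` tuple layout and the toy sentence are all hypothetical, and the sketch assumes gold-standard tokenization so that gold and predicted tokens align one to one. Real evaluations may also differ in details such as whether punctuation is scored.

```python
from typing import List, Tuple

# Each token is represented as (head_index, dependency_label).
# This layout is an assumption of the sketch, not a spaCy data structure.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned tokens."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS counts tokens whose predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS counts tokens where both head and label match.
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labeled_hits / total


# Hypothetical toy sentence: "She reads books quickly".
# Heads are token indices; the root token points to itself in this sketch.
gold = [(1, "nsubj"), (1, "ROOT"), (1, "dobj"), (1, "advmod")]
predicted = [(1, "nsubj"), (1, "ROOT"), (1, "nmod"), (2, "advmod")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # prints "UAS=75.0 LAS=50.0"
```

In the toy example, three of four heads are correct (UAS 75.0), but one of those three carries the wrong label, so only two tokens count toward LAS (50.0). LAS can never exceed UAS, which matches every row of the table above.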
