spaCy/facts-figures.md at 499c39acba4bedd1b73e3c0687020387013ee4ba

title: Facts & Figures

teaser: The hard numbers for spaCy and how it compares to other tools

Feature comparison

Here's a quick comparison of the functionalities offered by spaCy, NLTK and CoreNLP.

| | spaCy | NLTK | CoreNLP |
| --- | :---: | :---: | :---: |
| Programming language | Python | Python | Java / Python |
| Neural network models | ✅ | ❌ | ✅ |
| Integrated word vectors | ✅ | ❌ | ❌ |
| Multi-language support | ✅ | ✅ | ✅ |
| Tokenization | ✅ | ✅ | ✅ |
| Part-of-speech tagging | ✅ | ✅ | ✅ |
| Sentence segmentation | ✅ | ✅ | ✅ |
| Dependency parsing | ✅ | ❌ | ✅ |
| Entity recognition | ✅ | ✅ | ✅ |
| Entity linking | ✅ | ❌ | ❌ |
| Coreference resolution | ❌ | ❌ | ✅ |

When should I use what?

Natural Language Understanding is an active area of research and development, so there are many different tools and technologies catering to different use cases. The table below summarizes a few libraries (spaCy, NLTK, AllenNLP, StanfordNLP and TensorFlow) to help you get a feel for how things fit together.

| | spaCy | NLTK | AllenNLP | StanfordNLP | TensorFlow |
| --- | :---: | :---: | :---: | :---: | :---: |
| I'm a beginner and just getting started with NLP. | ✅ | ✅ | ❌ | ✅ | ❌ |
| I want to build an end-to-end production application. | ✅ | ❌ | ❌ | ❌ | ✅ |
| I want to try out different neural network architectures for NLP. | ❌ | ❌ | ✅ | ❌ | ✅ |
| I want to try the latest models with state-of-the-art accuracy. | ❌ | ❌ | ✅ | ✅ | ✅ |
| I want to train models from my own data. | ✅ | ✅ | ✅ | ✅ | ✅ |
| I want my application to be efficient on CPU. | ✅ | ✅ | ❌ | ❌ | ❌ |

Benchmarks

Two peer-reviewed papers in 2015 confirmed that spaCy offers the fastest syntactic parser in the world and that its accuracy is within 1% of the best available. The few systems that are more accurate are 20× slower or more.

About the evaluation

The first of the evaluations was published by Yahoo! Labs and Emory University, as part of a survey of current parsing technologies (Choi et al., 2015). Their results and subsequent discussions helped us develop a novel psychologically-motivated technique to improve spaCy's accuracy, which we published in joint work with Macquarie University (Honnibal and Johnson, 2015).

import BenchmarksChoi from 'usage/_benchmarks-choi.md'

Algorithm comparison

In this section, we compare spaCy's algorithms to recently published systems, using some of the most popular benchmarks. These benchmarks are designed to help isolate the contributions of specific algorithmic decisions, so they promote slightly "idealized" conditions. Specifically, the text comes pre-processed with "gold standard" token and sentence boundaries. The data sets also tend to be fairly small, to help researchers iterate quickly. These conditions mean the models trained on these data sets are not always useful for practical purposes.

Parse accuracy (Penn Treebank / Wall Street Journal)

This is the "classic" evaluation, so it's the number parsing researchers are most easily able to put in context. However, it's quite far removed from actual usage: it uses sentences with gold-standard segmentation and tokenization, from a pretty specific type of text (articles from a single newspaper, 1984-1989).

Methodology

Andor et al. (2016) chose slightly different experimental conditions from Choi et al. (2015), so the two accuracy tables here do not present directly comparable figures.

| System | Year | Type | Accuracy |
| --- | --- | --- | ---: |
| spaCy v2.0.0 | 2017 | neural | 94.48 |
| spaCy v1.1.0 | 2016 | linear | 92.80 |
| Dozat and Manning | 2017 | neural | 95.75 |
| Andor et al. | 2016 | neural | 94.44 |
| SyntaxNet Parsey McParseface | 2016 | neural | 94.15 |
| Weiss et al. | 2015 | neural | 93.91 |
| Zhang and McDonald | 2014 | linear | 93.32 |
| Martins et al. | 2013 | linear | 93.10 |
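
For readers who want to see how a parse accuracy figure of this kind is derived, here is a minimal sketch. It assumes the score is an unlabeled attachment score (UAS, the metric named in the model tables on this page): the percentage of tokens whose predicted head matches the gold-standard head. The function name and the toy sentence are illustrative and are not taken from spaCy's evaluation code.

```python
from typing import Sequence


def unlabeled_attachment_score(
    gold_heads: Sequence[int], pred_heads: Sequence[int]
) -> float:
    """Return the percentage of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold_heads:
        return 0.0
    correct = sum(1 for g, p in zip(gold_heads, pred_heads) if g == p)
    return 100.0 * correct / len(gold_heads)


# Hypothetical toy data for "She ate the apple": each entry is the index of the
# token's head, and the root ("ate") points at itself.
gold = [1, 1, 3, 1]
pred = [1, 1, 1, 1]  # "the" is wrongly attached to "ate" instead of "apple"

uas = unlabeled_attachment_score(gold, pred)
print(f"UAS: {uas:.1f}")  # prints "UAS: 75.0"
```

Because the benchmark supplies gold-standard tokenization, gold and predicted parses always align token for token, which is why the sketch can compare the two lists position by position.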

NER accuracy (OntoNotes 5, no pre-process)

This is the evaluation we use to tune spaCy's parameters and to decide which algorithms are better than others. It's reasonably close to actual usage, because it requires the parses to be produced from raw text, without any pre-processing.

| System | Year | Type | Accuracy |
| --- | --- | --- | ---: |
| spaCy `en_core_web_lg` v2.0.0a3 | 2017 | neural | 85.85 |
| Strubell et al. | 2017 | neural | 86.81 |
| Chiu and Nichols | 2016 | neural | 86.19 |
| Durrett and Klein | 2014 | neural | 84.04 |
| Ratinov and Roth | 2009 | linear | 83.45 |
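
As a rough illustration of what an NER accuracy number measures, the sketch below assumes the figures are entity-level F-scores (the model tables on this page label the column "NER F"), where a predicted entity counts as correct only if its span and label both match the gold annotation exactly. The `Span` type, the function name and the example spans are hypothetical.

```python
from typing import Set, Tuple

# (start token index, end token index, label); an assumed representation.
Span = Tuple[int, int, str]


def entity_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return precision, recall and F-score (as percentages) over exact span matches."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f_score


# Hypothetical annotations: one exact match, one span with the wrong label,
# and one gold entity that was missed entirely.
gold_spans = {(0, 2, "PERSON"), (5, 6, "ORG"), (9, 10, "GPE")}
pred_spans = {(0, 2, "PERSON"), (5, 6, "GPE")}

p, r, f = entity_prf(gold_spans, pred_spans)
print(f"P={p:.1f} R={r:.1f} F={f:.1f}")
```

When the evaluation starts from raw text, a tokenization error can shift a span boundary and turn an otherwise correct entity into a miss, which is one reason raw-text scores tend to be lower than scores on pre-tokenized data.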

Model comparison

In this section, we provide benchmark accuracies for the pre-trained model pipelines we distribute with spaCy. Evaluations are conducted end-to-end from raw text, with no "gold standard" pre-processing, over text from a mix of genres where possible.

Methodology

The evaluation was conducted on raw text with no gold standard information. The parser, tagger and entity recognizer were trained on the OntoNotes 5 corpus, the word vectors on Common Crawl.

English

| Model | spaCy | Type | UAS | NER F | POS | WPS | Size |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: |
| `en_core_web_sm` 2.0.0 | 2.x | neural | 91.7 | 85.3 | 97.0 | 10.1k | 35MB |
| `en_core_web_md` 2.0.0 | 2.x | neural | 91.7 | 85.9 | 97.1 | 10.0k | 115MB |
| `en_core_web_lg` 2.0.0 | 2.x | neural | 91.9 | 85.9 | 97.2 | 10.0k | 812MB |
| `en_core_web_sm` 1.2.0 | 1.x | linear | 86.6 | 78.5 | 96.6 | 25.7k | 50MB |
| `en_core_web_md` 1.2.1 | 1.x | linear | 90.6 | 81.4 | 96.7 | 18.8k | 1GB |

Spanish

Evaluation note

The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy on correct human annotations.

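| Model | spaCy | Type | UAS | NER F | POS | WPS | Size |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: |
| `es_core_news_sm` 2.0.0 | 2.x | neural | 89.8 | 88.7 | 96.9 | n/a | 35MB |
| `es_core_news_md` 2.0.0 | 2.x | neural | 90.2 | 89.0 | 97.8 | n/a | 93MB |
| `es_core_web_md` 1.1.0 | 1.x | linear | 87.5 | 94.2 | 96.7 | n/a | 377MB |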

Detailed speed comparison

Here we compare the per-document processing time of various spaCy functionalities against other NLP libraries. We show both absolute timings (in ms) and relative performance (normalized to spaCy). Lower is better.

This evaluation was conducted in 2015. We're working on benchmarks on current CPU and GPU hardware. In the meantime, we're grateful to the Stanford folks for drawing our attention to what seems to be a long-standing error in our CoreNLP benchmarks, especially for their tokenizer. Until we run corrected experiments, we have updated the table using their figures.

Methodology

Set up: 100,000 plain-text documents were streamed from an SQLite3 database, and processed with an NLP library, to one of three levels of detail — tokenization, tagging, or parsing. The tasks are additive: to parse the text you have to tokenize and tag it. The pre-processing was not subtracted from the times — we report the time required for the pipeline to complete. We report mean times per document, in milliseconds.

Hardware: Intel i7-3770 (2012)

Implementation: spacy-benchmarks
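
The set-up described above can be sketched roughly as follows. This is not the code from spacy-benchmarks. It is a minimal illustration that assumes a hypothetical `documents` table with a `text` column, and it uses a whitespace tokenizer as a stand-in for the NLP library under test. As in the methodology, the timer wraps the whole per-document call and the result is a mean in milliseconds. In this sketch the time spent reading rows from SQLite is included as well.

```python
import sqlite3
import time
from typing import Callable, Iterator, List


def stream_documents(conn: sqlite3.Connection) -> Iterator[str]:
    """Yield document texts one at a time rather than loading them all at once."""
    for (text,) in conn.execute("SELECT text FROM documents ORDER BY id"):
        yield text


def mean_ms_per_doc(docs: Iterator[str], process: Callable[[str], object]) -> float:
    """Run `process` on every document and return the mean wall-clock time in ms."""
    n_docs = 0
    start = time.perf_counter()
    for text in docs:
        process(text)
        n_docs += 1
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / n_docs if n_docs else 0.0


def whitespace_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real library call such as a full NLP pipeline.
    return text.split()


# Build a small in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO documents (text) VALUES (?)",
    [("This is a short document.",), ("Here is another one.",)] * 50,
)

ms = mean_ms_per_doc(stream_documents(conn), whitespace_tokenize)
print(f"{ms:.4f} ms per document")
```

To compare the three levels of detail, the same loop would be run once per level, swapping in a `process` callable that tokenizes, tags, or parses.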

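| System | Tokenize (absolute, ms per doc) | Tag (absolute, ms per doc) | Parse (absolute, ms per doc) | Tokenize (relative to spaCy) | Tag (relative to spaCy) | Parse (relative to spaCy) |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| spaCy | 0.2ms | 1ms | 19ms | 1x | 1x | 1x |
| CoreNLP | 0.18ms | 10ms | 49ms | 0.9x | 10x | 2.6x |
| ZPar | 1ms | 8ms | 850ms | 5x | 8x | 44.7x |
| NLTK | 4ms | 443ms | n/a | 20x | 443x | n/a |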
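
The relative figures in the table follow directly from the absolute ones: each timing is divided by spaCy's timing for the same task. The short sketch below reproduces that normalization from the numbers in the table. The dictionary layout and the function name are illustrative choices, not part of any benchmark code.

```python
from typing import Dict, Optional

Timings = Dict[str, Optional[float]]

# Absolute timings in ms per document, copied from the table above.
absolute_ms: Dict[str, Timings] = {
    "spaCy": {"tokenize": 0.2, "tag": 1.0, "parse": 19.0},
    "CoreNLP": {"tokenize": 0.18, "tag": 10.0, "parse": 49.0},
    "ZPar": {"tokenize": 1.0, "tag": 8.0, "parse": 850.0},
    "NLTK": {"tokenize": 4.0, "tag": 443.0, "parse": None},  # no parser timing
}


def normalize(timings: Dict[str, Timings], baseline: str) -> Dict[str, Timings]:
    """Divide every timing by the baseline system's timing for the same task."""
    base = timings[baseline]
    relative: Dict[str, Timings] = {}
    for system, tasks in timings.items():
        relative[system] = {
            task: None if value is None else round(value / base[task], 1)
            for task, value in tasks.items()
        }
    return relative


relative = normalize(absolute_ms, baseline="spaCy")
for system, tasks in relative.items():
    cells = ", ".join(
        f"{task}: {'n/a' if ratio is None else f'{ratio:g}x'}"
        for task, ratio in tasks.items()
    )
    print(f"{system}: {cells}")
```

A ratio below 1 means the system is faster than spaCy on that task, as with CoreNLP's tokenizer at 0.9x.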
