title | teaser | next |
---|---|---|
Facts & Figures | The hard numbers for spaCy and how it compares to other tools | /usage/spacy-101 |
Feature comparison
Here's a quick comparison of the functionalities offered by spaCy, NLTK and CoreNLP.
Feature | spaCy | NLTK | CoreNLP |
---|---|---|---|
Programming language | Python | Python | Java / Python |
Neural network models | ✅ | ❌ | ✅ |
Integrated word vectors | ✅ | ❌ | ❌ |
Multi-language support | ✅ | ✅ | ✅ |
Tokenization | ✅ | ✅ | ✅ |
Part-of-speech tagging | ✅ | ✅ | ✅ |
Sentence segmentation | ✅ | ✅ | ✅ |
Dependency parsing | ✅ | ❌ | ✅ |
Entity recognition | ✅ | ✅ | ✅ |
Entity linking | ✅ | ❌ | ❌ |
Coreference resolution | ❌ | ❌ | ✅ |
When should I use what?
Natural Language Understanding is an active area of research and development, so there are many different tools and technologies catering to different use cases. The table below summarizes a few libraries (spaCy, NLTK, AllenNLP, StanfordNLP and TensorFlow) to help you get a feel for how things fit together.
Use case | spaCy | NLTK | AllenNLP | StanfordNLP | TensorFlow |
---|---|---|---|---|---|
I'm a beginner and just getting started with NLP. | ✅ | ✅ | ❌ | ✅ | ❌ |
I want to build an end-to-end production application. | ✅ | ❌ | ❌ | ❌ | ✅ |
I want to try out different neural network architectures for NLP. | ❌ | ❌ | ✅ | ❌ | ✅ |
I want to try the latest models with state-of-the-art accuracy. | ❌ | ❌ | ✅ | ✅ | ✅ |
I want to train models from my own data. | ✅ | ✅ | ✅ | ✅ | ✅ |
I want my application to be efficient on CPU. | ✅ | ✅ | ❌ | ❌ | ❌ |
Benchmarks
Two peer-reviewed papers in 2015 confirmed that spaCy offers the fastest syntactic parser in the world and that its accuracy is within 1% of the best available. The few systems that are more accurate are 20× slower or more.
About the evaluation
The first of the evaluations was published by Yahoo! Labs and Emory University, as part of a survey of current parsing technologies (Choi et al., 2015). Their results and subsequent discussions helped us develop a novel psychologically-motivated technique to improve spaCy's accuracy, which we published in joint work with Macquarie University (Honnibal and Johnson, 2015).
Algorithm comparison
In this section, we compare spaCy's algorithms to recently published systems, using some of the most popular benchmarks. These benchmarks are designed to help isolate the contributions of specific algorithmic decisions, so they promote slightly "idealized" conditions. Specifically, the text comes pre-processed with "gold standard" token and sentence boundaries. The data sets also tend to be fairly small, to help researchers iterate quickly. These conditions mean the models trained on these data sets are not always useful for practical purposes.
Parse accuracy (Penn Treebank / Wall Street Journal)
This is the "classic" evaluation, so it's the number parsing researchers are most easily able to put in context. However, it's quite far removed from actual usage: it uses sentences with gold-standard segmentation and tokenization, from a pretty specific type of text (articles from a single newspaper, 1984-1989).
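With gold-standard tokenization, the predicted and reference parses cover exactly the same tokens, so scoring reduces to comparing each token's head. The sketch below assumes the reported figure is an unlabeled attachment score, which this page does not state explicitly; the function name and the toy sentence are hypothetical.

```python
from typing import Sequence


def unlabeled_attachment_score(gold_heads: Sequence[int],
                               pred_heads: Sequence[int]) -> float:
    """Return the percentage of tokens whose predicted head matches the gold head."""
    if len(gold_heads) != len(pred_heads):
        # Only valid when both parses share one tokenization (the "gold" setting).
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return 100.0 * correct / len(gold_heads)


# Hypothetical 5-token sentence. Each entry is the index of that token's head;
# the root points to itself, an arbitrary convention chosen for this sketch.
gold = [1, 1, 1, 4, 2]
pred = [1, 1, 1, 2, 2]

score = unlabeled_attachment_score(gold, pred)
print(f"{score:.1f}")  # 4 of 5 heads match
```

When evaluation starts from raw text instead, the predicted tokens may not line up with the reference tokens, so a real scorer has to align them first. That step is deliberately left out here.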
Methodology
Andor et al. (2016) chose slightly different experimental conditions from Choi et al. (2015), so the two accuracy tables here do not present directly comparable figures.
System | Year | Type | Accuracy |
---|---|---|---|
spaCy v2.0.0 | 2017 | neural | 94.48 |
spaCy v1.1.0 | 2016 | linear | 92.80 |
Dozat and Manning | 2017 | neural | 95.75 |
Andor et al. | 2016 | neural | 94.44 |
SyntaxNet Parsey McParseface | 2016 | neural | 94.15 |
Weiss et al. | 2015 | neural | 93.91 |
Zhang and McDonald | 2014 | linear | 93.32 |
Martins et al. | 2013 | linear | 93.10 |
NER accuracy (OntoNotes 5, no pre-processing)
This is the evaluation we use to tune spaCy's parameters and to decide which algorithms are better than others. It's reasonably close to actual usage, because it requires the parses to be produced from raw text, without any pre-processing.
System | Year | Type | Accuracy |
---|---|---|---|
spaCy en_core_web_lg v2.0.0a3 | 2017 | neural | 85.85 |
Strubell et al. | 2017 | neural | 86.81 |
Chiu and Nichols | 2016 | neural | 86.19 |
Durrett and Klein | 2014 | neural | 84.04 |
Ratinov and Roth | 2009 | linear | 83.45 |
Model comparison
In this section, we provide benchmark accuracies for the pre-trained model pipelines we distribute with spaCy. Evaluations are conducted end-to-end from raw text, with no "gold standard" pre-processing, over text from a mix of genres where possible.
Methodology
The evaluation was conducted on raw text with no gold standard information. The parser, tagger and entity recognizer were trained on the OntoNotes 5 corpus, the word vectors on Common Crawl.
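To make the "NER F" column in the tables below more concrete, here is a minimal sketch of an entity-level F-score. It assumes the usual exact-match definition (an entity counts only if its start, end and label all agree), which this page does not spell out. Spans are given as character offsets, so the comparison still works when the text was tokenized by the model rather than by hand. The function name and the example spans are hypothetical.

```python
from typing import Set, Tuple

# (start_char, end_char, label): character offsets are independent of tokenization.
Span = Tuple[int, int, str]


def entity_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching entity spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one document.
gold_spans = {(0, 5, "ORG"), (10, 16, "GPE"), (20, 24, "DATE")}
pred_spans = {(0, 5, "ORG"), (10, 16, "LOC")}  # second span has the wrong label

precision, recall, f1 = entity_prf(gold_spans, pred_spans)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

The tables report the score as a percentage, so the value returned here would be multiplied by 100.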
English
Model | spaCy | Type | UAS | NER F | POS | WPS | Size |
---|---|---|---|---|---|---|---|
en_core_web_sm 2.0.0 | 2.x | neural | 91.7 | 85.3 | 97.0 | 10.1k | 35MB |
en_core_web_md 2.0.0 | 2.x | neural | 91.7 | 85.9 | 97.1 | 10.0k | 115MB |
en_core_web_lg 2.0.0 | 2.x | neural | 91.9 | 85.9 | 97.2 | 10.0k | 812MB |
en_core_web_sm 1.2.0 | 1.x | linear | 86.6 | 78.5 | 96.6 | 25.7k | 50MB |
en_core_web_md 1.2.1 | 1.x | linear | 90.6 | 81.4 | 96.7 | 18.8k | 1GB |
Spanish
Evaluation note
The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy measured against correct human annotations.
Model | spaCy | Type | UAS | NER F | POS | WPS | Size |
---|---|---|---|---|---|---|---|
es_core_news_sm 2.0.0 | 2.x | neural | 89.8 | 88.7 | 96.9 | n/a | 35MB |
es_core_news_md 2.0.0 | 2.x | neural | 90.2 | 89.0 | 97.8 | n/a | 93MB |
es_core_web_md 1.1.0 | 1.x | linear | 87.5 | 94.2 | 96.7 | n/a | 377MB |
Detailed speed comparison
Here we compare the per-document processing time of various spaCy functionalities against other NLP libraries. We show both absolute timings (in ms) and relative performance (normalized to spaCy). Lower is better.
This evaluation was conducted in 2015. We're working on benchmarks on current CPU and GPU hardware. In the meantime, we're grateful to the Stanford folks for drawing our attention to what seems to be a long-standing error in our CoreNLP benchmarks, especially for their tokenizer. Until we run corrected experiments, we have updated the table using their figures.
Methodology
- Set up: 100,000 plain-text documents were streamed from an SQLite3 database, and processed with an NLP library, to one of three levels of detail — tokenization, tagging, or parsing. The tasks are additive: to parse the text you have to tokenize and tag it. The pre-processing was not subtracted from the times — we report the time required for the pipeline to complete. We report mean times per document, in milliseconds.
- Hardware: Intel i7-3770 (2012)
- Implementation: spacy-benchmarks
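The set-up described above can be sketched roughly as follows. This is not the code from spacy-benchmarks: the table name `documents`, the helper names and the use of `str.split` as a stand-in for an NLP pipeline are all assumptions made for illustration. Documents are streamed from SQLite one at a time, and the whole loop is timed so that nothing is subtracted from the result.

```python
import sqlite3
import time
from typing import Callable, Iterator, Tuple


def stream_documents(conn: sqlite3.Connection) -> Iterator[str]:
    """Yield document texts one at a time instead of loading them all at once."""
    # Hypothetical schema: a table "documents" with columns "id" and "text".
    for (text,) in conn.execute("SELECT text FROM documents ORDER BY id"):
        yield text


def mean_ms_per_doc(process: Callable[[str], object],
                    conn: sqlite3.Connection) -> Tuple[float, int]:
    """Run `process` over every document; return (mean ms per document, count)."""
    n_docs = 0
    start = time.perf_counter()
    for text in stream_documents(conn):
        process(text)  # tokenize, tag or parse, depending on the level measured
        n_docs += 1
    elapsed = time.perf_counter() - start
    mean_ms = 1000.0 * elapsed / n_docs if n_docs else 0.0
    return mean_ms, n_docs


# Small in-memory database standing in for the 100,000-document corpus.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO documents (text) VALUES (?)",
    [(f"Document number {i} has a few words.",) for i in range(1000)],
)

# str.split is only a placeholder; a real run would pass a library's pipeline.
mean_ms, n_docs = mean_ms_per_doc(str.split, conn)
print(f"{n_docs} documents, {mean_ms:.4f} ms per document")
```

Timing the full loop means database reads are included in the figure. That is one reading of "the time required for the pipeline to complete"; a harness that excludes I/O would report smaller numbers.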
System | Tokenize (absolute, ms per doc) | Tag (absolute, ms per doc) | Parse (absolute, ms per doc) | Tokenize (relative to spaCy) | Tag (relative to spaCy) | Parse (relative to spaCy) |
---|---|---|---|---|---|---|
spaCy | 0.2ms | 1ms | 19ms | 1x | 1x | 1x |
CoreNLP | 0.18ms | 10ms | 49ms | 0.9x | 10x | 2.6x |
ZPar | 1ms | 8ms | 850ms | 5x | 8x | 44.7x |
NLTK | 4ms | 443ms | n/a | 20x | 443x | n/a |
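The relative figures are the absolute timings divided by spaCy's timing for the same task. A short sketch of that normalization, using the numbers from the table (the dictionary layout and the function name are illustrative choices, not part of any benchmark code):

```python
from typing import Dict, Optional

Timings = Dict[str, Dict[str, Optional[float]]]

# Absolute timings in milliseconds per document, copied from the table.
absolute_ms: Timings = {
    "spaCy": {"tokenize": 0.2, "tag": 1.0, "parse": 19.0},
    "CoreNLP": {"tokenize": 0.18, "tag": 10.0, "parse": 49.0},
    "ZPar": {"tokenize": 1.0, "tag": 8.0, "parse": 850.0},
    "NLTK": {"tokenize": 4.0, "tag": 443.0, "parse": None},  # n/a in the table
}


def relative_to(baseline: str, timings: Timings) -> Timings:
    """Divide every timing by the baseline system's timing for the same task."""
    base = timings[baseline]
    return {
        system: {
            task: None if ms is None else round(ms / base[task], 1)
            for task, ms in tasks.items()
        }
        for system, tasks in timings.items()
    }


relative = relative_to("spaCy", absolute_ms)
for system, tasks in relative.items():
    cells = ", ".join(
        f"{task}: {'n/a' if ratio is None else f'{ratio}x'}"
        for task, ratio in tasks.items()
    )
    print(f"{system}: {cells}")
```

A ratio below 1, such as CoreNLP's tokenizer, means that system was faster than spaCy on that task.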