spaCy is a **free, open-source library** for advanced **Natural Language
Processing** (NLP) in Python.
If you're working with a lot of text, you'll eventually want to know more about
it. For example, what's it about? What do the words mean in context? Who is
doing what to whom? What companies and products are mentioned? Which texts are
similar to each other?
spaCy is designed specifically for **production use** and helps you build
applications that process and "understand" large volumes of text. It can be used
to build **information extraction** or **natural language understanding**
systems, or to pre-process text for **deep learning**.
- [Features](#features)
- [Linguistic annotations](#annotations)
  - [Tokenization](#annotations-token)
  - [POS tags and dependencies](#annotations-pos-deps)
  - [Named entities](#annotations-ner)
- [Word vectors and similarity](#vectors-similarity)
- [Pipelines](#pipelines)
- [Vocab, hashes and lexemes](#vocab)
- [Serialization](#serialization)
- [Training](#training)
- [Language data](#language-data)
- [Lightning tour](#lightning-tour)
- [Architecture](#architecture)
- [Community & FAQ](#community)
### What spaCy isn't {#what-spacy-isnt}
- **spaCy is not a platform or "an API"**. Unlike a platform, spaCy does not
provide software as a service or a web application. It's an open-source
library designed to help you build NLP applications, not a consumable service.
- **spaCy is not an out-of-the-box chat bot engine**. While spaCy can be used to
power conversational applications, it's not designed specifically for chat
bots, and only provides the underlying text processing capabilities.
- **spaCy is not research software**. It's built on the latest research, but
it's designed to get things done. This leads to fairly different design
decisions than [NLTK](https://github.com/nltk/nltk) or
[CoreNLP](https://stanfordnlp.github.io/CoreNLP/), which were created as
platforms for teaching and research. The main difference is that spaCy is
integrated and opinionated. spaCy tries to avoid asking the user to choose
between multiple algorithms that deliver equivalent functionality. Keeping the
menu small lets spaCy deliver generally better performance and developer
experience.
- **spaCy is not a company**. It's an open-source library. Our company, which
publishes spaCy and other software, is called
[Explosion](https://explosion.ai).
## Features {#features}
In the documentation, you'll come across mentions of spaCy's features and
capabilities. Some of them refer to linguistic concepts, while others are
related to more general machine learning functionality.
| Name | Description |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Tokenization** | Segmenting text into words, punctuation marks etc. |
| **Part-of-speech** (POS) **Tagging** | Assigning word types to tokens, like verb or noun. |
| **Dependency Parsing** | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| **Lemmatization** | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
| **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences. |
| **Named Entity Recognition** (NER) | Labelling named "real-world" objects, like persons, companies or locations. |
| **Entity Linking** (EL) | Disambiguating textual entities to unique identifiers in a knowledge base. |
| **Similarity** | Comparing words, text spans and documents to determine how similar they are to each other. |
| **Text Classification** | Assigning categories or labels to a whole document, or parts of a document. |
| **Rule-based Matching** | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| **Training** | Updating and improving a statistical model's predictions. |
| **Serialization** | Saving objects to files or byte strings. |
### Statistical models {#statistical-models}
While some of spaCy's features work independently, others require
[statistical models](/models) to be loaded, which enable spaCy to **predict**
linguistic annotations – for example, whether a word is a verb or a noun. spaCy
currently offers statistical models for a variety of languages, which can be
installed as individual Python modules. Models can differ in size, speed, memory
usage, accuracy and the data they include. The model you choose always depends
on your use case and the texts you're working with. For a general-purpose use
case, the small, default models are always a good start. They typically include
the following components:
- **Binary weights** for the part-of-speech tagger, dependency parser and named
entity recognizer to predict those annotations in context.
- **Lexical entries** in the vocabulary, i.e. words and their
context-independent attributes like the shape or spelling.
- **Data files** like lemmatization rules and lookup tables.
- **Word vectors**, i.e. multi-dimensional meaning representations of words that
let you determine how similar they are to each other.
- **Configuration** options, like the language and processing pipeline settings,
to put spaCy in the correct state when you load in the model.
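To get a feel for the components listed above, you can inspect a loaded pipeline. The following is a minimal sketch: `describe_pipeline` is a hypothetical helper written for this example and not part of spaCy. It uses a blank English pipeline so that it runs without a downloaded model, which means the component list and vectors table will be empty here.

```python
import spacy


def describe_pipeline(nlp):
    """Hypothetical helper: summarize what a Language object contains."""
    return {
        # Language code from the pipeline's configuration, e.g. "en".
        "lang": nlp.lang,
        # Names of the processing components, in the order they run.
        "components": list(nlp.pipe_names),
        # (number of vectors, vector width) of the word vectors table.
        "vectors_shape": tuple(nlp.vocab.vectors.shape),
    }


# A blank pipeline has a tokenizer and language data, but no trained components.
nlp = spacy.blank("en")
info = describe_pipeline(nlp)
print(info)
```

If you swap `spacy.blank("en")` for `spacy.load("en_core_web_sm")` after installing that model, the component list should include the tagger, parser and entity recognizer described above.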
## Linguistic annotations {#annotations}
spaCy provides a variety of linguistic annotations to give you **insights into a
text's grammatical structure**. This includes the word types, like the parts of
speech, and how the words are related to each other. For example, if you're
analyzing text, it makes a huge difference whether a noun is the subject of a
sentence, or the object – or whether "google" is used as a verb, or refers to
the website or company in a specific context.
> #### Loading models
>
> ```bash
> $ python -m spacy download en_core_web_sm
>
> >>> import spacy
> >>> nlp = spacy.load("en_core_web_sm")
> ```
Once you've [downloaded and installed](/usage/models) a model, you can load it
via [`spacy.load()`](/api/top-level#spacy.load). This will return a `Language`
object containing all components and data needed to process text. We usually
call it `nlp`. Calling the `nlp` object on a string of text will return a
processed `Doc`:
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
Even though a `Doc` is processed – e.g. split into individual words and
annotated – it still holds **all information of the original text**, like
whitespace characters. You can always get the offset of a token into the
original string, or reconstruct the original by joining the tokens and their
trailing whitespace. This way, you'll never lose any information when processing
text with spaCy.
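The round trip described above can be checked with a short sketch. It assumes only a blank English pipeline, which tokenizes text without needing a downloaded model, and the example text contains a double space on purpose to show that whitespace is preserved.

```python
import spacy

# A blank pipeline only tokenizes, which is all this check needs.
nlp = spacy.blank("en")
text = "Apple is looking  at buying U.K. startup for $1 billion"
doc = nlp(text)

# token.idx is the character offset of the token in the original string.
offsets_match = all(
    text[token.idx : token.idx + len(token.text)] == token.text for token in doc
)

# Joining each token's text and trailing whitespace restores the input.
reconstructed = "".join(token.text_with_ws for token in doc)

print(offsets_match, reconstructed == text)
```

`token.text_with_ws` is the token's text plus any trailing whitespace, so joining it across the `Doc` should give back the input string exactly.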
### Tokenization {#annotations-token}
import Tokenization101 from 'usage/101/\_tokenization.md'