spaCy/website/docs/usage/spacy-101.md

---
title: 'spaCy 101: Everything you need to know'
teaser: The most important concepts, explained in simple terms
menu:
  - ["What's spaCy?", 'whats-spacy']
  - ['Features', 'features']
  - ['Linguistic Annotations', 'annotations']
  - ['Pipelines', 'pipelines']
  - ['Vocab', 'vocab']
  - ['Serialization', 'serialization']
  - ['Training', 'training']
  - ['Language Data', 'language-data']
  - ['Lightning Tour', 'lightning-tour']
  - ['Architecture', 'architecture']
  - ['Community & FAQ', 'community-faq']
---

Whether you're new to spaCy, or just want to brush up on some NLP basics and
implementation details – this page should have you covered. Each section will
explain one of spaCy's features in simple terms and with examples or
illustrations. Some sections will also reappear across the usage guides as a
quick introduction.

> #### Help us improve the docs
>
> Did you spot a mistake or come across explanations that are unclear? We always
> appreciate improvement
> [suggestions](https://github.com/explosion/spaCy/issues) or
> [pull requests](https://github.com/explosion/spaCy/pulls). You can find a
> "Suggest edits" link at the bottom of each page that points you to the source.

<Infobox title="Take the free interactive course">

[![Advanced NLP with spaCy](../images/course.jpg)](https://course.spacy.io)

In this course you'll learn how to use spaCy to build advanced natural language
understanding systems, using both rule-based and machine learning approaches. It
includes 55 exercises featuring interactive coding practice, multiple-choice
questions and slide decks.

<p><Button to="https://course.spacy.io" variant="primary">Start the course</Button></p>

</Infobox>

## What's spaCy? {#whats-spacy}

<Grid cols={2}>

<div>

spaCy is a **free, open-source library** for advanced **Natural Language
Processing** (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about
it. For example, what's it about? What do the words mean in context? Who is
doing what to whom? What companies and products are mentioned? Which texts are
similar to each other?

spaCy is designed specifically for **production use** and helps you build
applications that process and "understand" large volumes of text. It can be used
to build **information extraction** or **natural language understanding**
systems, or to pre-process text for **deep learning**.

</div>

<Infobox title="Table of contents" id="toc">

- [Features](#features)
- [Linguistic annotations](#annotations)
- [Tokenization](#annotations-token)
- [POS tags and dependencies](#annotations-pos-deps)
- [Named entities](#annotations-ner)
- [Word vectors and similarity](#vectors-similarity)
- [Pipelines](#pipelines)
- [Vocab, hashes and lexemes](#vocab)
- [Serialization](#serialization)
- [Training](#training)
- [Language data](#language-data)
- [Lightning tour](#lightning-tour)
- [Architecture](#architecture)
- [Community & FAQ](#community-faq)

</Infobox>

</Grid>

### What spaCy isn't {#what-spacy-isnt}

- **spaCy is not a platform or "an API"**. Unlike a platform, spaCy does not
  provide a software as a service, or a web application. It's an open-source
  library designed to help you build NLP applications, not a consumable service.

- **spaCy is not an out-of-the-box chat bot engine**. While spaCy can be used to
  power conversational applications, it's not designed specifically for chat
  bots, and only provides the underlying text processing capabilities.

- **spaCy is not research software**. It's built on the latest research, but
  it's designed to get things done. This leads to fairly different design
  decisions than [NLTK](https://github.com/nltk/nltk) or
  [CoreNLP](https://stanfordnlp.github.io/CoreNLP/), which were created as
  platforms for teaching and research. The main difference is that spaCy is
  integrated and opinionated. spaCy tries to avoid asking the user to choose
  between multiple algorithms that deliver equivalent functionality. Keeping the
  menu small lets spaCy deliver generally better performance and developer
  experience.

- **spaCy is not a company**. It's an open-source library. Our company
  publishing spaCy and other software is called
  [Explosion AI](https://explosion.ai).

## Features {#features}

In the documentation, you'll come across mentions of spaCy's features and
capabilities. Some of them refer to linguistic concepts, while others are
related to more general machine learning functionality.

| Name                                  | Description                                                                                                        |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Tokenization**                      | Segmenting text into words, punctuation marks etc.                                                                 |
| **Part-of-speech** (POS) **Tagging**  | Assigning word types to tokens, like verb or noun.                                                                 |
| **Dependency Parsing**                | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| **Lemmatization**                     | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".      |
| **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences.                                                                       |
| **Named Entity Recognition** (NER)    | Labelling named "real-world" objects, like persons, companies or locations.                                        |
| **Entity Linking** (EL)               | Disambiguating textual entities to unique identifiers in a Knowledge Base.                                         |
| **Similarity**                        | Comparing words, text spans and documents and how similar they are to each other.                                  |
| **Text Classification**               | Assigning categories or labels to a whole document, or parts of a document.                                        |
| **Rule-based Matching**               | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.       |
| **Training**                          | Updating and improving a statistical model's predictions.                                                          |
| **Serialization**                     | Saving objects to files or byte strings.                                                                           |

### Statistical models {#statistical-models}

While some of spaCy's features work independently, others require
[statistical models](/models) to be loaded, which enable spaCy to **predict**
linguistic annotations – for example, whether a word is a verb or a noun. spaCy
currently offers statistical models for a variety of languages, which can be
installed as individual Python modules. Models can differ in size, speed, memory
usage, accuracy and the data they include. The model you choose always depends
on your use case and the texts you're working with. For a general-purpose use
case, the small, default models are always a good start. They typically include
the following components:

- **Binary weights** for the part-of-speech tagger, dependency parser and named
  entity recognizer to predict those annotations in context.
- **Lexical entries** in the vocabulary, i.e. words and their
  context-independent attributes like the shape or spelling.
- **Data files** like lemmatization rules and lookup tables.
- **Word vectors**, i.e. multi-dimensional meaning representations of words that
  let you determine how similar they are to each other.
- **Configuration** options, like the language and processing pipeline settings,
  to put spaCy in the correct state when you load in the model.

## Linguistic annotations {#annotations}

spaCy provides a variety of linguistic annotations to give you **insights into a
text's grammatical structure**. This includes the word types, like the parts of
speech, and how the words are related to each other. For example, if you're
analyzing text, it makes a huge difference whether a noun is the subject of a
sentence, or the object – or whether "google" is used as a verb, or refers to
the website or company in a specific context.

> #### Loading models
>
> ```bash
> $ python -m spacy download en_core_web_sm
>
> >>> import spacy
> >>> nlp = spacy.load("en_core_web_sm")
> ```

Once you've [downloaded and installed](/usage/models) a model, you can load it
via [`spacy.load()`](/api/top-level#spacy.load). This will return a `Language`
object containing all components and data needed to process text. We usually
call it `nlp`. Calling the `nlp` object on a string of text will return a
processed `Doc`:

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```

Even though a `Doc` is processed – e.g. split into individual words and
annotated – it still holds **all information of the original text**, like
whitespace characters. You can always get the offset of a token into the
original string, or reconstruct the original by joining the tokens and their
trailing whitespace. This way, you'll never lose any information when processing
text with spaCy.
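
The following is a minimal sketch of that round trip. It assumes a blank English pipeline created with `spacy.blank("en")`, which only contains the tokenizer, so that no model download is needed; the sample text is made up for illustration. It uses `Token.idx` for the character offset and `Token.text_with_ws` for the token text plus its trailing whitespace.

```python
import spacy

# Assumption: a blank pipeline (tokenizer only) is enough to show offsets
nlp = spacy.blank("en")
text = "Hello   world!  This is  a test."
doc = nlp(text)

for token in doc:
    # token.idx is the character offset of the token in the original string
    assert text[token.idx : token.idx + len(token.text)] == token.text
    print(token.idx, repr(token.text), repr(token.whitespace_))

# Joining each token's text and trailing whitespace restores the input
reconstructed = "".join(token.text_with_ws for token in doc)
print(reconstructed == text)
```

Because whitespace is kept on the tokens rather than thrown away, irregular spacing such as the double and triple spaces above survives processing.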

### Tokenization {#annotations-token}

import Tokenization101 from 'usage/101/\_tokenization.md'

<Tokenization101 />

<Infobox title="📖 Tokenization rules">

To learn more about how spaCy's tokenization rules work in detail, how to
**customize and replace** the default tokenizer and how to **add
language-specific data**, see the usage guides on
[adding languages](/usage/adding-languages) and
[customizing the tokenizer](/usage/linguistic-features#tokenization).

</Infobox>

### Part-of-speech tags and dependencies {#annotations-pos-deps model="parser"}

import PosDeps101 from 'usage/101/\_pos-deps.md'

<PosDeps101 />

<Infobox title="📖 Part-of-speech tagging and morphology">

To learn more about **part-of-speech tagging** and rule-based morphology, and
how to **navigate and use the parse tree** effectively, see the usage guides on
[part-of-speech tagging](/usage/linguistic-features#pos-tagging) and
[using the dependency parse](/usage/linguistic-features#dependency-parse).

</Infobox>

### Named Entities {#annotations-ner model="ner"}

import NER101 from 'usage/101/\_named-entities.md'

<NER101 />

<Infobox title="📖 Named Entity Recognition">

To learn more about entity recognition in spaCy, how to **add your own
entities** to a document and how to **train and update** the entity predictions
of a model, see the usage guides on
[named entity recognition](/usage/linguistic-features#named-entities) and
[training the named entity recognizer](/usage/training#ner).

</Infobox>

### Word vectors and similarity {#vectors-similarity model="vectors"}

import Vectors101 from 'usage/101/\_vectors-similarity.md'

<Vectors101 />

<Infobox title="📖 Word vectors">

To learn more about word vectors, how to **customize them** and how to load
**your own vectors** into spaCy, see the usage guide on
[using word vectors and semantic similarities](/usage/vectors-similarity).

</Infobox>

## Pipelines {#pipelines}

import Pipelines101 from 'usage/101/\_pipelines.md'

<Pipelines101 />

<Infobox title="📖 Processing pipelines">

To learn more about **how processing pipelines work** in detail, how to enable
and disable their components, and how to **create your own**, see the usage
guide on [language processing pipelines](/usage/processing-pipelines).

</Infobox>

## Vocab, hashes and lexemes {#vocab}

Whenever possible, spaCy tries to store data in a vocabulary, the
[`Vocab`](/api/vocab), that will be **shared by multiple documents**. To save
memory, spaCy also encodes all strings to **hash values** – for example,
"coffee" has the hash `3197928453018144401`. Entity labels like "ORG"
and part-of-speech tags like "VERB" are also encoded. Internally, spaCy only
"speaks" in hash values.

> - **Token**: A word, punctuation mark etc. _in context_, including its
>   attributes, tags and dependencies.
> - **Lexeme**: A "word type" with no context. Includes the word shape and
>   flags, e.g. if it's lowercase, a digit or punctuation.
> - **Doc**: A processed container of tokens in context.
> - **Vocab**: The collection of lexemes.
> - **StringStore**: The dictionary mapping hash values to strings, for example
>   `3197928453018144401` → "coffee".

![Doc, Vocab, Lexeme and StringStore](../images/vocab_stringstore.svg)

If you process lots of documents containing the word "coffee" in all kinds of
different contexts, storing the exact string "coffee" every time would take up
way too much space. So instead, spaCy hashes the string and stores it in the
[`StringStore`](/api/stringstore). You can think of the `StringStore` as a
**lookup table that works in both directions** – you can look up a string to get
its hash, or a hash to get its string:

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'
```

Now that all strings are encoded, the entries in the vocabulary **don't need to
include the word text** themselves. Instead, they can look it up in the
`StringStore` via its hash value. Each entry in the vocabulary, also called
[`Lexeme`](/api/lexeme), contains the **context-independent** information about
a word. For example, no matter if "love" is used as a verb or a noun in some
context, its spelling and whether it consists of alphabetic characters won't
ever change. Its hash value will also always be the same.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)
```

> - **Text**: The original text of the lexeme.
> - **Orth**: The hash value of the lexeme.
> - **Shape**: The abstract word shape of the lexeme.
> - **Prefix**: By default, the first letter of the word string.
> - **Suffix**: By default, the last three letters of the word string.
> - **is alpha**: Does the lexeme consist of alphabetic characters?
> - **is digit**: Does the lexeme consist of digits?

| Text   | Orth                  | Shape  | Prefix | Suffix | is_alpha | is_digit |
| ------ | --------------------- | ------ | ------ | ------ | -------- | -------- |
| I      | `4690420944186131903` | `X`    | I      | I      | `True`   | `False`  |
| love   | `3702023516439754181` | `xxxx` | l      | ove    | `True`   | `False`  |
| coffee | `3197928453018144401` | `xxxx` | c      | fee    | `True`   | `False`  |

The mapping of words to hashes doesn't depend on any state. To make sure each
value is unique, spaCy uses a
[hash function](https://en.wikipedia.org/wiki/Hash_function) to calculate the
hash **based on the word string**. This also means that the hash for "coffee"
will always be the same, no matter which model you're using or how you've
configured spaCy.
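
As a small sketch of this, the snippet below creates two independent `StringStore` objects with different contents and adds the same string to both. The extra strings "tea" and "milk" are made up for illustration. No model is loaded, so the result can only come from the string itself.

```python
from spacy.strings import StringStore

# Two unrelated stores with different contents
store_a = StringStore()
store_b = StringStore(["tea", "milk"])

# add() returns the hash of the string that was added
hash_a = store_a.add("coffee")
hash_b = store_b.add("coffee")

print(hash_a == hash_b)
print(store_a[hash_a], store_b[hash_b])
```

Both stores report the same hash for "coffee", even though they share no state.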

However, hashes **cannot be reversed** and there's no way to resolve
`3197928453018144401` back to "coffee". All spaCy can do is look it up in the
vocabulary. That's why you always need to make sure all objects you create have
access to the same vocabulary. If they don't, spaCy might not be able to find
the strings it needs.

```python
### {executable="true"}
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")  # Original Doc
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee' 👍

empty_doc = Doc(Vocab())  # New Doc with empty Vocab
# empty_doc.vocab.strings[3197928453018144401] will raise an error :(

empty_doc.vocab.strings.add("coffee")  # Add "coffee" and generate hash
print(empty_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍

new_doc = Doc(doc.vocab)  # Create new doc with first doc's vocab
print(new_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍
```

If the vocabulary doesn't contain a string for `3197928453018144401`, spaCy will
raise an error. You can re-add "coffee" manually, but this only works if you
actually _know_ that the document contains that word. To prevent this problem,
spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This
will give you the object and its encoded annotations, plus the "key" to decode
it.
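
The sketch below illustrates both points under a few assumptions: it uses a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that no model is required, and it serializes to a byte string with `Doc.to_bytes` instead of writing to disk. First, a lookup in an empty `StringStore` fails. Then a `Doc` is restored into a new `Doc` with an empty `Vocab`, after which the strings used in the text can be resolved again.

```python
import spacy
from spacy.strings import StringStore
from spacy.tokens import Doc
from spacy.vocab import Vocab

COFFEE_HASH = 3197928453018144401  # hash of "coffee" as quoted above

# An empty StringStore has no entry for the hash, so the lookup fails
lookup_failed = False
try:
    StringStore()[COFFEE_HASH]
except KeyError:
    lookup_failed = True
print("Lookup in empty store failed:", lookup_failed)

# Serialize a Doc and restore it into a Doc with a fresh, empty Vocab
nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp("I love coffee")
coffee_hash = doc.vocab.strings["coffee"]
new_doc = Doc(Vocab()).from_bytes(doc.to_bytes())

print(new_doc.text)
print(new_doc.vocab.strings[coffee_hash])
```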

## Knowledge Base {#kb}

To support the entity linking task, spaCy stores external knowledge in a
[`KnowledgeBase`](/api/kb). The knowledge base (KB) uses the `Vocab` to store
its data efficiently.

> - **Mention**: A textual occurrence of a named entity, e.g. 'Miss Lovelace'.
> - **KB ID**: A unique identifier referring to a particular real-world concept,
>   e.g. 'Q7259'.
> - **Alias**: A plausible synonym or description for a certain KB ID, e.g. 'Ada
>   Lovelace'.
> - **Prior probability**: The probability of a certain mention resolving to a
>   certain KB ID, prior to knowing anything about the context in which the
>   mention is used.
> - **Entity vector**: A pretrained word vector capturing the entity
>   description.

A knowledge base is created by first adding all entities to it. Next, for each
potential mention or alias, a list of relevant KB IDs and their prior
probabilities is added. The sum of these prior probabilities should never exceed
1 for any given alias.

```python
### {executable="true"}
import spacy
from spacy.kb import KnowledgeBase

# load the model and create an empty KB
nlp = spacy.load('en_core_web_sm')
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# adding entities
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2])

# adding aliases
kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

print()
print("Number of entities in KB:", kb.get_size_entities())  # 3
print("Number of aliases in KB:", kb.get_size_aliases()) # 2
```

### Candidate generation

Given a textual entity, the Knowledge Base can provide a list of plausible
candidates or entity identifiers. The [`EntityLinker`](/api/entitylinker) will
take this list of candidates as input, and disambiguate the mention to the most
probable identifier, given the document context.

```python
### {executable="true"}
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load('en_core_web_sm')
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# adding entities
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2])

# adding aliases
kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])

candidates = kb.get_candidates("Douglas")
for c in candidates:
    print(" ", c.entity_, c.prior_prob, c.entity_vector)
```

## Serialization {#serialization}

import Serialization101 from 'usage/101/\_serialization.md'

<Serialization101 />

<Infobox title="📖 Saving and loading">

To learn more about how to **save and load your own models**, see the usage
guide on [saving and loading](/usage/saving-loading#models).

</Infobox>

## Training {#training}

import Training101 from 'usage/101/\_training.md'

<Training101 />

<Infobox title="📖 Training statistical models">

To learn more about **training and updating** models, how to create training
data and how to improve spaCy's named entity recognition models, see the usage
guides on [training](/usage/training).

</Infobox>

## Language data {#language-data}

import LanguageData101 from 'usage/101/\_language-data.md'

<LanguageData101 />

<Infobox title="📖 Language data">

To learn more about the individual components of the language data and how to
**add a new language** to spaCy in preparation for training a language model,
see the usage guide on [adding languages](/usage/adding-languages).

</Infobox>

## Lightning tour {#lightning-tour}

The following examples and code snippets give you an overview of spaCy's
functionality and its usage.

### Install models and process text {#lightning-tour-models}

```bash
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
```

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world. Here are two sentences.")
print([t.text for t in doc])

nlp_de = spacy.load("de_core_news_sm")
doc_de = nlp_de("Ich bin ein Berliner.")
print([t.text for t in doc_de])

```

<Infobox>

**API:** [`spacy.load()`](/api/top-level#spacy.load) **Usage:**
[Models](/usage/models), [spaCy 101](/usage/spacy-101)

</Infobox>

### Get tokens, noun chunks & sentences {#lightning-tour-tokens-sentences model="parser"}

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Peach emoji is where it has always been. Peach is the superior "
          "emoji. It's outranking eggplant 🍑 ")
print(doc[0].text)          # 'Peach'
print(doc[1].text)          # 'emoji'
print(doc[-1].text)         # '🍑'
print(doc[17:19].text)      # 'outranking eggplant'

noun_chunks = list(doc.noun_chunks)
print(noun_chunks[0].text)  # 'Peach emoji'

sentences = list(doc.sents)
assert len(sentences) == 3
print(sentences[1].text)    # 'Peach is the superior emoji.'
```

<Infobox>

**API:** [`Doc`](/api/doc), [`Token`](/api/token) **Usage:**
[spaCy 101](/usage/spacy-101)

</Infobox>

### Get part-of-speech tags and flags {#lightning-tour-pos-tags model="tagger"}

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
apple = doc[0]
print("Coarse-grained POS tag", apple.pos_, apple.pos)
print("Fine-grained POS tag", apple.tag_, apple.tag)
print("Word shape", apple.shape_, apple.shape)
print("Alphabetic characters?", apple.is_alpha)
print("Punctuation mark?", apple.is_punct)

billion = doc[10]
print("Digit?", billion.is_digit)
print("Like a number?", billion.like_num)
print("Like an email address?", billion.like_email)
```

<Infobox>

**API:** [`Token`](/api/token) **Usage:**
[Part-of-speech tagging](/usage/linguistic-features#pos-tagging)

</Infobox>

### Use hash values for any string {#lightning-tour-hashes}

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")

coffee_hash = nlp.vocab.strings["coffee"]  # 3197928453018144401
coffee_text = nlp.vocab.strings[coffee_hash]  # 'coffee'
print(coffee_hash, coffee_text)
print(doc[2].orth, coffee_hash)  # 3197928453018144401
print(doc[2].text, coffee_text)  # 'coffee'

beer_hash = doc.vocab.strings.add("beer")  # 3073001599257881079
beer_text = doc.vocab.strings[beer_hash]  # 'beer'
print(beer_hash, beer_text)

unicorn_hash = doc.vocab.strings.add("🦄")  # 18234233413267120783
unicorn_text = doc.vocab.strings[unicorn_hash]  # '🦄'
print(unicorn_hash, unicorn_text)
```

<Infobox>

**API:** [`StringStore`](/api/stringstore) **Usage:**
[Vocab, hashes and lexemes 101](/usage/spacy-101#vocab)

</Infobox>

### Recognize and update named entities {#lightning-tour-entities model="ner"}

```python
### {executable="true"}
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

doc = nlp("FB is hiring a new VP of global policy")
doc.ents = [Span(doc, 0, 1, label="ORG")]
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

<Infobox>

**Usage:** [Named entity recognition](/usage/linguistic-features#named-entities)

</Infobox>

### Train and update neural network models {#lightning-tour-training}

```python
import spacy
import random

nlp = spacy.load("en_core_web_sm")
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]

with nlp.select_pipes(enable="ner"):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")
```

<Infobox>

**API:** [`Language.update`](/api/language#update) **Usage:**
[Training spaCy's statistical models](/usage/training)

</Infobox>

### Visualize a dependency parse and named entities in your browser {#lightning-tour-displacy model="parser, ner" new="2"}

> #### Output
>
> ![displaCy visualization](../images/displacy-small.svg)

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

doc_dep = nlp("This is a sentence.")
displacy.serve(doc_dep, style="dep")

doc_ent = nlp("When Sebastian Thrun started working on self-driving cars at Google "
              "in 2007, few people outside of the company took him seriously.")
displacy.serve(doc_ent, style="ent")
```

<Infobox>

**API:** [`displacy`](/api/top-level#displacy) **Usage:**
[Visualizers](/usage/visualizers)

</Infobox>

### Get word vectors and similarity {#lightning-tour-word-vectors model="vectors"}

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("Apple and banana are similar. Pasta and hippo aren't.")

apple = doc[0]
banana = doc[2]
pasta = doc[6]
hippo = doc[8]

print("apple <-> banana", apple.similarity(banana))
print("pasta <-> hippo", pasta.similarity(hippo))
print(apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector)
```

For the best results, you should run this example using the
[`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg) model (currently
not available in the live demo).

<Infobox>

**Usage:** [Word vectors and similarity](/usage/vectors-similarity)

</Infobox>

### Simple and efficient serialization {#lightning-tour-serialization}

```python
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load("en_core_web_sm")
with open("customer_feedback_627.txt", encoding="utf8") as f:
    customer_feedback = f.read()
doc = nlp(customer_feedback)
doc.to_disk("/tmp/customer_feedback_627.bin")

new_doc = Doc(Vocab()).from_disk("/tmp/customer_feedback_627.bin")
```

<Infobox>

**API:** [`Language`](/api/language), [`Doc`](/api/doc) **Usage:**
[Saving and loading models](/usage/saving-loading#models)

</Infobox>

### Match text with token rules {#lightning-tour-rule-matcher}

```python
### {executable="true"}
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

def set_sentiment(matcher, doc, i, matches):
    doc.sentiment += 0.1

patterns1 = [[{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]]
patterns2 = [[{"ORTH": emoji, "OP": "+"}] for emoji in ["😀", "😂", "🤣", "😍"]]
matcher.add("GoogleIO", patterns1)  # Match "Google I/O" (ORTH is case-sensitive)
matcher.add("HAPPY", patterns2, on_match=set_sentiment)  # Match one or more happy emoji

doc = nlp("A text about Google I/O 😀😀")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(string_id, span.text)
print("Sentiment", doc.sentiment)
```

<Infobox>

**API:** [`Matcher`](/api/matcher) **Usage:**
[Rule-based matching](/usage/rule-based-matching)

</Infobox>

### Minibatched stream processing {#lightning-tour-minibatched}

```python
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["One document.", "...", "Lots of documents"]
# .pipe streams input, and produces streaming output
iter_texts = (texts[i % 3] for i in range(100000000))
for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50)):
    assert doc.is_parsed
    if i == 100:
        break
```

### Get syntactic dependencies {#lightning-tour-dependencies model="parser"}

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("When Sebastian Thrun started working on self-driving cars at Google "
          "in 2007, few people outside of the company took him seriously.")

dep_labels = []
for token in doc:
    while token.head != token:
        dep_labels.append(token.dep_)
        token = token.head
print(dep_labels)
```

<Infobox>

**API:** [`Token`](/api/token) **Usage:**
[Using the dependency parse](/usage/linguistic-features#dependency-parse)

</Infobox>

### Export to numpy arrays {#lightning-tour-numpy-arrays}

```python
### {executable="true"}
import spacy
from spacy.attrs import ORTH, LIKE_URL

nlp = spacy.load("en_core_web_sm")
doc = nlp("Check out https://spacy.io")
for token in doc:
    print(token.text, token.orth, token.like_url)

attr_ids = [ORTH, LIKE_URL]
doc_array = doc.to_array(attr_ids)
print(doc_array.shape)
print(len(doc), len(attr_ids))

assert doc[0].orth == doc_array[0, 0]
assert doc[1].orth == doc_array[1, 0]
assert doc[0].like_url == doc_array[0, 1]

assert list(doc_array[:, 1]) == [t.like_url for t in doc]
print(list(doc_array[:, 1]))
```

### Calculate inline markup on original string {#lightning-tour-inline}

```python
### {executable="true"}
import spacy

def put_spans_around_tokens(doc):
    """Here, we're building a custom "syntax highlighter" for
    part-of-speech tags and dependencies. We put each token in a
    span element, with the appropriate classes computed. All whitespace is
    preserved, outside of the spans. (Of course, HTML will only display
    multiple whitespace if enabled – but the point is, no information is lost
    and you can calculate what you need, e.g. <br />, <p> etc.)
    """
    output = []
    for token in doc:
        if token.is_space:
            output.append(token.text)
        else:
            classes = f"pos-{token.pos_} dep-{token.dep_}"
            output.append(f'<span class="{classes}">{token.text}</span>{token.whitespace_}')
    string = "".join(output)
    string = string.replace("\\n", "")
    string = string.replace("\\t", "    ")
    return f"<pre>{string}</pre>"


nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a test.\\n\\nHello   world.")
html = put_spans_around_tokens(doc)
print(html)
```

## Architecture {#architecture}

import Architecture101 from 'usage/101/\_architecture.md'

<Architecture101 />

## Community & FAQ {#community-faq}

We're very happy to see the spaCy community grow and include a mix of people
from all kinds of different backgrounds – computational linguistics, data
science, deep learning, research and more. If you'd like to get involved, below
are some answers to the most important questions and resources for further
reading.

### Help, my code isn't working! {#faq-help-code}

Bugs suck, and we're doing our best to continuously improve the tests and fix
bugs as soon as possible. Before you submit an issue, do a quick search and
check if the problem has already been reported. If you're having installation or
loading problems, make sure to also check out the
[troubleshooting guide](/usage/#troubleshooting). Help with spaCy is available
via the following platforms:

> #### How do I know if something is a bug?
>
> Of course, it's always hard to know for sure, so don't worry – we're not going
> to be mad if a bug report turns out to be a typo in your code. As a simple
> rule, any C-level error without a Python traceback, like a **segmentation
> fault** or **memory error**, is **always** a spaCy bug.
>
> Because models are statistical, their performance will never be _perfect_.
> However, if you come across **patterns that might indicate an underlying
> issue**, please do file a report. Similarly, we also care about behaviors that
> **contradict our docs**.

- [Stack Overflow](https://stackoverflow.com/questions/tagged/spacy): **Usage
  questions** and everything related to problems with your specific code. The
  Stack Overflow community is much larger than ours, so if your problem can be
  solved by others, you'll receive help much more quickly.
- [Gitter chat](https://gitter.im/explosion/spaCy): **General discussion** about
  spaCy, meeting other community members and exchanging **tips, tricks and best
  practices**.
- [GitHub issue tracker](https://github.com/explosion/spaCy/issues): **Bug
  reports** and **improvement suggestions**, i.e. everything that's likely
  spaCy's fault. This also includes problems with the models beyond statistical
  imprecisions, like patterns that point to a bug.

<Infobox title="Important note" variant="warning">

Please understand that we won't be able to provide individual support via email.
We also believe that help is much more valuable if it's shared publicly, so that
**more people can benefit from it**. If you come across an issue and you think
you might be able to help, consider posting a quick update with your solution.
No matter how simple, it can easily save someone a lot of time and headache –
and the next time you need help, they might repay the favor.

</Infobox>

### How can I contribute to spaCy? {#faq-contributing}

You don't have to be an NLP expert or Python pro to contribute, and we're happy
to help you get started. If you're new to spaCy, a good place to start is the
[`help wanted (easy)` label](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A"help+wanted+%28easy%29")
on GitHub, which we use to tag bugs and feature requests that are easy and
self-contained. We also appreciate contributions to the docs – whether it's
fixing a typo, improving an example or adding additional explanations. You'll
find a "Suggest edits" link at the bottom of each page that points you to the
source.

Another way of getting involved is to help us improve the
[language data](/usage/adding-languages#language-data) – especially if you
happen to speak one of the languages currently in
[alpha support](/usage/models#languages). Even adding simple tokenizer
exceptions, stop words or lemmatizer data can make a big difference. It will
also make it easier for us to provide a statistical model for the language in
the future. Submitting a test that documents a bug or performance issue, or
covers functionality that's especially important for your application, is also
very helpful. This way, you'll also make sure we never accidentally introduce
regressions to the parts of the library that you care about the most.

**For more details on the types of contributions we're looking for, the code
conventions and other useful tips, make sure to check out the
[contributing guidelines](https://github.com/explosion/spaCy/tree/master/CONTRIBUTING.md).**

<Infobox title="Code of Conduct" variant="warning">

spaCy adheres to the
[Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/).
By participating, you are expected to uphold this code.

</Infobox>

### I've built something cool with spaCy – how can I get the word out? {#faq-project-with-spacy}

First, congrats – we'd love to check it out! When you share your project on
Twitter, don't forget to tag [@spacy_io](https://twitter.com/spacy_io) so we
don't miss it. If you think your project would be a good fit for the
[spaCy Universe](/universe), **feel free to submit it!** Tutorials are also
incredibly valuable to other users and a great way to get exposure. So we
strongly encourage **writing up your experiences**, or sharing your code and
some tips and tricks on your blog. Since our website is open-source, you can add
your project or tutorial by making a pull request on GitHub.

If you would like to use the spaCy logo on your site, please get in touch and
ask us first. However, if you want to show support and tell others that your
project is using spaCy, you can grab one of our **spaCy badges** here:

<img src={`https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg`} />

```markdown
[![Built with spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg)](https://spacy.io)
```

<img src={`https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg`}
/>

```markdown
[![Made with ❤ and spaCy](https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg)](https://spacy.io)
```
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("I love coffee")
 								print(doc.vocab.strings["coffee"])  # 3197928453018144401
 								print(doc.vocab.strings[3197928453018144401])  # 'coffee'
 								```
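To make the idea of a two-way lookup table concrete, here is a minimal standard-library sketch. It is a toy, not spaCy's implementation: `toy_hash` and `ToyStringStore` are hypothetical names, and the stand-in hash is built on `hashlib`, so its values will not match the hashes spaCy produces.

```python
import hashlib


def toy_hash(string):
    """Stand-in 64-bit hash (an assumption, not spaCy's hash function)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Toy table that resolves strings to hashes and hashes to strings."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        # Store the string under its hash and return the hash.
        key = toy_hash(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash: computed directly from the string.
            return toy_hash(key)
        # Hash -> string: only works if the string was added before.
        return self._by_hash[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store["coffee"] == coffee_hash)  # True
print(store[coffee_hash])  # coffee
```

Looking up a hash that was never added raises a `KeyError` in this sketch, which mirrors why a shared vocabulary matters.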
 								Now that all strings are encoded, the entries in the vocabulary **don't need to
 								include the word text** themselves. Instead, they can look it up in the
 								`StringStore` via its hash value. Each entry in the vocabulary, also called a
 								[`Lexeme`](/api/lexeme), contains the **context-independent** information about
 								a word. For example, no matter if "love" is used as a verb or a noun in some
 								context, its spelling and whether it consists of alphabetic characters won't
 								ever change. Its hash value will also always be the same.
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("I love coffee")
 								for word in doc:
 								    lexeme = doc.vocab[word.text]
 								    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
 								            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)
 								```
 								> - **Text**: The original text of the lexeme.
 								> - **Orth**: The hash value of the lexeme.
 								> - **Shape**: The abstract word shape of the lexeme.
 								> - **Prefix**: By default, the first letter of the word string.
 								> - **Suffix**: By default, the last three letters of the word string.
 								> - **is alpha**: Does the lexeme consist of alphabetic characters?
 								> - **is digit**: Does the lexeme consist of digits?
 								| Text   | Orth                  | Shape  | Prefix | Suffix | is_alpha | is_digit |
 								| ------ | --------------------- | ------ | ------ | ------ | -------- | -------- |
 								| I      | `4690420944186131903` | `X`    | I      | I      | `True`   | `False`  |
 								| love   | `3702023516439754181` | `xxxx` | l      | ove    | `True`   | `False`  |
 								| coffee | `3197928453018144401` | `xxxx` | c      | fee    | `True`   | `False`  |
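The context-independent attributes in the table can be approximated from the word string alone. The following is a rough sketch under stated assumptions: `toy_shape` and `toy_lexeme_attrs` are hypothetical helpers, and the rule of cutting runs of the same shape character after four is an assumption chosen to reproduce the `xxxx` shape shown for "coffee". It is not spaCy's own code.

```python
def toy_shape(text):
    """Map letters to x/X and digits to d, truncating long runs (assumed rule)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # Keep at most four identical shape characters in a row.
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)


def toy_lexeme_attrs(text):
    """Collect attributes that depend only on the string, not on context."""
    return {
        "text": text,
        "shape": toy_shape(text),
        "prefix": text[:1],   # first letter
        "suffix": text[-3:],  # last three letters
        "is_alpha": text.isalpha(),
        "is_digit": text.isdigit(),
    }


for word in ["I", "love", "coffee"]:
    print(toy_lexeme_attrs(word))
```

Because none of these values depend on the surrounding sentence, they only need to be computed once per word type.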
 								The mapping of words to hashes doesn't depend on any state. To make sure each
 								value is unique, spaCy uses a
 								[hash function](https://en.wikipedia.org/wiki/Hash_function) to calculate the
 								hash **based on the word string**. This also means that the hash for "coffee"
 								will always be the same, no matter which model you're using or how you've
 								configured spaCy.
 								However, hashes **cannot be reversed** and there's no way to resolve
 								`3197928453018144401` back to "coffee". All spaCy can do is look it up in the
 								vocabulary. That's why you always need to make sure all objects you create have
 								access to the same vocabulary. If they don't, spaCy might not be able to find
 								the strings it needs.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.tokens import Doc
 								from spacy.vocab import Vocab
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("I love coffee")  # Original Doc
 								print(doc.vocab.strings["coffee"])  # 3197928453018144401
 								print(doc.vocab.strings[3197928453018144401])  # 'coffee' 👍
 								empty_doc = Doc(Vocab())  # New Doc with empty Vocab
 								# empty_doc.vocab.strings[3197928453018144401] will raise an error :(
 								empty_doc.vocab.strings.add("coffee")  # Add "coffee" and generate hash
 								print(empty_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍
 								new_doc = Doc(doc.vocab)  # Create new doc with first doc's vocab
 								print(new_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍
 								```
 								If the vocabulary doesn't contain a string for `3197928453018144401`, spaCy will
 								raise an error. You can re-add "coffee" manually, but this only works if you
 								actually _know_ that the document contains that word. To prevent this problem,
 								spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This
 								will give you the object and its encoded annotations, plus the "key" to decode
 								it.
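The idea of shipping the "key" together with the encoded data can be illustrated with a small standard-library sketch. Everything here is hypothetical (`toy_hash`, `toy_export`, `toy_import` and the JSON layout are assumptions for illustration) and says nothing about spaCy's actual serialization format.

```python
import hashlib
import json


def toy_hash(string):
    """Stand-in 64-bit hash (an assumption, not spaCy's hash function)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def toy_export(words):
    """Encode words as hashes and bundle the strings needed to decode them."""
    payload = {
        "tokens": [toy_hash(word) for word in words],
        "strings": sorted(set(words)),
    }
    return json.dumps(payload)


def toy_import(data):
    """Rebuild the lookup table from the bundled strings, then decode."""
    payload = json.loads(data)
    table = {toy_hash(string): string for string in payload["strings"]}
    return [table[key] for key in payload["tokens"]]


restored = toy_import(toy_export(["I", "love", "coffee"]))
print(restored)  # ['I', 'love', 'coffee']
```

Without the `"strings"` entry, the list of hashes alone could not be turned back into text.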
 								## Knowledge Base {#kb}
 								To support the entity linking task, spaCy stores external knowledge in a
 								[`KnowledgeBase`](/api/kb). The knowledge base (KB) uses the `Vocab` to store
 								its data efficiently.
 								> - **Mention**: A textual occurrence of a named entity, e.g. 'Miss Lovelace'.
 								> - **KB ID**: A unique identifier referring to a particular real-world concept,
 								>   e.g. 'Q7259'.
 								> - **Alias**: A plausible synonym or description for a certain KB ID, e.g. 'Ada
 								>   Lovelace'.
 								> - **Prior probability**: The probability of a certain mention resolving to a
 								>   certain KB ID, prior to knowing anything about the context in which the
 								>   mention is used.
 								> - **Entity vector**: A pretrained word vector capturing the entity
 								>   description.
 								A knowledge base is created by first adding all entities to it. Next, for each
 								potential mention or alias, a list of relevant KB IDs and their prior
 								probabilities is added. The sum of these prior probabilities should never exceed
 								1 for any given alias.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.kb import KnowledgeBase
 								# load the model and create an empty KB
 								nlp = spacy.load('en_core_web_sm')
 								kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
 								# adding entities
 								kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
 								kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
 								kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2])
 								# adding aliases
 								kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])
 								kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
 								print()
 								print("Number of entities in KB:", kb.get_size_entities())  # 3
 								print("Number of aliases in KB:", kb.get_size_aliases())  # 2
 								```
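The constraint that prior probabilities for one alias should not sum to more than 1 can be checked before any alias is added. The helper below is a hypothetical pre-check written for illustration; `check_alias_priors` and its small floating-point tolerance are assumptions and are not part of the `KnowledgeBase` API.

```python
import math


def check_alias_priors(entities, probabilities, tolerance=1e-9):
    """Validate one alias entry and return the summed prior probability."""
    if len(entities) != len(probabilities):
        raise ValueError("Need exactly one prior probability per entity")
    total = math.fsum(probabilities)
    # Allow a tiny tolerance for floating-point rounding.
    if total > 1.0 + tolerance:
        raise ValueError(f"Prior probabilities sum to {total}, which exceeds 1")
    return total


total = check_alias_priors(["Q1004791", "Q42", "Q5301561"], [0.6, 0.1, 0.2])
print(round(total, 2))  # 0.9

try:
    check_alias_priors(["Q42", "Q5301561"], [0.8, 0.5])
except ValueError as err:
    print("Rejected:", err)
```

A sum below 1 is fine: the remaining probability mass simply stays unassigned, as with the "Douglas" alias in the example above.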
 								### Candidate generation
 								Given a textual entity, the Knowledge Base can provide a list of plausible
 								candidates or entity identifiers. The [`EntityLinker`](/api/entitylinker) will
 								take this list of candidates as input, and disambiguate the mention to the most
 								probable identifier, given the document context.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.kb import KnowledgeBase
 								nlp = spacy.load('en_core_web_sm')
 								kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
 								# adding entities
 								kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
 								kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
 								kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2])
 								# adding aliases
 								kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])
 								candidates = kb.get_candidates("Douglas")
 								for c in candidates:
 								    print(" ", c.entity_, c.prior_prob, c.entity_vector)
 								```
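As a rough mental model of candidate generation, the sketch below keeps aliases in a plain dictionary and picks the candidate with the highest prior probability. The names `ToyCandidate`, `ALIASES`, `toy_get_candidates` and `most_probable_by_prior` are hypothetical, and ranking by prior alone is a simplification: as described above, the `EntityLinker` also takes the document context into account.

```python
from collections import namedtuple

ToyCandidate = namedtuple("ToyCandidate", ["entity", "prior_prob"])

# Toy alias table using the same values as the examples above.
ALIASES = {
    "Douglas": [("Q1004791", 0.6), ("Q42", 0.1), ("Q5301561", 0.2)],
    "Douglas Adams": [("Q42", 0.9)],
}


def toy_get_candidates(alias):
    """Return all candidates for an alias, or an empty list if unknown."""
    return [ToyCandidate(entity, prob) for entity, prob in ALIASES.get(alias, [])]


def most_probable_by_prior(alias):
    """Pick the candidate with the highest prior, ignoring any context."""
    candidates = toy_get_candidates(alias)
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate.prior_prob)


print(most_probable_by_prior("Douglas").entity)  # Q1004791
print(most_probable_by_prior("Arthur"))  # None
```

In a sentence about the author of a science fiction novel, context would likely favor "Q42" even though its prior for the bare alias "Douglas" is low, which is why the prior is only a starting point.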
 								## Serialization {#serialization}
 								import Serialization101 from 'usage/101/\_serialization.md'
 								<Serialization101 />
 								<Infobox title="📖 Saving and loading">
 								To learn more about how to **save and load your own models**, see the usage
 								guide on [saving and loading](/usage/saving-loading#models).
 								</Infobox>
 								## Training {#training}
 								import Training101 from 'usage/101/\_training.md'
 								<Training101 />
 								<Infobox title="📖 Training statistical models">
 								To learn more about **training and updating** models, how to create training
 								data and how to improve spaCy's named entity recognition models, see the usage
 								guide on [training](/usage/training).
 								</Infobox>
 								## Language data {#language-data}
 								import LanguageData101 from 'usage/101/\_language-data.md'
 								<LanguageData101 />
 								<Infobox title="📖 Language data">
 								To learn more about the individual components of the language data and how to
 								**add a new language** to spaCy in preparation for training a language model,
 								see the usage guide on [adding languages](/usage/adding-languages).
 								</Infobox>
 								## Lightning tour {#lightning-tour}
 								The following examples and code snippets give you an overview of spaCy's
 								functionality and its usage.
 								### Install models and process text {#lightning-tour-models}
 								```bash
 								python -m spacy download en_core_web_sm
 								python -m spacy download de_core_news_sm
 								```
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Hello, world. Here are two sentences.")
 								print([t.text for t in doc])
 								nlp_de = spacy.load("de_core_news_sm")
 								doc_de = nlp_de("Ich bin ein Berliner.")
 								print([t.text for t in doc_de])
 								```
 								<Infobox>
 								**API:** [`spacy.load()`](/api/top-level#spacy.load) **Usage:**
 								[Models](/usage/models), [spaCy 101](/usage/spacy-101)
 								</Infobox>
 								### Get tokens, noun chunks & sentences {#lightning-tour-tokens-sentences model="parser"}
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Peach emoji is where it has always been. Peach is the superior "
 								          "emoji. It's outranking eggplant 🍑 ")
 								print(doc[0].text)          # 'Peach'
 								print(doc[1].text)          # 'emoji'
 								print(doc[-1].text)         # '🍑'
 								print(doc[17:19].text)      # 'outranking eggplant'
 								noun_chunks = list(doc.noun_chunks)
 								print(noun_chunks[0].text)  # 'Peach emoji'
 								sentences = list(doc.sents)
 								assert len(sentences) == 3
 								print(sentences[1].text)    # 'Peach is the superior emoji.'
 								```
 								<Infobox>
 								**API:** [`Doc`](/api/doc), [`Token`](/api/token) **Usage:**
 								[spaCy 101](/usage/spacy-101)
 								</Infobox>

### Get part-of-speech tags and flags {#lightning-tour-pos-tags model="tagger"}

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
apple = doc[0]
# pos_ holds the coarse-grained tag, tag_ the fine-grained tag
print("Coarse-grained POS tag", apple.pos_, apple.pos)
print("Fine-grained POS tag", apple.tag_, apple.tag)
print("Word shape", apple.shape_, apple.shape)
print("Alphabetic characters?", apple.is_alpha)
print("Punctuation mark?", apple.is_punct)

billion = doc[10]
print("Digit?", billion.is_digit)
print("Like a number?", billion.like_num)
print("Like an email address?", billion.like_email)
```

<Infobox>

**API:** [`Token`](/api/token) **Usage:**
[Part-of-speech tagging](/usage/linguistic-features#pos-tagging)

</Infobox>
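
The `Token` API reference describes `is_alpha` and `is_digit` as equivalent to calling `str.isalpha()` and `str.isdigit()` on the token text. The following model-free sketch uses only those built-in string methods to show why a word like "billion" is alphabetic but not a digit. The `words` list is made up for illustration, and `like_num` is not reproduced here because it relies on language-specific rules.

```python
# Hand-picked strings standing in for token texts (illustrative only)
words = ["Apple", "billion", "1", "$"]

# Map each string to (is alphabetic?, is digit?) using built-in str methods
flags = {text: (text.isalpha(), text.isdigit()) for text in words}

for text, (alpha, digit) in flags.items():
    print(text, "alphabetic:", alpha, "digit:", digit)
```
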

### Use hash values for any string {#lightning-tour-hashes}

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")

coffee_hash = nlp.vocab.strings["coffee"]  # 3197928453018144401
coffee_text = nlp.vocab.strings[coffee_hash]  # 'coffee'
print(coffee_hash, coffee_text)
print(doc[2].orth, coffee_hash)  # 3197928453018144401
print(doc[2].text, coffee_text)  # 'coffee'

beer_hash = doc.vocab.strings.add("beer")  # 3073001599257881079
beer_text = doc.vocab.strings[beer_hash]  # 'beer'
print(beer_hash, beer_text)

unicorn_hash = doc.vocab.strings.add("🦄")  # 18234233413267120783
unicorn_text = doc.vocab.strings[unicorn_hash]  # '🦄'
print(unicorn_hash, unicorn_text)
```

<Infobox>

**API:** [`StringStore`](/api/stringstore) **Usage:**
[Vocab, hashes and lexemes 101](/usage/spacy-101#vocab)

</Infobox>
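
To make the two-way lookup easier to picture, here is a minimal pure-Python sketch of a string store. It is not spaCy's implementation: the class name `ToyStringStore` is hypothetical, and the hash is a truncated SHA-256 from the standard library, chosen only so the example runs without extra dependencies. The hash values it produces will therefore differ from the ones shown above.

```python
import hashlib


class ToyStringStore:
    """Hypothetical two-way lookup table between strings and 64-bit hashes."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(text):
        # Stand-in hash: first 8 bytes of SHA-256, read as an unsigned integer
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        # Remember the string and return its hash
        key = self.hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        # A string is mapped to its hash; a hash is mapped back to a stored string
        if isinstance(key, str):
            return self.hash_string(key)
        return self._by_hash[key]


store = ToyStringStore()
beer_hash = store.add("beer")
print(store[beer_hash])            # beer
print(store["beer"] == beer_hash)  # True
```

Because the hash is computed from the string alone, the same string always yields the same ID, while going from an ID back to text only works for strings that were added to the store.
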

### Recognize and update named entities {#lightning-tour-entities model="ner"}

```python
### {executable="true"}
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

doc = nlp("FB is hiring a new VP of global policy")
doc.ents = [Span(doc, 0, 1, label="ORG")]
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

<Infobox>

**Usage:** [Named entity recognition](/usage/linguistic-features#named-entities)

</Infobox>

### Train and update neural network models {#lightning-tour-training}

```python
import spacy
import random

nlp = spacy.load("en_core_web_sm")
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]

with nlp.select_pipes(enable="ner"):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")
```

<Infobox>

**API:** [`Language.update`](/api/language#update) **Usage:**
[Training spaCy's statistical models](/usage/training)

</Infobox>
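
The entity annotations in `train_data` are character offsets into the text, so `(0, 4, "ORG")` covers the characters `text[0:4]`. As a rough sketch, the offsets can be computed instead of counted by hand. The helper `entity_offsets` below is hypothetical and not part of spaCy; it only finds the first occurrence of a phrase.

```python
def entity_offsets(text, phrase, label):
    """Return (start_char, end_char, label) for the first match of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)


text = "Uber blew through $1 million"
entity = entity_offsets(text, "Uber", "ORG")
print(entity)                      # (0, 4, 'ORG')
print(text[entity[0]:entity[1]])   # Uber

# Same structure as the training example above
train_data = [(text, {"entities": [entity]})]
```

Slicing the text with the computed offsets is a quick sanity check that an annotation points at the intended span.
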

### Visualize a dependency parse and named entities in your browser {#lightning-tour-displacy model="parser, ner" new="2"}

> #### Output
>
> ![displaCy visualization](../images/displacy-small.svg)

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

# displacy.serve blocks until the server is stopped
doc_dep = nlp("This is a sentence.")
displacy.serve(doc_dep, style="dep")

doc_ent = nlp("When Sebastian Thrun started working on self-driving cars at Google "
              "in 2007, few people outside of the company took him seriously.")
displacy.serve(doc_ent, style="ent")
```

<Infobox>

**API:** [`displacy`](/api/top-level#displacy) **Usage:**
[Visualizers](/usage/visualizers)

</Infobox>
 								### Get word vectors and similarity {#lightning-tour-word-vectors model="vectors"}
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_md")
-												Remove u-strings and fix formatting [ci skip]

											
										
										
											2019-09-12 14:11:15 +00:00
+								doc = nlp("Apple and banana are similar. Pasta and hippo aren't.")
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
 								apple = doc[0]
 								banana = doc[2]
 								pasta = doc[6]
 								hippo = doc[8]
 								print("apple <-> banana", apple.similarity(banana))
 								print("pasta <-> hippo", pasta.similarity(hippo))
 								print(apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector)
 								```
 								For the best results, you should run this example using the
-												Divide models into core and starters [ci skip]

											
										
										
											2019-12-21 13:10:22 +00:00
+								[`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg) model (currently
 								not available in the live demo).
 								<Infobox>
 								**Usage:** [Word vectors and similarity](/usage/vectors-similarity)
 								</Infobox>
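
By default, spaCy estimates similarity as the cosine similarity between vectors. The sketch below shows that calculation in plain Python, without spaCy. The three-dimensional vectors are invented for illustration and are not taken from any model, and `cosine_similarity` is a hypothetical helper rather than part of spaCy's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: an all-zero vector is similar to nothing
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example only
apple_vec = [0.9, 0.1, 0.3]
banana_vec = [0.8, 0.2, 0.4]
hippo_vec = [0.1, 0.9, 0.2]

print("apple <-> banana", cosine_similarity(apple_vec, banana_vec))
print("apple <-> hippo", cosine_similarity(apple_vec, hippo_vec))
```

Vectors that point in a similar direction score close to 1.0, which is why a model without real word vectors tends to give less useful similarity scores.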
 								### Simple and efficient serialization {#lightning-tour-serialization}
 								```python
 								import spacy
 								from spacy.tokens import Doc
 								from spacy.vocab import Vocab
 								nlp = spacy.load("en_core_web_sm")
 								# Read the text with a context manager so the file handle is closed
 								with open("customer_feedback_627.txt", encoding="utf8") as f:
 								    customer_feedback = f.read()
 								doc = nlp(customer_feedback)
 								doc.to_disk("/tmp/customer_feedback_627.bin")
 								new_doc = Doc(Vocab()).from_disk("/tmp/customer_feedback_627.bin")
 								```
 								<Infobox>
 								**API:** [`Language`](/api/language), [`Doc`](/api/doc) **Usage:**
 								[Saving and loading models](/usage/saving-loading#models)
 								</Infobox>
 								### Match text with token rules {#lightning-tour-rule-matcher}
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.matcher import Matcher
 								nlp = spacy.load("en_core_web_sm")
 								matcher = Matcher(nlp.vocab)
 								def set_sentiment(matcher, doc, i, matches):
 								    doc.sentiment += 0.1
 								patterns1 = [[{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]]
 								patterns2 = [[{"ORTH": emoji, "OP": "+"}] for emoji in ["😀", "😂", "🤣", "😍"]]
 								matcher.add("GoogleIO", patterns1)  # Match "Google I/O" (ORTH is an exact, case-sensitive match)
 								matcher.add("HAPPY", patterns2, on_match=set_sentiment)  # Match one or more happy emoji
 								doc = nlp("A text about Google I/O 😀😀")
 								matches = matcher(doc)
 								for match_id, start, end in matches:
 								    string_id = nlp.vocab.strings[match_id]
 								    span = doc[start:end]
 								    print(string_id, span.text)
 								print("Sentiment", doc.sentiment)
 								```
 								<Infobox>
 								**API:** [`Matcher`](/api/matcher) **Usage:**
 								[Rule-based matching](/usage/rule-based-matching)
 								</Infobox>
 								### Minibatched stream processing {#lightning-tour-minibatched}
 								```python
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								texts = ["One document.", "...", "Lots of documents"]
 								# .pipe streams input, and produces streaming output
 								iter_texts = (texts[i % 3] for i in range(100000000))
 								for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50)):
 								    assert doc.is_parsed
 								    if i == 100:
 								        break
 								```
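
To make the streaming behaviour more concrete, here is a minimal sketch of minibatching a lazy stream in plain Python. It is not spaCy's implementation: `minibatch` and `fake_pipe` are hypothetical helpers, and `fake_pipe` merely upper-cases each text as a stand-in for real processing. The point is that input is consumed one batch at a time and results are yielded as soon as they are ready, so a very large generator never has to be materialized.

```python
from itertools import islice


def minibatch(items, size):
    """Yield lists of up to `size` items from any iterable, lazily."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_pipe(texts, batch_size):
    """Hypothetical stand-in for a pipeline: process texts batch by batch."""
    for batch in minibatch(texts, batch_size):
        for text in batch:
            # Placeholder for real work on each text
            yield text.upper()


texts = ["One document.", "...", "Lots of documents"]
stream = (texts[i % 3] for i in range(100000000))

# Only the first batch is ever pulled from the huge generator
first = list(islice(fake_pipe(stream, batch_size=50), 5))
print(first)
```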
 								### Get syntactic dependencies {#lightning-tour-dependencies model="parser"}
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("When Sebastian Thrun started working on self-driving cars at Google "
 								          "in 2007, few people outside of the company took him seriously.")
 								dep_labels = []
 								for token in doc:
 								    while token.head != token:
 								        dep_labels.append(token.dep_)
 								        token = token.head
 								print(dep_labels)
 								```
 								<Infobox>
 								**API:** [`Token`](/api/token) **Usage:**
 								[Using the dependency parse](/usage/linguistic-features#dependency-parse)
 								</Infobox>
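
The loop in the example above follows each token's head until it reaches the root, which is the token that is its own head. The sketch below mirrors that walk on a hand-written toy parse, without spaCy. The `heads` and `deps` lists are assumptions made for illustration and are not real parser output, and `path_to_root` is a hypothetical helper.

```python
# Toy parse of "few people took him seriously" (hand-written, not model output)
words = ["few", "people", "took", "him", "seriously"]
heads = [1, 2, 2, 2, 2]  # index of each token's head; the root points to itself
deps = ["amod", "nsubj", "ROOT", "dobj", "advmod"]


def path_to_root(index):
    """Collect dependency labels while climbing from a token to the root."""
    labels = []
    while heads[index] != index:
        labels.append(deps[index])
        index = heads[index]
    return labels


for i, word in enumerate(words):
    print(word, path_to_root(i))
```

As in the spaCy example, the root's own label is never appended, because the loop stops as soon as a token is its own head.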
 								### Export to numpy arrays {#lightning-tour-numpy-arrays}
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.attrs import ORTH, LIKE_URL
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Check out https://spacy.io")
 								for token in doc:
 								    print(token.text, token.orth, token.like_url)
 								attr_ids = [ORTH, LIKE_URL]
 								doc_array = doc.to_array(attr_ids)
 								print(doc_array.shape)
 								print(len(doc), len(attr_ids))
 								assert doc[0].orth == doc_array[0, 0]
 								assert doc[1].orth == doc_array[1, 0]
 								assert doc[0].like_url == doc_array[0, 1]
 								assert list(doc_array[:, 1]) == [t.like_url for t in doc]
 								print(list(doc_array[:, 1]))
 								```
 								### Calculate inline markup on original string {#lightning-tour-inline}
 								```python
 								### {executable="true"}
 								import spacy
 								def put_spans_around_tokens(doc):
 								    """Here, we're building a custom "syntax highlighter" for
 								    part-of-speech tags and dependencies. We put each token in a
 								    span element, with the appropriate classes computed. All whitespace is
 								    preserved, outside of the spans. (Of course, HTML will only display
 								    multiple whitespace if enabled – but the point is, no information is lost
 								    and you can calculate what you need, e.g. <br />, <p> etc.)
 								    """
 								    output = []
 								    for token in doc:
 								        if token.is_space:
 								            output.append(token.text)
 								        else:
 								            classes = f"pos-{token.pos_} dep-{token.dep_}"
 								            output.append(f'<span class="{classes}">{token.text}</span>{token.whitespace_}')
 								    string = "".join(output)
 								    string = string.replace("\n", "")
 								    string = string.replace("\t", "    ")
 								    return f"<pre>{string}</pre>"
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("This is a test.\n\nHello   world.")
 								html = put_spans_around_tokens(doc)
 								print(html)
 								```
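
The docstring's claim that no information is lost can be checked without spaCy. In the sketch below, each token is a hand-written `(text, trailing_whitespace)` pair that stands in for `token.text` and `token.whitespace_`. The exact split is an assumption and may differ from what a real tokenizer produces. Joining the pairs rebuilds the original string, and a running offset gives each token's character position.

```python
original = "This is a test.\n\nHello   world."

# Hand-written (text, trailing whitespace) pairs; a real tokenizer may split differently
tokens = [
    ("This", " "), ("is", " "), ("a", " "), ("test", ""), (".", ""),
    ("\n\n", ""), ("Hello", " "), ("  ", ""), ("world", ""), (".", ""),
]

# Rebuild the text: every character belongs to a token or to its trailing whitespace
rebuilt = "".join(text + whitespace for text, whitespace in tokens)

# Track where each token starts in the original string
offsets = []
position = 0
for text, whitespace in tokens:
    offsets.append(position)
    position += len(text) + len(whitespace)

print(rebuilt == original)
print(offsets)
```

Because the offsets index back into the unmodified string, markup can be calculated against the original text rather than a normalized copy.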
 								## Architecture {#architecture}
 								import Architecture101 from 'usage/101/\_architecture.md'
 								<Architecture101 />
 								## Community & FAQ {#community-faq}
 								We're very happy to see the spaCy community grow and include a mix of people
 								from all kinds of different backgrounds – computational linguistics, data
 								science, deep learning, research and more. If you'd like to get involved, below
 								are some answers to the most important questions and resources for further
 								reading.
 								### Help, my code isn't working! {#faq-help-code}
 								Bugs suck, and we're doing our best to continuously improve the tests and fix
 								bugs as soon as possible. Before you submit an issue, do a quick search and
 								check if the problem has already been reported. If you're having installation or
 								loading problems, make sure to also check out the
 								[troubleshooting guide](/usage/#troubleshooting). Help with spaCy is available
 								via the following platforms:
 								> #### How do I know if something is a bug?
 								>
 								> Of course, it's always hard to know for sure, so don't worry – we're not going
 								> to be mad if a bug report turns out to be a typo in your code. As a simple
 								> rule, any C-level error without a Python traceback, like a **segmentation
 								> fault** or **memory error**, is **always** a spaCy bug.
 								>
 								> Because models are statistical, their performance will never be _perfect_.
 								> However, if you come across **patterns that might indicate an underlying
 								> issue**, please do file a report. Similarly, we also care about behaviors that
 								> **contradict our docs**.
 								- [Stack Overflow](https://stackoverflow.com/questions/tagged/spacy): **Usage
 								  questions** and everything related to problems with your specific code. The
 								  Stack Overflow community is much larger than ours, so if your problem can be
 								  solved by others, you'll receive help much quicker.
 								- [Gitter chat](https://gitter.im/explosion/spaCy): **General discussion** about
 								  spaCy, meeting other community members and exchanging **tips, tricks and best
 								  practices**.
 								- [GitHub issue tracker](https://github.com/explosion/spaCy/issues): **Bug
 								  reports** and **improvement suggestions**, i.e. everything that's likely
 								  spaCy's fault. This also includes problems with the models beyond statistical
 								  imprecisions, like patterns that point to a bug.
 								<Infobox title="Important note" variant="warning">
 								Please understand that we won't be able to provide individual support via email.
 								We also believe that help is much more valuable if it's shared publicly, so that
 								**more people can benefit from it**. If you come across an issue and you think
 								you might be able to help, consider posting a quick update with your solution.
 								No matter how simple, it can easily save someone a lot of time and headache –
 								and the next time you need help, they might repay the favor.
 								</Infobox>
 								### How can I contribute to spaCy? {#faq-contributing}
 								You don't have to be an NLP expert or Python pro to contribute, and we're happy
 								to help you get started. If you're new to spaCy, a good place to start is the
 								[`help wanted (easy)` label](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A"help+wanted+%28easy%29")
 								on GitHub, which we use to tag bugs and feature requests that are easy and
 								self-contained. We also appreciate contributions to the docs – whether it's
 								fixing a typo, improving an example or adding additional explanations. You'll
 								find a "Suggest edits" link at the bottom of each page that points you to the
 								source.
 								Another way of getting involved is to help us improve the
 								[language data](/usage/adding-languages#language-data) – especially if you
 								happen to speak one of the languages currently in
 								[alpha support](/usage/models#languages). Even adding simple tokenizer
 								exceptions, stop words or lemmatizer data can make a big difference. It will
 								also make it easier for us to provide a statistical model for the language in
 								the future. Submitting a test that documents a bug or performance issue, or
 								covers functionality that's especially important for your application is also
 								very helpful. This way, you'll also make sure we never accidentally introduce
 								regressions to the parts of the library that you care about the most.
 								**For more details on the types of contributions we're looking for, the code
 								conventions and other useful tips, make sure to check out the
 								[contributing guidelines](https://github.com/explosion/spaCy/tree/master/CONTRIBUTING.md).**
 								<Infobox title="Code of Conduct" variant="warning">
 								spaCy adheres to the
 								[Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/).
 								By participating, you are expected to uphold this code.
 								</Infobox>
 								### I've built something cool with spaCy – how can I get the word out? {#faq-project-with-spacy}
 								First, congrats – we'd love to check it out! When you share your project on
 								Twitter, don't forget to tag [@spacy_io](https://twitter.com/spacy_io) so we
 								don't miss it. If you think your project would be a good fit for the
 								[spaCy Universe](/universe), **feel free to submit it!** Tutorials are also
 								incredibly valuable to other users and a great way to get exposure. So we
 								strongly encourage **writing up your experiences**, or sharing your code and
 								some tips and tricks on your blog. Since our website is open-source, you can add
 								your project or tutorial by making a pull request on GitHub.
 								If you would like to use the spaCy logo on your site, please get in touch and
 								ask us first. However, if you want to show support and tell others that your
 								project is using spaCy, you can grab one of our **spaCy badges** here:
 								<img src={`https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg`} />
 								```markdown
 								[![Built with spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg)](https://spacy.io)
 								```
 								<img src={`https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg`}
 								/>
 								```markdown
 								[![Built with spaCy](https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg)](https://spacy.io)
 								```