---
title: Language Processing Pipelines
next: vectors-similarity
menu:
  - ['Processing Text', 'processing']
  - ['How Pipelines Work', 'pipelines']
  - ['Custom Components', 'custom-components']
  - ['Extension Attributes', 'custom-components-attributes']
  - ['Plugins & Wrappers', 'plugins']
---

import Pipelines101 from 'usage/101/\_pipelines.md'

<Pipelines101 />

## Processing text {#processing}

When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. It then returns the processed `Doc` that you
can work with.

```python
doc = nlp("This is a text")
```

When processing large volumes of text, the statistical models are usually more
efficient if you let them work on batches of texts. spaCy's
[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
processed `Doc` objects. The batching is done internally.

```diff
texts = ["This is a text", "These are lots of texts", "..."]
- docs = [nlp(text) for text in texts]
+ docs = list(nlp.pipe(texts))
```
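
The internal batching can be pictured roughly as follows. This is a plain-Python sketch, not spaCy's implementation: `minibatch`, `process_batch` and `pipe` are hypothetical helpers, and upper-casing a string stands in for running a model on a whole batch.

```python
from itertools import islice

def minibatch(items, size):
    """Lazily group any iterable into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def process_batch(batch):
    # Hypothetical stand-in for a model call that handles a batch at once
    return [text.upper() for text in batch]

def pipe(texts, batch_size=2):
    """Yield results one by one while working on batches internally."""
    for batch in minibatch(texts, batch_size):
        yield from process_batch(batch)

texts = ["This is a text", "These are lots of texts", "..."]
results = list(pipe(texts))
```

Because `minibatch` only pulls `size` items at a time, the input can be a generator that is far too large to hold in memory.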

<Infobox title="Tips for efficient processing">

- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
  buffer them in batches, instead of one-by-one. This is usually much more
  efficient.
- Only apply the **pipeline components you need**. Getting predictions from the
  model that you don't actually need adds up and becomes very inefficient at
  scale. To prevent this, use the `disable` keyword argument to disable
  components you don't need – either when loading a model, or during processing
  with `nlp.pipe`. See the section on
  [disabling pipeline components](#disabling) for more details and examples.

</Infobox>

In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
(potentially very large) iterable of texts as a stream. Because we're only
accessing the named entities in `doc.ents` (set by the `ner` component), we'll
disable all other statistical components (the `tagger` and `parser`) during
processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
access the named entity predictions:

> #### ✏️ Things to try
>
> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
>    empty, because the entity recognizer didn't run.

```python
### {executable="true"}
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
```

<Infobox title="Important note" variant="warning">

When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
[generator](https://realpython.com/introduction-to-python-generators/) that
yields `Doc` objects – not a list. So if you want to use it like a list, you'll
have to call `list()` on it first:

```diff
- docs = nlp.pipe(texts)[0]         # will raise an error
+ docs = list(nlp.pipe(texts))[0]   # works as expected
```

</Infobox>
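
If you only need the first few results, you don't have to materialize the whole stream. The sketch below uses a hypothetical `fake_pipe` generator as a stand-in for `nlp.pipe`, so it runs without a model. `itertools.islice` and `next` behave the same way on any generator.

```python
from itertools import islice

def fake_pipe(texts):
    # Hypothetical stand-in for nlp.pipe: lazily yields one result per text
    for text in texts:
        yield text.split()

texts = ["This is a text", "These are lots of texts", "..."]

stream = fake_pipe(texts)
try:
    stream[0]  # generators don't support indexing
except TypeError as err:
    error_name = type(err).__name__

first_two = list(islice(fake_pipe(texts), 2))  # consumes only two items
first = next(fake_pipe(texts))                 # or take just the first one
```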

## How pipelines work {#pipelines}

spaCy makes it very easy to create your own pipelines consisting of reusable
components – this includes spaCy's default tagger, parser and entity recognizer,
but also your own custom processing functions. A pipeline component can be added
to an already existing `nlp` object, specified when initializing a `Language`
class, or defined within a [model package](/usage/saving-loading#models).

When you load a model, spaCy first consults the model's
[`meta.json`](/usage/saving-loading#models). The meta typically includes the
model details, the ID of a language class, and an optional list of pipeline
components. spaCy then does the following:

> #### meta.json (excerpt)
>
> ```json
> {
>   "lang": "en",
>   "name": "core_web_sm",
>   "description": "Example model for spaCy",
>   "pipeline": ["tagger", "parser", "ner"]
> }
> ```

1. Load the **language class and data** for the given ID via
   [`get_lang_class`](/api/top-level#util.get_lang_class) and initialize it. The
   `Language` class contains the shared vocabulary, tokenization rules and the
   language-specific annotation scheme.
2. Iterate over the **pipeline names** and create each component using
   [`create_pipe`](/api/language#create_pipe), which looks them up in
   `Language.factories`.
3. Add each pipeline component to the pipeline in order, using
   [`add_pipe`](/api/language#add_pipe).
4. Make the **model data** available to the `Language` class by calling
   [`from_disk`](/api/language#from_disk) with the path to the model data
   directory.

So when you call this...

```python
nlp = spacy.load("en_core_web_sm")
```

... the model's `meta.json` tells spaCy to use the language `"en"` and the
pipeline `["tagger", "parser", "ner"]`. spaCy will then initialize
`spacy.lang.en.English`, and create each pipeline component and add it to the
processing pipeline. It'll then load in the model's data from its data directory
and return the initialized `Language` instance for you to use as the `nlp` object.

Fundamentally, a [spaCy model](/models) consists of three components: **the
weights**, i.e. binary data loaded in from a directory, a **pipeline** of
functions called in order, and **language data** like the tokenization rules and
annotation scheme. All of this is specific to each model, and defined in the
model's `meta.json` – for example, a Spanish NER model requires different
weights, language data and pipeline components than an English parsing and
tagging model. This is also why the pipeline state is always held by the
`Language` class. [`spacy.load`](/api/top-level#spacy.load) puts this all
together and returns an instance of `Language` with a pipeline set and access to
the binary data:

```python
### spacy.load under the hood
lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"

cls = spacy.util.get_lang_class(lang)   # 1. Get Language class, e.g. English
nlp = cls()                             # 2. Initialize it
for name in pipeline:
    component = nlp.create_pipe(name)   # 3. Create the pipeline components
    nlp.add_pipe(component)             # 4. Add the component to the pipeline
nlp.from_disk(data_path)                # 5. Load in the binary data
```

When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. Since the model data is loaded, the
components can access it to assign annotations to the `Doc` object, and
subsequently to the `Token` and `Span` which are only views of the `Doc`, and
don't own any data themselves. All components return the modified document,
which is then processed by the component next in the pipeline.

```python
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence")   # create a Doc from raw text
for name, proc in nlp.pipeline:             # iterate over components in order
    doc = proc(doc)                         # apply each component
```

The current processing pipeline is available as `nlp.pipeline`, which returns a
list of `(name, component)` tuples, or `nlp.pipe_names`, which only returns a
list of human-readable component names.

```python
print(nlp.pipeline)
# [('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>)]
print(nlp.pipe_names)
# ['tagger', 'parser', 'ner']
```

### Built-in pipeline components {#built-in}

spaCy ships with several built-in pipeline components that are also available in
the `Language.factories`. This means that you can initialize them by calling
[`nlp.create_pipe`](/api/language#create_pipe) with their string names and
require them in the pipeline settings in your model's `meta.json`.

> #### Usage
>
> ```python
> # Option 1: Import and initialize
> from spacy.pipeline import EntityRuler
> ruler = EntityRuler(nlp)
> nlp.add_pipe(ruler)
>
> # Option 2: Using nlp.create_pipe
> sentencizer = nlp.create_pipe("sentencizer")
> nlp.add_pipe(sentencizer)
> ```

| String name         | Component                                                        | Description                                                                                   |
| ------------------- | ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| `tagger`            | [`Tagger`](/api/tagger)                                          | Assign part-of-speech-tags.                                                                   |
| `parser`            | [`DependencyParser`](/api/dependencyparser)                      | Assign dependency labels.                                                                     |
| `ner`               | [`EntityRecognizer`](/api/entityrecognizer)                      | Assign named entities.                                                                        |
| `entity_linker`     | [`EntityLinker`](/api/entitylinker)                              | Assign knowledge base IDs to named entities. Should be added after the entity recognizer.     |
| `textcat`           | [`TextCategorizer`](/api/textcategorizer)                        | Assign text categories.                                                                       |
| `entity_ruler`      | [`EntityRuler`](/api/entityruler)                                | Assign named entities based on pattern rules.                                                 |
| `sentencizer`       | [`Sentencizer`](/api/sentencizer)                                | Add rule-based sentence segmentation without the dependency parse.                            |
| `merge_noun_chunks` | [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) | Merge all noun chunks into a single token. Should be added after the tagger and parser.       |
| `merge_entities`    | [`merge_entities`](/api/pipeline-functions#merge_entities)       | Merge all entities into a single token. Should be added after the entity recognizer.          |
| `merge_subtokens`   | [`merge_subtokens`](/api/pipeline-functions#merge_subtokens)     | Merge subtokens predicted by the parser into single tokens. Should be added after the parser. |

### Disabling and modifying pipeline components {#disabling}

If you don't need a particular component of the pipeline, for example the
tagger or the parser, you can **disable loading** it. This can sometimes make a
big difference and improve loading speed. Disabled component names can be
provided to [`spacy.load`](/api/top-level#spacy.load),
[`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a
list:

```python
### Disable loading
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
nlp = English().from_disk("/model", disable=["ner"])
```

In some cases, you do want to load all pipeline components and their weights,
because you need them at different points in your application. However, if you
only need a `Doc` object with named entities, there's no need to run all
pipeline components on it – that can potentially make processing much slower.
Instead, you can use the `disable` keyword argument on
[`nlp.pipe`](/api/language#pipe) to temporarily disable the components **during
processing**:

```python
### Disable for processing
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    pass
```
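
Conceptually, disabling components during processing means skipping them while looping over the pipeline. The following is a plain-Python sketch of that idea and not spaCy's own code: `run_pipeline` and `make_component` are hypothetical, and a dict stands in for the `Doc`.

```python
def run_pipeline(doc, pipeline, disable=()):
    """Apply each (name, component) pair in order, skipping disabled names."""
    for name, component in pipeline:
        if name in disable:
            continue
        doc = component(doc)
    return doc

def make_component(name):
    # Hypothetical component that records its name on a plain dict "doc"
    def component(doc):
        doc["annotated_by"].append(name)
        return doc
    return component

pipeline = [(name, make_component(name)) for name in ("tagger", "parser", "ner")]

full = run_pipeline({"annotated_by": []}, pipeline)
ents_only = run_pipeline({"annotated_by": []}, pipeline, disable=["tagger", "parser"])
```

The components stay loaded in `pipeline` the whole time; they are just not called for this run.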

If you need to **execute more code** with components disabled – e.g. to reset
the weights or update only some components during training – you can use the
[`nlp.select_pipes`](/api/language#select_pipes) context manager. At the end of
the `with` block, the disabled pipeline components will be restored
automatically. Alternatively, `select_pipes` returns an object that lets you
call its `restore()` method to restore the disabled components when needed. This
can be useful if you want to prevent unnecessary code indentation of large
blocks.

```python
### Disable for block
# 1. Use as a contextmanager
with nlp.select_pipes(disable=["tagger", "parser"]):
    doc = nlp("I won't be tagged and parsed")
doc = nlp("I will be tagged and parsed")

# 2. Restore manually
disabled = nlp.select_pipes(disable="ner")
doc = nlp("I won't have named entities")
disabled.restore()
```

If you want to disable all pipes except for one or a few, you can use the
`enable` keyword. Just like the `disable` keyword, it takes a list of pipe
names, or a string defining just one pipe.

```python
# Enable only the parser
with nlp.select_pipes(enable="parser"):
    doc = nlp("I will only be parsed")
```
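
To make the restore behavior more concrete, here is a minimal sketch of how a context manager of this kind could work. `ToyPipeline` is a hypothetical stand-in written for illustration, not spaCy's implementation, and it leaves out the manual `restore()` variant.

```python
from contextlib import contextmanager

class ToyPipeline:
    """Minimal stand-in that only tracks (name, component) pairs."""

    def __init__(self, names):
        self.pipeline = [(name, lambda doc: doc) for name in names]

    @property
    def pipe_names(self):
        return [name for name, _ in self.pipeline]

    @contextmanager
    def select_pipes(self, disable=None, enable=None):
        # Accept a single name or a list of names, as described above
        if isinstance(disable, str):
            disable = [disable]
        if isinstance(enable, str):
            enable = [enable]
        if enable is not None:
            disable = [n for n in self.pipe_names if n not in enable]
        elif disable is None:
            disable = []
        original = list(self.pipeline)  # remember the full pipeline
        self.pipeline = [(n, c) for n, c in original if n not in disable]
        try:
            yield self
        finally:
            self.pipeline = original    # restore, even if the block raised

nlp = ToyPipeline(["tagger", "parser", "ner"])
with nlp.select_pipes(disable=["tagger", "parser"]):
    inside_disable = nlp.pipe_names
with nlp.select_pipes(enable="parser"):
    inside_enable = nlp.pipe_names
after = nlp.pipe_names
```

Putting the restore step in a `finally` clause is the design choice that matters here: the pipeline comes back even if the code inside the `with` block fails.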

Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
to remove pipeline components from an existing pipeline, the
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
[`replace_pipe`](/api/language#replace_pipe) method to replace them with a
custom component entirely (more details on this in the section on
[custom components](#custom-components)).

```python
nlp.remove_pipe("parser")
nlp.rename_pipe("ner", "entityrecognizer")
nlp.replace_pipe("tagger", my_custom_tagger)
```

<Infobox title="Important note: disabling pipeline components" variant="warning">

Since spaCy v2.0 comes with better support for customizing the processing
pipeline components, the `parser`, `tagger` and `entity` keyword arguments have
been replaced with `disable`, which takes a list of pipeline component names.
This lets you disable pre-defined components when loading a model, or
initializing a Language class via [`from_disk`](/api/language#from_disk).

```diff
- nlp = spacy.load('en', tagger=False, entity=False)
- doc = nlp("I don't want parsed", parse=False)

+ nlp = spacy.load("en", disable=["ner"])
+ nlp.remove_pipe("parser")
+ doc = nlp("I don't want parsed")
```

</Infobox>

## Creating custom pipeline components {#custom-components}

A component receives a `Doc` object and can modify it – for example, by using
the current weights to make a prediction and set some annotation on the
document. By adding a component to the pipeline, you'll get access to the `Doc`
at any point **during processing** – instead of only being able to modify it
afterwards.

> #### Example
>
> ```python
> def my_component(doc):
>     # do something to the doc here
>     return doc
> ```

| Argument    | Type  | Description                                            |
| ----------- | ----- | ------------------------------------------------------ |
| `doc`       | `Doc` | The `Doc` object processed by the previous component.  |
| **RETURNS** | `Doc` | The `Doc` object processed by this pipeline component. |

Custom components can be added to the pipeline using the
[`add_pipe`](/api/language#add_pipe) method. Optionally, you can either specify
a component to add it **before or after**, tell spaCy to add it **first or
last** in the pipeline, or define a **custom name**. If no name is set and no
`name` attribute is present on your component, the function name is used.

> #### Example
>
> ```python
> nlp.add_pipe(my_component)
> nlp.add_pipe(my_component, first=True)
> nlp.add_pipe(my_component, before="parser")
> ```

| Argument | Type | Description                                                              |
| -------- | ---- | ------------------------------------------------------------------------ |
| `last`   | bool | If set to `True`, component is added **last** in the pipeline (default). |
| `first`  | bool | If set to `True`, component is added **first** in the pipeline.          |
| `before` | str  | String name of component to add the new component **before**.            |
| `after`  | str  | String name of component to add the new component **after**.             |
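
The placement and naming rules above can be sketched on a plain list of `(name, component)` tuples. This standalone `add_pipe` function is a hypothetical illustration of the logic, not the method spaCy ships, and its error handling is an assumption.

```python
def add_pipe(pipeline, component, name=None, before=None, after=None,
             first=None, last=None):
    """Insert (name, component) into a list of pairs, in place."""
    # Name resolution: explicit name, then a `name` attribute, then __name__
    if name is None:
        name = getattr(component, "name", None) or component.__name__
    names = [n for n, _ in pipeline]
    if name in names:
        raise ValueError(f"{name!r} is already in the pipeline")
    if sum(bool(arg) for arg in (before, after, first, last)) > 1:
        raise ValueError("Set only one of before, after, first or last")
    if first:
        index = 0
    elif before is not None:
        index = names.index(before)
    elif after is not None:
        index = names.index(after) + 1
    else:
        index = len(pipeline)  # default: add last
    pipeline.insert(index, (name, component))

def noop(doc):
    return doc

def my_component(doc):
    return doc

pipeline = [("tagger", noop), ("parser", noop), ("ner", noop)]
add_pipe(pipeline, my_component, before="parser")
add_pipe(pipeline, my_component, name="print_info", first=True)
names = [n for n, _ in pipeline]
```

The same function object is added twice here, which works because the second call supplies a different name.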

### Example: A simple pipeline component {#custom-components-simple}

The following component receives the `Doc` in the pipeline and prints some
information about it: the number of tokens, the part-of-speech tags of the
tokens and a conditional message based on the document length.

> #### ✏️ Things to try
>
> 1. Add the component first in the pipeline by setting `first=True`. You'll see
>    that the part-of-speech tags are empty, because the component now runs
>    before the tagger and the tags aren't available yet.
> 2. Change the component `name` or remove the `name` argument. You should see
>    this change reflected in `nlp.pipe_names`.
> 3. Print `nlp.pipeline`. You'll see a list of tuples describing the component
>    name and the function that's called on the `Doc` object in the pipeline.

```python
### {executable="true"}
import spacy

def my_component(doc):
    print(f"After tokenization, this doc has {len(doc)} tokens.")
    print("The part-of-speech tags are:", [token.pos_ for token in doc])
    if len(doc) < 10:
        print("This is a pretty short document.")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(my_component, name="print_info", last=True)
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner', 'print_info']
doc = nlp("This is a sentence.")

```

Of course, you can also wrap your component as a class to allow initializing it
with custom settings and hold state within the component. This is useful for
**stateful components**, especially ones which **depend on shared data**. In the
following example, the custom component `EntityMatcher` can be initialized with
an `nlp` object, a terminology list and an entity label. Using the
[`PhraseMatcher`](/api/phrasematcher), it then matches the terms in the `Doc`
and adds them to the existing entities.

<Infobox title="Important note" variant="warning">

As of v2.1.0, spaCy ships with the [`EntityRuler`](/api/entityruler), a pipeline
component for easy, rule-based named entity recognition. Its implementation is
similar to the `EntityMatcher` code shown below, but it includes some additional
features like support for phrase patterns and token patterns, handling overlaps
with existing entities and pattern export as JSONL.

We'll still keep the pipeline component example below, as it works well to
illustrate complex components. But if you're planning on using this type of
component in your application, you might find the `EntityRuler` more convenient.
[See here](/usage/rule-based-matching#entityruler) for more details and
examples.

</Infobox>

```python
### {executable="true"}
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = "entity_matcher"

    def __init__(self, nlp, terms, label):
        patterns = [nlp.make_doc(text) for text in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        return doc

nlp = spacy.load("en_core_web_sm")
terms = ("cat", "dog", "tree kangaroo", "giant sea spider")
entity_matcher = EntityMatcher(nlp, terms, "ANIMAL")

nlp.add_pipe(entity_matcher, after="ner")

print(nlp.pipe_names)  # The components in the pipeline

doc = nlp("This is a text about Barack Obama and a tree kangaroo")
print([(ent.text, ent.label_) for ent in doc.ents])
```

### Example: Custom sentence segmentation logic {#component-example1}

Let's say you want to implement custom logic to improve spaCy's sentence
boundary detection. Currently, sentence segmentation is based on the dependency
parse, which doesn't always produce ideal results. The custom logic should
therefore be applied **after** tokenization, but _before_ the dependency parsing
– this way, the parser can also take advantage of the sentence boundaries.

> #### ✏️ Things to try
>
> 1. Print `[token.dep_ for token in doc]` with and without the custom pipeline
>    component. You'll see that the predicted dependency parse changes to match
>    the sentence boundaries.
> 2. Remove the `else` block. All other tokens will now have `is_sent_start` set
>    to `None` (missing value), so the parser will assign sentence boundaries in
>    between.

```python
### {executable="true"}
import spacy

def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i+1].is_title:
            doc[i+1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i+1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(custom_sentencizer, before="parser")  # Insert before the parser
doc = nlp("This is. A sentence. | This is. Another sentence.")
for sent in doc.sents:
    print(sent.text)
```

### Example: Pipeline component for entity matching and tagging with custom attributes {#component-example2}

This example shows how to create a spaCy extension that takes a terminology list
(in this case, single- and multi-word company names), matches the occurrences in
a document, labels them as `ORG` entities, merges the tokens and sets custom
`is_tech_org` and `has_tech_org` attributes. For efficient matching, the example
uses the [`PhraseMatcher`](/api/phrasematcher) which accepts `Doc` objects as
match patterns and works well for large terminology lists. It also ensures your
patterns will always match, even when you customize spaCy's tokenization rules.
When you call `nlp` on a text, the custom pipeline component is applied to the
`Doc`.

```python
https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_entities.py
```

Wrapping this functionality in a pipeline component allows you to reuse the
module with different settings, and have all pre-processing taken care of when
you call `nlp` on your text and receive a `Doc` object.

### Adding factories {#custom-components-factories}

When spaCy loads a model via its `meta.json`, it will iterate over the
`"pipeline"` setting, look up every component name in the internal factories and
call [`nlp.create_pipe`](/api/language#create_pipe) to initialize the individual
components, like the tagger, parser or entity recognizer. If your model uses
custom components, this won't work – so you'll have to tell spaCy **where to
find your component**. You can do this by writing to the `Language.factories`:

```python
from spacy.language import Language
Language.factories["entity_matcher"] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
```

You can also ship the above code and your custom component in your packaged
model's `__init__.py`, so it's executed when you load your model. The `**cfg`
config parameters are passed all the way down from
[`spacy.load`](/api/top-level#spacy.load), so you can load the model and its
components with custom settings:

```python
nlp = spacy.load("your_custom_model", terms=["tree kangaroo"], label="ANIMAL")
```
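
The lookup itself is a plain name-to-function registry. The sketch below mimics it without importing spaCy, so `ToyLanguage`, `load` and this simplified `EntityMatcher` are all hypothetical, and the way `**cfg` is threaded through is an assumption made for illustration.

```python
class EntityMatcher:
    name = "entity_matcher"

    def __init__(self, nlp, terms=(), label="ENTITY"):
        self.terms = tuple(terms)
        self.label = label

    def __call__(self, doc):
        return doc

class ToyLanguage:
    factories = {}  # class-level registry: component name -> factory function

    def __init__(self, **cfg):
        self.cfg = cfg
        self.pipeline = []

    def create_pipe(self, name):
        if name not in self.factories:
            raise KeyError(f"Can't find factory for {name!r}")
        # The factory receives the nlp object plus any config settings
        return self.factories[name](self, **self.cfg)

ToyLanguage.factories["entity_matcher"] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)

def load(pipeline_names, **cfg):
    """Build a pipeline from names, passing settings down to each factory."""
    nlp = ToyLanguage(**cfg)
    for name in pipeline_names:
        nlp.pipeline.append((name, nlp.create_pipe(name)))
    return nlp

nlp = load(["entity_matcher"], terms=["tree kangaroo"], label="ANIMAL")
matcher = nlp.pipeline[0][1]
```

A component name that was never registered fails at `create_pipe`, which is why the factory has to be added before the pipeline is built.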

<Infobox title="Important note" variant="warning">

When you load a model via its shortcut or package name, like `en_core_web_sm`,
spaCy will import the package and then call its `load()` method. This means that
custom code in the model's `__init__.py` will be executed, too. This is **not
the case** if you're loading a model from a path containing the model data.
Here, spaCy will only read in the `meta.json`. If you want to use custom
factories with a model loaded from a path, you need to add them to
`Language.factories` _before_ you load the model.

</Infobox>

## Extension attributes {#custom-components-attributes new="2"}

As of v2.0, spaCy allows you to set any custom attributes and methods on the
`Doc`, `Span` and `Token`, which become available as `Doc._`, `Span._` and
`Token._` – for example, `Token._.my_attr`. This lets you store additional
information relevant to your application, add new features and functionality to
spaCy, and implement your own models trained with other machine learning
libraries. It also lets you take advantage of spaCy's data structures and the
`Doc` object as the "single source of truth".

<Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">

Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
separation and makes it easier to ensure backwards compatibility. For example,
if you've implemented your own `.coref` property and spaCy claims it one day,
it'll break your code. Similarly, just by looking at the code, you'll
immediately know what's built-in and what's custom – for example,
`doc.sentiment` is spaCy, while `doc._.sent_score` isn't.

</Accordion>

<Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">

Extension definitions – the defaults, methods, getters and setters you pass in
to `set_extension` – are stored in class attributes on the `Underscore` class.
If you write to an extension attribute, e.g. `doc._.hello = True`, the data is
stored within the [`Doc.user_data`](/api/doc#attributes) dictionary. To keep the
underscore data separate from your other dictionary entries, the string `"._."`
is placed before the name, in a tuple.

</Accordion>
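
As a rough illustration of that storage scheme, the following toy classes keep extension defaults on the class and written values in a `user_data` dict under a tuple key. `ToyDoc` and `ToyUnderscore` are hypothetical, only cover attribute extensions with defaults, and use a simplified `("._.", name)` key.

```python
class ToyUnderscore:
    # Class-level registry shared by all instances: name -> default value
    extensions = {}

    def __init__(self, user_data):
        # Bypass our own __setattr__ while storing the dict reference
        object.__setattr__(self, "_user_data", user_data)

    @staticmethod
    def _key(name):
        return ("._.", name)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        if name not in self.extensions:
            raise AttributeError(f"Extension {name!r} is not registered")
        return self._user_data.get(self._key(name), self.extensions[name])

    def __setattr__(self, name, value):
        if name not in self.extensions:
            raise AttributeError(f"Extension {name!r} is not registered")
        self._user_data[self._key(name)] = value

class ToyDoc:
    def __init__(self):
        self.user_data = {}
        self._ = ToyUnderscore(self.user_data)

    @classmethod
    def set_extension(cls, name, default=None):
        ToyUnderscore.extensions[name] = default

ToyDoc.set_extension("hello", default=True)
doc = ToyDoc()
default_value = doc._.hello  # falls back to the registered default
doc._.hello = False          # stored in doc.user_data under a tuple key
```

Registering on the class is what makes the extension global, while the written value lives only in that one document's `user_data`.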

---

There are three main types of extensions, which can be defined using the
[`Doc.set_extension`](/api/doc#set_extension),
[`Span.set_extension`](/api/span#set_extension) and
[`Token.set_extension`](/api/token#set_extension) methods.

1. **Attribute extensions.** Set a default value for an attribute, which can be
   overwritten manually at any time. Attribute extensions work like "normal"
   variables and are the quickest way to store arbitrary information on a `Doc`,
   `Span` or `Token`.

   ```python
   Doc.set_extension("hello", default=True)
   assert doc._.hello
   doc._.hello = False
   ```

2. **Property extensions.** Define a getter and an optional setter function. If
   no setter is provided, the extension is immutable. Since the getter and
   setter functions are only called when you _retrieve_ the attribute, you can
   also access values of previously added attribute extensions. For example, a
   `Doc` getter can average over `Token` attributes. For `Span` extensions,
   you'll almost always want to use a property – otherwise, you'd have to write
   to _every possible_ `Span` in the `Doc` to set up the values correctly.

   ```python
   Doc.set_extension("hello", getter=get_hello_value, setter=set_hello_value)
   assert doc._.hello
   doc._.hello = "Hi!"
   ```

3. **Method extensions.** Assign a function that becomes available as an object
   method. Method extensions are always immutable. For more details and
   implementation ideas, see
   [these examples](/usage/examples#custom-components-attr-methods).

   ```python
   Doc.set_extension("hello", method=lambda doc, name: f"Hi {name}!")
   assert doc._.hello("Bob") == "Hi Bob!"
   ```

Before you can access a custom extension, you need to register it using the
`set_extension` method on the object you want to add it to, e.g. the `Doc`. Keep
in mind that extensions are always **added globally** and not just on a
particular instance. If an attribute of the same name already exists, or if
you're trying to access an attribute that hasn't been registered, spaCy will
raise an `AttributeError`.

```python
### Example
from spacy.tokens import Doc, Span, Token

fruits = ["apple", "pear", "banana", "orange", "strawberry"]
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])

Token.set_extension("is_fruit", getter=is_fruit_getter)
Doc.set_extension("has_fruit", getter=has_fruit_getter)
Span.set_extension("has_fruit", getter=has_fruit_getter)
```

> #### Usage example
>
> ```python
> doc = nlp("I have an apple and a melon")
> assert doc[3]._.is_fruit      # get Token attributes
> assert not doc[0]._.is_fruit
> assert doc._.has_fruit        # get Doc attributes
> assert doc[1:4]._.has_fruit   # get Span attributes
> ```

Once you've registered your custom attribute, you can also use the built-in
`set`, `get` and `has` methods to modify and retrieve the attributes. This is
especially useful if you want to pass in a string instead of calling
`doc._.my_attr`.

### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}

This example shows the implementation of a pipeline component that fetches
country meta data via the [REST Countries API](https://restcountries.eu), sets
entity annotations for countries, merges entities into one token and sets custom
attributes on the `Doc`, `Span` and `Token` – for example, the capital,
latitude/longitude coordinates and even the country flag.

```python
https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_countries_api.py
```

In this case, all data can be fetched on initialization in one request. However,
if you're working with text that contains incomplete country names, spelling
mistakes or foreign-language versions, you could also implement a
`like_country`-style getter function that makes a request to the search API
endpoint and returns the best-matching result.
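
A `like_country`-style function could look roughly like this. Note the swap: instead of calling the search API endpoint, this sketch does local fuzzy matching with the standard library's `difflib`, so it runs offline. The `COUNTRIES` list, the function name and the `cutoff` value are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical local list standing in for the data returned by the API
COUNTRIES = ["Germany", "France", "Czech Republic", "Netherlands"]

def like_country(text, cutoff=0.8):
    """Return the best-matching known country name, or None."""
    matches = get_close_matches(text.title(), COUNTRIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

misspelled = like_country("Germny")
incomplete = like_country("netherland")
unknown = like_country("Xyz")
```

To use it as an extension, you could register it as a getter that passes `token.text` or `span.text` to the function.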

### User hooks {#custom-components-user-hooks}

While it's generally recommended to use the `Doc._`, `Span._` and `Token._`
proxies to add your own custom attributes, spaCy offers a few exceptions to
allow **customizing the built-in methods** like
[`Doc.similarity`](/api/doc#similarity) or [`Doc.vector`](/api/doc#vector) with
your own hooks, which can rely on statistical models you train yourself. For
instance, you can provide your own on-the-fly sentence segmentation algorithm or
document similarity method.

Hooks let you customize some of the behaviors of the `Doc`, `Span` or `Token`
objects by adding a component to the pipeline. For instance, to customize the
[`Doc.similarity`](/api/doc#similarity) method, you can add a component that
sets a custom function to `doc.user_hooks['similarity']`. The built-in
`Doc.similarity` method will check the `user_hooks` dict, and delegate to your
function if you've set one. Similar results can be achieved by setting functions
to `Doc.user_span_hooks` and `Doc.user_token_hooks`.

> #### Implementation note
>
> The hooks live on the `Doc` object because the `Span` and `Token` objects are
> created lazily, and don't own any data. They just proxy to their parent `Doc`.
> This turns out to be convenient here — we only have to worry about installing
> hooks in one place.

| Name               | Customizes                                                                                                                                                                                                              |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `user_hooks`       | [`Doc.similarity`](/api/doc#similarity), [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents)                            |
| `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) |
| `user_span_hooks`  | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root)                     |

```python
### Add custom similarity hooks
class SimilarityModel(object):
    def __init__(self, model):
        self._model = model

    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.similarity
        doc.user_span_hooks["similarity"] = self.similarity
        doc.user_token_hooks["similarity"] = self.similarity
        return doc  # components must return the Doc for the next one

    def similarity(self, obj1, obj2):
        y = self._model([obj1.vector, obj2.vector])
        return float(y[0])
```
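To make the delegation pattern concrete, here is a minimal pure-Python sketch of how a method can check a hooks dict before falling back to its default behavior. `MiniDoc` and `constant_similarity_component` are hypothetical names used for illustration only; this is not spaCy's implementation, and the cosine fallback is an assumption of the sketch.

```python
import math


class MiniDoc:
    """Toy stand-in for a Doc (hypothetical, not spaCy's implementation)."""

    def __init__(self, vector):
        self.vector = vector
        self.user_hooks = {}

    def similarity(self, other):
        # Delegate to a user-supplied hook if one has been installed
        if "similarity" in self.user_hooks:
            return self.user_hooks["similarity"](self, other)
        # Otherwise fall back to a default: cosine similarity of the vectors
        dot = sum(a * b for a, b in zip(self.vector, other.vector))
        norm_self = math.sqrt(sum(a * a for a in self.vector))
        norm_other = math.sqrt(sum(b * b for b in other.vector))
        return dot / (norm_self * norm_other)


def constant_similarity_component(doc):
    # A pipeline-style callable: install the hook, then return the doc
    doc.user_hooks["similarity"] = lambda obj1, obj2: 0.5
    return doc


doc1 = MiniDoc([1.0, 0.0])
doc2 = MiniDoc([0.0, 1.0])
default_score = doc1.similarity(doc2)  # computed by the fallback
constant_similarity_component(doc1)
hooked_score = doc1.similarity(doc2)  # computed by the hook
```

Note that the hook is stored on the individual object, so in this sketch `doc2` still uses the default until a component installs a hook on it as well.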

## Developing plugins and wrappers {#plugins}

We're very excited about all the new possibilities for community extensions and
plugins in spaCy v2.0, and we can't wait to see what you build with it! To get
you started, here are a few tips, tricks and best
practices. [See here](/universe/?category=pipeline) for examples of other spaCy
extensions.

### Usage ideas {#custom-components-usage-ideas}

- **Adding new features and hooking in models.** For example, a sentiment
  analysis model, or your preferred solution for lemmatization. spaCy's
  built-in tagger, parser and entity recognizer respect annotations that were
  already set on the `Doc` in a previous step of the pipeline.
- **Integrating other libraries and APIs.** For example, your pipeline component
  can write additional information and data directly to the `Doc` or `Token` as
  custom attributes, while making sure no information is lost in the process.
  This can be output generated by other libraries and models, or an external
  service with a REST API.
- **Debugging and logging.** For example, a component which stores and/or
  exports relevant information about the current state of the processed
  document, and which you can insert at any point of your pipeline.

### Best practices {#custom-components-best-practices}

Extensions can claim their own `._` namespace and exist as standalone packages.
If you're developing a tool or library and want to make it easy for others to
use it with spaCy and add it to their pipeline, all you have to do is expose a
function that takes a `Doc`, modifies it and returns it.

- Make sure to choose a **descriptive and specific name** for your pipeline
  component class, and set it as its `name` attribute. Avoid names that are too
  common or likely to clash with built-in or a user's other custom components.
  While it's fine to call your package `"spacy_my_extension"`, avoid component
  names including `"spacy"`, since this can easily lead to confusion.

  ```diff
  + name = "myapp_lemmatizer"
  - name = "lemmatizer"
  ```

- When writing to `Doc`, `Token` or `Span` objects, **use getter functions**
  wherever possible, and avoid setting values explicitly. Tokens and spans don't
  own any data themselves, and they're implemented as C extension classes – so
  you can't usually add new attributes to them like you could with most pure
  Python objects.

  ```diff
  + is_fruit = lambda token: token.text in ("apple", "orange")
  + Token.set_extension("is_fruit", getter=is_fruit)

  - Token.set_extension("is_fruit", default=False)
  - if token.text in ("apple", "orange"):
  -     token._.set("is_fruit", True)
  ```

- Always add your custom attributes to the **global** `Doc`, `Token` or `Span`
  objects, not a particular instance of them. Add the attributes **as early as
  possible**, e.g. in your extension's `__init__` method or in the global scope
  of your module. This means that in the case of namespace collisions, the user
  will see an error immediately, not just when they run their pipeline.

  ```diff
  + from spacy.tokens import Doc
  + def __init__(self, attr="my_attr"):
  +     Doc.set_extension(attr, getter=self.get_doc_attr)

  - def __call__(self, doc):
  -     doc.set_extension("my_attr", getter=self.get_doc_attr)
  ```

- If your extension is setting properties on the `Doc`, `Token` or `Span`,
  include an option to **let the user change those attribute names**. This
  makes it easier to avoid namespace collisions and accommodate users with
  different naming preferences. We recommend adding an `attrs` argument to the
  `__init__` method of your class so you can write the names to class attributes
  and reuse them across your component.

  ```diff
  + Doc.set_extension(self.doc_attr, default="some value")
  - Doc.set_extension("my_doc_attr", default="some value")
  ```

- Ideally, extensions should be **standalone packages** with spaCy and,
  optionally, other packages specified as dependencies. They can freely assign
  to their own `._` namespace, but should stick to that. If your extension's
  only job is to provide a better `.similarity` implementation, and your docs
  state this explicitly, there's no problem with writing to the
  [`user_hooks`](#custom-components-user-hooks) and overwriting spaCy's built-in
  method. However, a third-party extension should **never silently overwrite
  built-ins**, or attributes set by other extensions.

- If you're looking to publish a model that depends on a custom pipeline
  component, you can either **require it** in the model package's dependencies,
  or – if the component is specific and lightweight – choose to **ship it with
  your model package** and add it to the `Language` instance returned by the
  model's `load()` method. For examples of this, check out the implementations
  of spaCy's
  [`load_model_from_init_py`](/api/top-level#util.load_model_from_init_py) and
  [`load_model_from_path`](/api/top-level#util.load_model_from_path) utility
  functions.

  ```diff
  + nlp.add_pipe(my_custom_component)
  + return nlp.from_disk(model_path)
  ```

- Once you're ready to share your extension with others, make sure to **add docs
  and installation instructions** (you can always link to this page for more
  info). Make it easy for others to install and use your extension, for example
  by uploading it to [PyPI](https://pypi.python.org). If you're sharing your
  code on GitHub, don't forget to tag it with
  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
  [`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
  to help people find it. If you post it on Twitter, feel free to tag
  [@spacy_io](https://twitter.com/spacy_io) so we can check it out.

### Wrapping other models and libraries {#wrapping-models-libraries}

Let's say you have a custom entity recognizer that takes a list of strings and
returns their [BILUO tags](/api/annotation#biluo). Given an input like
`["A", "text", "about", "Facebook"]`, it will predict and return
`["O", "O", "O", "U-ORG"]`. To integrate it into your spaCy pipeline and make it
add those entities to the `doc.ents`, you can wrap it in a custom pipeline
component function and pass it the token texts from the `Doc` object received by
the component.

The [`gold.spans_from_biluo_tags`](/api/goldparse#spans_from_biluo_tags)
function is very helpful here, because it takes a `Doc` object and token-based
BILUO tags and returns a sequence of `Span` objects in the `Doc` with added
labels. So all your wrapper has to do is compute the entity spans and overwrite
the `doc.ents`.

> #### How the doc.ents work
>
> When you add spans to the `doc.ents`, spaCy will automatically resolve them
> back to the underlying tokens and set the `Token.ent_type` and `Token.ent_iob`
> attributes. By definition, each token can only be part of one entity, so
> overlapping entity spans are not allowed.

```python
### {highlight="1,6-7"}
import your_custom_entity_recognizer
from spacy.gold import spans_from_biluo_tags

def custom_ner_wrapper(doc):
    words = [token.text for token in doc]
    custom_entities = your_custom_entity_recognizer(words)
    doc.ents = spans_from_biluo_tags(doc, custom_entities)
    return doc
```
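For intuition about what the tag-to-span conversion involves, here is a minimal pure-Python sketch that maps well-formed BILUO tags to `(start, end, label)` token offsets with an exclusive end. The helper name `biluo_to_token_spans` is hypothetical and this is not spaCy's implementation; it assumes the tags are valid and does no error handling.

```python
def biluo_to_token_spans(tags):
    """Convert well-formed BILUO tags to (start, end, label) token spans.

    The end index is exclusive. Hypothetical helper for illustration only.
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            # Token is outside of any entity
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            # Single-token entity
            spans.append((i, i + 1, label))
        elif prefix == "B":
            # First token of a multi-token entity
            start = i
        elif prefix == "L" and start is not None:
            # Last token of a multi-token entity: close the span
            spans.append((start, i + 1, label))
            start = None
        # "I" tags continue the current entity, so nothing to do
    return spans


single = biluo_to_token_spans(["O", "O", "O", "U-ORG"])
multi = biluo_to_token_spans(["B-PERSON", "L-PERSON", "O", "B-GPE", "I-GPE", "L-GPE"])
```

Because every token carries exactly one tag, the resulting spans can never overlap, which matches the constraint on `doc.ents`.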

The `custom_ner_wrapper` can then be added to the pipeline of a blank model
using [`nlp.add_pipe`](/api/language#add_pipe). You can also replace the
existing entity recognizer of a pretrained model with
[`nlp.replace_pipe`](/api/language#replace_pipe).

Here's another example of a custom model, `your_custom_model`, that takes a list
of tokens and returns lists of fine-grained part-of-speech tags, coarse-grained
part-of-speech tags, dependency labels and head token indices. Here, we can use
the [`Doc.from_array`](/api/doc#from_array) method to create a new `Doc` object using
those values. To create a numpy array we need integers, so we can look up the
string labels in the [`StringStore`](/api/stringstore). The
[`doc.vocab.strings.add`](/api/stringstore#add) method comes in handy here,
because it returns the integer ID of the string _and_ makes sure it's added to
the vocab. This is especially important if the custom model uses a different
label scheme than spaCy's default models.

> #### Example: spacy-stanfordnlp
>
> For an example of an end-to-end wrapper for statistical tokenization, tagging
> and parsing, check out
> [`spacy-stanfordnlp`](https://github.com/explosion/spacy-stanfordnlp). It uses
> a very similar approach to the example in this section – the only difference
> is that it fully replaces the `nlp` object instead of providing a pipeline
> component, since it also needs to handle tokenization.

```python
### {highlight="1,9,15-17"}
import your_custom_model
from spacy.symbols import POS, TAG, DEP, HEAD
from spacy.tokens import Doc
import numpy

def custom_model_wrapper(doc):
    words = [token.text for token in doc]
    spaces = [bool(token.whitespace_) for token in doc]
    pos, tags, deps, heads = your_custom_model(words)
    # Convert the strings to integers and add them to the string store
    pos = [doc.vocab.strings.add(label) for label in pos]
    tags = [doc.vocab.strings.add(label) for label in tags]
    deps = [doc.vocab.strings.add(label) for label in deps]
    # Create a new Doc from a numpy array
    attrs = [POS, TAG, DEP, HEAD]
    arr = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
    new_doc = Doc(doc.vocab, words=words, spaces=spaces).from_array(attrs, arr)
    return new_doc
```

<Infobox title="Sentence boundaries and heads" variant="warning">

If you create a `Doc` object with dependencies and heads, spaCy is able to
resolve the sentence boundaries automatically. However, note that the `HEAD`
value used to construct a `Doc` is the token index **relative** to the current
token – e.g. `-1` for the previous token. The CoNLL format typically annotates
heads as `1`-indexed absolute indices with `0` indicating the root. If that's
the case in your annotations, you need to convert them first:

```python
heads = [2, 0, 4, 2, 2]
new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]
```

</Infobox>
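As a sanity check for the head conversion described above, the following sketch wraps it in a function and resolves the relative offsets back to absolute `0`-indexed positions. Both function names are hypothetical, and the input is assumed to use `1`-indexed heads with `0` marking the root.

```python
def conll_heads_to_relative(heads):
    """Convert 1-indexed absolute heads (0 = root) to relative offsets."""
    return [0 if head == 0 else head - (i + 1) for i, head in enumerate(heads)]


def relative_to_absolute(rel_heads):
    """Resolve relative offsets to 0-indexed positions (root points to itself)."""
    return [i + offset for i, offset in enumerate(rel_heads)]


heads = [2, 0, 4, 2, 2]
rel_heads = conll_heads_to_relative(heads)
abs_heads = relative_to_absolute(rel_heads)
```

With the sample annotations, the second token is the root, so its offset is `0` and it resolves to its own index.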

<Infobox title="📖 Advanced usage, serialization and entry points">

For more details on how to write and package custom components, make them
available to spaCy via entry points and implement your own serialization
methods, check out the usage guide on
[saving and loading](/usage/saving-loading).

</Infobox>
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
+								---
 								title: Language Processing Pipelines
 								next: vectors-similarity
 								menu:
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 15:38:03 +00:00
+								  - ['Processing Text', 'processing']
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
+								  - ['How Pipelines Work', 'pipelines']
 								  - ['Custom Components', 'custom-components']
 								  - ['Extension Attributes', 'custom-components-attributes']
 								  - ['Plugins & Wrappers', 'plugins']
 								---
 								import Pipelines101 from 'usage/101/\_pipelines.md'
 								<Pipelines101 />
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 15:38:03 +00:00
+								## Processing text {#processing}
 								When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
 								component** on the `Doc`, in order. It then returns the processed `Doc` that you
 								can work with.
 								```python
-												Remove u-strings and fix formatting [ci skip]

											
										
										
											2019-09-12 14:11:15 +00:00
+								doc = nlp("This is a text")
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 15:38:03 +00:00
+								```
 								When processing large volumes of text, the statistical models are usually more
 								efficient if you let them work on batches of texts. spaCy's
 								[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
 								processed `Doc` objects. The batching is done internally.
 								```diff
-												Remove u-strings and fix formatting [ci skip]

											
										
										
											2019-09-12 14:11:15 +00:00
+								texts = ["This is a text", "These are lots of texts", "..."]
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 15:38:03 +00:00
+								- docs = [nlp(text) for text in texts]
 								+ docs = list(nlp.pipe(texts))
 								```
 								<Infobox title="Tips for efficient processing">
 								- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
 								  buffer them in batches, instead of one-by-one. This is usually much more
 								  efficient.
 								- Only apply the **pipeline components you need**. Getting predictions from the
 								  model that you don't actually need adds up and becomes very inefficient at
 								  scale. To prevent this, use the `disable` keyword argument to disable
 								  components you don't need – either when loading a model, or during processing
 								  with `nlp.pipe`. See the section on
 								  [disabling pipeline components](#disabling) for more details and examples.
 								</Infobox>
 								In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
 								(potentially very large) iterable of texts as a stream. Because we're only
 								accessing the named entities in `doc.ents` (set by the `ner` component), we'll
 								disable all other statistical components (the `tagger` and `parser`) during
 								processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
 								access the named entity predictions:
 								> #### ✏️ Things to try
 								>
 								> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
 								>    empty, because the entity recognizer didn't run.
 								```python
 								### {executable="true"}
 								import spacy
 								texts = [
 								    "Net income was $9.4 million compared to the prior year of $2.7 million.",
 								    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
 								]
 								nlp = spacy.load("en_core_web_sm")
 								for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
 								    # Do something with the doc here
 								    print([(ent.text, ent.label_) for ent in doc.ents])
 								```
 								<Infobox title="Important note" variant="warning">
 								When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
 								[generator](https://realpython.com/introduction-to-python-generators/) that
 								yields `Doc` objects – not a list. So if you want to use it like a list, you'll
 								have to call `list()` on it first:
 								```diff
 								- docs = nlp.pipe(texts)[0]         # will raise an error
 								+ docs = list(nlp.pipe(texts))[0]   # works as expected
 								```
 								</Infobox>
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
+								## How pipelines work {#pipelines}
 								spaCy makes it very easy to create your own pipelines consisting of reusable
 								components – this includes spaCy's default tagger, parser and entity recognizer,
 								but also your own custom processing functions. A pipeline component can be added
 								to an already existing `nlp` object, specified when initializing a `Language`
-												Fix links [ci skip]

											
										
										
											2019-02-17 21:25:50 +00:00
+								class, or defined within a [model package](/usage/saving-loading#models).
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
 								When you load a model, spaCy first consults the model's
-												Fix links [ci skip]

											
										
										
											2019-02-17 21:25:50 +00:00
+								[`meta.json`](/usage/saving-loading#models). The meta typically includes the
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
+								model details, the ID of a language class, and an optional list of pipeline
 								components. spaCy then does the following:
 								> #### meta.json (excerpt)
 								>
 								> ```json
 								> {
 								>   "lang": "en",
-												Improve pipeline model and meta example [ci skip]

											
										
										
											2019-02-24 17:45:39 +00:00
+								>   "name": "core_web_sm",
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
+								>   "description": "Example model for spaCy",
-												Improve pipeline model and meta example [ci skip]

											
										
										
											2019-02-24 17:45:39 +00:00
+								>   "pipeline": ["tagger", "parser", "ner"]
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
+								> }
 								> ```
 . Load the **language class and data** for the given ID via
 								   [`get_lang_class`](/api/top-level#util.get_lang_class) and initialize it. The
 								   `Language` class contains the shared vocabulary, tokenization rules and the
 								   language-specific annotation scheme.
 . Iterate over the **pipeline names** and create each component using
-												minor fix to broken link in documentation (#3819) [ci skip]


											
										
										
											2019-06-04 09:15:35 +00:00
+								   [`create_pipe`](/api/language#create_pipe), which looks them up in
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
+								   `Language.factories`.
 . Add each pipeline component to the pipeline in order, using
 								   [`add_pipe`](/api/language#add_pipe).
 . Make the **model data** available to the `Language` class by calling
-												Fix small issues in the docs [ci skip]

											
										
										
											2019-03-12 21:57:15 +00:00
+								   [`from_disk`](/api/language#from_disk) with the path to the model data
 								   directory.
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
 								So when you call this...
 								```python
-												Improve pipeline model and meta example [ci skip]

											
										
										
											2019-02-24 17:45:39 +00:00
+								nlp = spacy.load("en_core_web_sm")
```

... the model's `meta.json` tells spaCy to use the language `"en"` and the
pipeline `["tagger", "parser", "ner"]`. spaCy will then initialize
`spacy.lang.en.English`, create each pipeline component and add it to the
processing pipeline. It'll then load in the model's data from its data directory
and return the modified `Language` instance for you to use as the `nlp` object.

Fundamentally, a [spaCy model](/models) consists of three components: **the
weights**, i.e. binary data loaded in from a directory, a **pipeline** of
functions called in order, and **language data** like the tokenization rules and
annotation scheme. All of this is specific to each model, and defined in the
model's `meta.json` – for example, a Spanish NER model requires different
weights, language data and pipeline components than an English parsing and
tagging model. This is also why the pipeline state is always held by the
`Language` class. [`spacy.load`](/api/top-level#spacy.load) puts this all
together and returns an instance of `Language` with a pipeline set and access to
the binary data:

```python
### spacy.load under the hood
import spacy

lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"

cls = spacy.util.get_lang_class(lang)   # 1. Get Language class, e.g. English
nlp = cls()                             # 2. Initialize it
for name in pipeline:
    component = nlp.create_pipe(name)   # 3. Create the pipeline components
    nlp.add_pipe(component)             # 4. Add the component to the pipeline
nlp.from_disk(data_path)                # 5. Load in the binary data
```

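To make the role of `meta.json` more concrete, here is a minimal sketch in plain Python that reads the `lang` and `pipeline` settings and builds an ordered list of `(name, component)` tuples. It is a toy stand-in, not spaCy's implementation: `META_JSON`, `TOY_FACTORIES`, `make_component` and `build_pipeline` are hypothetical names, and the components only record that they ran.

```python
import json

# Hypothetical meta.json content, reduced to the two settings discussed here.
META_JSON = '{"lang": "en", "pipeline": ["tagger", "parser", "ner"]}'


def make_component(name):
    """Create a toy component that records its name on the doc."""
    def component(doc):
        doc["annotated_by"].append(name)
        return doc
    return component


# Toy registry mapping string names to component factories (an assumption).
TOY_FACTORIES = {name: make_component for name in ("tagger", "parser", "ner")}


def build_pipeline(meta):
    """Return (name, component) tuples in the order listed in the meta."""
    pipeline = []
    for name in meta["pipeline"]:
        factory = TOY_FACTORIES[name]
        pipeline.append((name, factory(name)))
    return pipeline


meta = json.loads(META_JSON)
pipeline = build_pipeline(meta)
pipe_names = [name for name, _ in pipeline]
print(meta["lang"], pipe_names)
```

The point of the sketch is only that the order of names in the settings determines the order of the components in the pipeline.
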
When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. Since the model data is loaded, the
components can access it to assign annotations to the `Doc` object, and
subsequently to the `Token` and `Span` objects, which are only views of the
`Doc` and don't own any data themselves. All components return the modified
document, which is then processed by the next component in the pipeline.

```python
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence")   # create a Doc from raw text
for name, proc in nlp.pipeline:             # iterate over components in order
    doc = proc(doc)                         # apply each component
```

The current processing pipeline is available as `nlp.pipeline`, which returns a
list of `(name, component)` tuples, or `nlp.pipe_names`, which only returns a
list of human-readable component names.

```python
print(nlp.pipeline)
# [('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>)]
print(nlp.pipe_names)
# ['tagger', 'parser', 'ner']
```

### Built-in pipeline components {#built-in}

spaCy ships with several built-in pipeline components that are also available in
`Language.factories`. This means that you can initialize them by calling
[`nlp.create_pipe`](/api/language#create_pipe) with their string names and
require them in the pipeline settings in your model's `meta.json`.

> #### Usage
>
> ```python
> # Option 1: Import and initialize
> from spacy.pipeline import EntityRuler
> ruler = EntityRuler(nlp)
> nlp.add_pipe(ruler)
>
> # Option 2: Using nlp.create_pipe
> sentencizer = nlp.create_pipe("sentencizer")
> nlp.add_pipe(sentencizer)
> ```

| String name         | Component                                                        | Description                                                                                   |
| ------------------- | ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| `tagger`            | [`Tagger`](/api/tagger)                                          | Assign part-of-speech tags.                                                                   |
| `parser`            | [`DependencyParser`](/api/dependencyparser)                      | Assign dependency labels.                                                                     |
| `ner`               | [`EntityRecognizer`](/api/entityrecognizer)                      | Assign named entities.                                                                        |
| `entity_linker`     | [`EntityLinker`](/api/entitylinker)                              | Assign knowledge base IDs to named entities. Should be added after the entity recognizer.     |
| `textcat`           | [`TextCategorizer`](/api/textcategorizer)                        | Assign text categories.                                                                       |
| `entity_ruler`      | [`EntityRuler`](/api/entityruler)                                | Assign named entities based on pattern rules.                                                 |
| `sentencizer`       | [`Sentencizer`](/api/sentencizer)                                | Add rule-based sentence segmentation without the dependency parse.                            |
| `merge_noun_chunks` | [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) | Merge all noun chunks into a single token. Should be added after the tagger and parser.       |
| `merge_entities`    | [`merge_entities`](/api/pipeline-functions#merge_entities)       | Merge all entities into a single token. Should be added after the entity recognizer.          |
| `merge_subtokens`   | [`merge_subtokens`](/api/pipeline-functions#merge_subtokens)     | Merge subtokens predicted by the parser into single tokens. Should be added after the parser. |

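Several descriptions in the table say that a component "should be added after" another one. As a rough illustration, the following plain-Python sketch checks a list of component names against those ordering notes. The helper `find_order_problems` and the `SHOULD_FOLLOW` mapping are hypothetical and only encode what the table states; this is not a check that spaCy is assumed to perform for you.

```python
# Ordering notes taken from the table: component -> components that
# should already be in the pipeline before it.
SHOULD_FOLLOW = {
    "entity_linker": ["ner"],
    "merge_noun_chunks": ["tagger", "parser"],
    "merge_entities": ["ner"],
    "merge_subtokens": ["parser"],
}


def find_order_problems(pipe_names):
    """Return (component, missing_predecessor) pairs for a list of pipe names."""
    problems = []
    for index, name in enumerate(pipe_names):
        earlier = pipe_names[:index]
        for required in SHOULD_FOLLOW.get(name, []):
            if required not in earlier:
                problems.append((name, required))
    return problems


good_order = ["tagger", "parser", "ner", "merge_entities"]
bad_order = ["merge_entities", "tagger", "parser", "ner"]
good_problems = find_order_problems(good_order)
bad_problems = find_order_problems(bad_order)
print(good_problems)
print(bad_problems)
```

A check like this could be run on `nlp.pipe_names` before processing, if you want to catch ordering mistakes early.
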
### Disabling and modifying pipeline components {#disabling}

If you don't need a particular component of the pipeline – for example, the
tagger or the parser – you can **disable loading** it. This can sometimes make a
big difference and improve loading speed. Disabled component names can be
provided to [`spacy.load`](/api/top-level#spacy.load),
[`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a
list:

```python
### Disable loading
import spacy
from spacy.lang.en import English

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
nlp = English().from_disk("/model", disable=["ner"])
```

In some cases, you do want to load all pipeline components and their weights,
because you need them at different points in your application. However, if you
only need a `Doc` object with named entities, there's no need to run all
pipeline components on it – that can potentially make processing much slower.
Instead, you can use the `disable` keyword argument on
[`nlp.pipe`](/api/language#pipe) to temporarily disable the components **during
processing**:

```python
### Disable for processing
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here, e.g. read its named entities
    print([(ent.text, ent.label_) for ent in doc.ents])
```

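Conceptually, disabling components during processing means they stay in the pipeline but are skipped for that call. The sketch below shows the idea with a toy generator over plain dictionaries. It is an illustration under that assumption, not how `nlp.pipe` is implemented, and `toy_pipe` and `make_recorder` are hypothetical names.

```python
def toy_pipe(texts, pipeline, disable=()):
    """Yield one processed toy doc per text, skipping disabled components."""
    for text in texts:
        doc = {"text": text, "applied": []}
        for name, proc in pipeline:
            if name in disable:
                continue  # component stays in the pipeline but is not run
            doc = proc(doc)
        yield doc


def make_recorder(name):
    """Create a toy component that records its own name on the doc."""
    def component(doc):
        doc["applied"].append(name)
        return doc
    return component


# Hypothetical pipeline of (name, component) tuples.
pipeline = [(name, make_recorder(name)) for name in ("tagger", "parser", "ner")]
docs = list(toy_pipe(["First text", "Second text"], pipeline, disable=["tagger", "parser"]))
applied = [doc["applied"] for doc in docs]
print(applied)
```

Note that the toy `pipeline` list itself is unchanged afterwards, which mirrors the "temporarily" in the description above.
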
If you need to **execute more code** with components disabled – e.g. to reset
the weights or update only some components during training – you can use the
[`nlp.select_pipes`](/api/language#select_pipes) context manager. At the end of
the `with` block, the disabled pipeline components will be restored
automatically. Alternatively, `select_pipes` returns an object that lets you
call its `restore()` method to restore the disabled components when needed. This
can be useful if you want to prevent unnecessary code indentation of large
blocks.

```python
### Disable for block
# 1. Use as a context manager
with nlp.select_pipes(disable=["tagger", "parser"]):
    doc = nlp("I won't be tagged and parsed")
doc = nlp("I will be tagged and parsed")
# 2. Restore manually
disabled = nlp.select_pipes(disable="ner")
doc = nlp("I won't have named entities")
disabled.restore()
```

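The restore behavior described above can be pictured with a small context manager written in plain Python. This is a toy stand-in that operates on a list of `(name, component)` tuples, not spaCy's implementation; `ToyDisabledPipes` and `noop` are hypothetical names.

```python
class ToyDisabledPipes:
    """Remove the named components on creation and put them back on restore."""

    def __init__(self, pipeline, disable):
        self.pipeline = pipeline
        self.original = list(pipeline)  # remember all components and their order
        # Remove the disabled components in place.
        pipeline[:] = [(name, proc) for name, proc in pipeline if name not in disable]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.restore()  # runs at the end of the with block

    def restore(self):
        self.pipeline[:] = self.original


def noop(doc):
    return doc


pipeline = [("tagger", noop), ("parser", noop), ("ner", noop)]

# 1. Use as a context manager
with ToyDisabledPipes(pipeline, disable=["tagger", "parser"]):
    names_inside = [name for name, _ in pipeline]
names_after = [name for name, _ in pipeline]

# 2. Restore manually
disabled = ToyDisabledPipes(pipeline, disable=["ner"])
names_disabled = [name for name, _ in pipeline]
disabled.restore()
names_restored = [name for name, _ in pipeline]
print(names_inside, names_after, names_disabled, names_restored)
```

Keeping a copy of the original list is what allows both styles to restore the components in their original order.
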
If you want to disable all pipes except for one or a few, you can use the
`enable` keyword. Just like the `disable` keyword, it takes a list of pipe
names, or a string defining just one pipe.

```python
# Enable only the parser
with nlp.select_pipes(enable="parser"):
    doc = nlp("I will only be parsed")
```

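Since `enable` disables every pipe that isn't listed, it can be read as the complement of `disable`. The following sketch spells that out in plain Python, assuming exactly the semantics described above; `names_to_disable` is a hypothetical helper, not part of spaCy.

```python
def names_to_disable(pipe_names, enable):
    """Return the names that are not enabled, keeping pipeline order."""
    if isinstance(enable, str):
        enable = [enable]  # a single name may be given as a plain string
    return [name for name in pipe_names if name not in enable]


pipe_names = ["tagger", "parser", "ner"]
disabled_for_parser = names_to_disable(pipe_names, enable="parser")
disabled_for_two = names_to_disable(pipe_names, enable=["tagger", "ner"])
print(disabled_for_parser)
print(disabled_for_two)
```
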
Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
to remove pipeline components from an existing pipeline, the
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
[`replace_pipe`](/api/language#replace_pipe) method to replace them with a
custom component entirely (more details on this in the section on
[custom components](#custom-components)).

```python
nlp.remove_pipe("parser")
nlp.rename_pipe("ner", "entityrecognizer")
nlp.replace_pipe("tagger", my_custom_tagger)
```

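Because the pipeline is exposed as a list of `(name, component)` tuples, the effect of these three methods can be pictured as simple list operations. The sketch below mimics them on a plain Python list. The `toy_*` helpers and `default_component` are hypothetical and only illustrate the resulting names and order, not spaCy's implementation.

```python
def toy_remove_pipe(pipeline, name):
    """Remove the named component and return its (name, component) tuple."""
    index = [pipe_name for pipe_name, _ in pipeline].index(name)
    return pipeline.pop(index)


def toy_rename_pipe(pipeline, old_name, new_name):
    """Give a component a new name without changing its position."""
    index = [pipe_name for pipe_name, _ in pipeline].index(old_name)
    pipeline[index] = (new_name, pipeline[index][1])


def toy_replace_pipe(pipeline, name, component):
    """Swap in a different callable under the same name and position."""
    index = [pipe_name for pipe_name, _ in pipeline].index(name)
    pipeline[index] = (name, component)


def default_component(doc):
    return doc


def my_custom_tagger(doc):
    doc["tagged_by"] = "custom"
    return doc


pipeline = [(name, default_component) for name in ("tagger", "parser", "ner")]
toy_remove_pipe(pipeline, "parser")
toy_rename_pipe(pipeline, "ner", "entityrecognizer")
toy_replace_pipe(pipeline, "tagger", my_custom_tagger)
pipe_names = [name for name, _ in pipeline]
print(pipe_names)
```

In this sketch, an unknown name raises a `ValueError` from `list.index`, so a typo in a component name fails loudly instead of being ignored.
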
<Infobox title="Important note: disabling pipeline components" variant="warning">

Since spaCy v2.0 comes with better support for customizing the processing
pipeline components, the `parser`, `tagger` and `entity` keyword arguments have
been replaced with `disable`, which takes a list of pipeline component names.
This lets you disable pre-defined components when loading a model, or
initializing a `Language` class via [`from_disk`](/api/language#from_disk).

```diff
- nlp = spacy.load('en', tagger=False, entity=False)
- doc = nlp("I don't want parsed", parse=False)
+								+ nlp = spacy.load("en", disable=["ner"])
 								+ nlp.remove_pipe("parser")
-												Remove u-strings and fix formatting [ci skip]

											
										
										
											2019-09-12 14:11:15 +00:00
+								+ doc = nlp("I don't want parsed")
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 18:31:19 +00:00
+								```
 								</Infobox>
 								## Creating custom pipeline components {#custom-components}
 								A component receives a `Doc` object and can modify it – for example, by using
 								the current weights to make a prediction and set some annotation on the
 								document. By adding a component to the pipeline, you'll get access to the `Doc`
 								at any point **during processing** – instead of only being able to modify it
 								afterwards.
 								> #### Example
 								>
 								> ```python
 								> def my_component(doc):
 								>    # do something to the doc here
 								>    return doc
 								> ```
 								| Argument    | Type  | Description                                            |
 								| ----------- | ----- | ------------------------------------------------------ |
 								| `doc`       | `Doc` | The `Doc` object processed by the previous component.  |
 								| **RETURNS** | `Doc` | The `Doc` object processed by this pipeline component. |
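The calling convention in the table can be illustrated without spaCy. The sketch below is a simplified stand-in, not spaCy's implementation: the "doc" is a plain list of strings, and `run_pipeline`, `lowercase` and `drop_punct` are hypothetical names. It shows why every component has to return the `Doc`: the return value is what the next component receives.

```python
def run_pipeline(doc, pipeline):
    """Call each (name, component) pair on the doc, in order."""
    for name, component in pipeline:
        doc = component(doc)
        if doc is None:
            # A component that forgets `return doc` breaks the chain
            raise ValueError(f"Component '{name}' did not return a doc")
    return doc


def lowercase(doc):
    # Stand-in component: lowercase every "token"
    return [token.lower() for token in doc]


def drop_punct(doc):
    # Stand-in component: keep only alphanumeric "tokens"
    return [token for token in doc if token.isalnum()]


pipeline = [("lowercase", lowercase), ("drop_punct", drop_punct)]
result = run_pipeline(["This", "is", "a", "text", "."], pipeline)
print(result)  # ['this', 'is', 'a', 'text']
```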
 								Custom components can be added to the pipeline using the
 								[`add_pipe`](/api/language#add_pipe) method. Optionally, you can either specify
 								a component to add it **before or after**, tell spaCy to add it **first or
 								last** in the pipeline, or define a **custom name**. If no name is set and no
 								`name` attribute is present on your component, the function name is used.
 								> #### Example
 								>
 								> ```python
 								> nlp.add_pipe(my_component)
 								> nlp.add_pipe(my_component, first=True)
 								> nlp.add_pipe(my_component, before="parser")
 								> ```
 								| Argument | Type | Description                                                              |
 								| -------- | ---- | ------------------------------------------------------------------------ |
 								| `last`   | bool | If set to `True`, component is added **last** in the pipeline (default). |
 								| `first`  | bool | If set to `True`, component is added **first** in the pipeline.          |
 								| `before` | str  | String name of component to add the new component **before**.            |
 								| `after`  | str  | String name of component to add the new component **after**.             |
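How the four placement arguments translate into a position can be sketched on a plain list of `(name, component)` tuples. This is an illustration under assumptions, not spaCy's code: `insert_component` and `noop` are hypothetical names, and the error handling is only a guess at reasonable behavior.

```python
def insert_component(pipeline, name, component,
                     before=None, after=None, first=False, last=False):
    """Insert (name, component) into a list of (name, component) tuples."""
    if sum(bool(arg) for arg in (before, after, first, last)) > 1:
        raise ValueError("Set only one of before, after, first or last")
    names = [existing_name for existing_name, _ in pipeline]
    if name in names:
        raise ValueError(f"'{name}' is already in the pipeline")
    entry = (name, component)
    if first:
        pipeline.insert(0, entry)
    elif before is not None:
        pipeline.insert(names.index(before), entry)
    elif after is not None:
        pipeline.insert(names.index(after) + 1, entry)
    else:
        pipeline.append(entry)  # no position given: add last (the default)
    return pipeline


def noop(doc):
    return doc


pipeline = [("tagger", noop), ("parser", noop), ("ner", noop)]
insert_component(pipeline, "before_parser", noop, before="parser")
insert_component(pipeline, "first_step", noop, first=True)
insert_component(pipeline, "after_ner", noop, after="ner")
insert_component(pipeline, "default_last", noop)
print([name for name, _ in pipeline])
# ['first_step', 'tagger', 'before_parser', 'parser', 'ner', 'after_ner', 'default_last']
```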
 								### Example: A simple pipeline component {#custom-components-simple}
 								The following component receives the `Doc` in the pipeline and prints some
 								information about it: the number of tokens, the part-of-speech tags of the
 								tokens and a conditional message based on the document length.
 								> #### ✏️ Things to try
 								>
 								> 1. Add the component first in the pipeline by setting `first=True`. You'll see
 								>    that the part-of-speech tags are empty, because the component now runs
 								>    before the tagger and the tags aren't available yet.
 								> 2. Change the component `name` or remove the `name` argument. You should see
 								>    this change reflected in `nlp.pipe_names`.
 								> 3. Print `nlp.pipeline`. You'll see a list of tuples describing the component
 								>    name and the function that's called on the `Doc` object in the pipeline.
 								```python
 								### {executable="true"}
 								import spacy
 								def my_component(doc):
 								    print(f"After tokenization, this doc has {len(doc)} tokens.")
 								    print("The part-of-speech tags are:", [token.pos_ for token in doc])
 								    if len(doc) < 10:
 								        print("This is a pretty short document.")
 								    return doc
 								nlp = spacy.load("en_core_web_sm")
 								nlp.add_pipe(my_component, name="print_info", last=True)
 								print(nlp.pipe_names)  # ['tagger', 'parser', 'ner', 'print_info']
 								doc = nlp("This is a sentence.")
 								```
 								Of course, you can also wrap your component as a class to allow initializing it
 								with custom settings and hold state within the component. This is useful for
 								**stateful components**, especially ones which **depend on shared data**. In the
 								following example, the custom component `EntityMatcher` can be initialized with
 								the `nlp` object, a terminology list and an entity label. Using the
 								[`PhraseMatcher`](/api/phrasematcher), it then matches the terms in the `Doc`
 								and adds them to the existing entities.
 								<Infobox title="Important note" variant="warning">
 								As of v2.1.0, spaCy ships with the [`EntityRuler`](/api/entityruler), a pipeline
 								component for easy, rule-based named entity recognition. Its implementation is
 								similar to the `EntityMatcher` code shown below, but it includes some additional
 								features like support for phrase patterns and token patterns, handling overlaps
 								with existing entities and pattern export as JSONL.
 								We'll still keep the pipeline component example below, as it works well to
 								illustrate complex components. But if you're planning on using this type of
 								component in your application, you might find the `EntityRuler` more convenient.
 								[See here](/usage/rule-based-matching#entityruler) for more details and
 								examples.
 								</Infobox>
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.matcher import PhraseMatcher
 								from spacy.tokens import Span
 								class EntityMatcher(object):
 								    name = "entity_matcher"
 								    def __init__(self, nlp, terms, label):
 								        patterns = [nlp.make_doc(text) for text in terms]
 								        self.matcher = PhraseMatcher(nlp.vocab)
 								        self.matcher.add(label, None, *patterns)
 								    def __call__(self, doc):
 								        matches = self.matcher(doc)
 								        for match_id, start, end in matches:
 								            span = Span(doc, start, end, label=match_id)
 								            doc.ents = list(doc.ents) + [span]
 								        return doc
 								nlp = spacy.load("en_core_web_sm")
 								terms = ("cat", "dog", "tree kangaroo", "giant sea spider")
 								entity_matcher = EntityMatcher(nlp, terms, "ANIMAL")
 								nlp.add_pipe(entity_matcher, after="ner")
 								print(nlp.pipe_names)  # The components in the pipeline
 								doc = nlp("This is a text about Barack Obama and a tree kangaroo")
 								print([(ent.text, ent.label_) for ent in doc.ents])
 								```
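One caveat when adding spans to existing entities: `doc.ents` does not accept overlapping spans, so a match that overlaps an entity predicted by the `ner` component would raise an error. The sketch below shows one possible way to drop such matches before assigning them. It works on plain `(start, end)` token offsets rather than `Span` objects, and `filter_overlapping` is a hypothetical helper, not part of spaCy.

```python
def filter_overlapping(existing, candidates):
    """Return the candidate (start, end) spans that overlap no accepted span.

    Offsets are token indices with an exclusive end, like Span.start/Span.end.
    """
    taken = list(existing)
    accepted = []
    for start, end in candidates:
        # Two spans are disjoint if one ends before the other starts
        if all(end <= other_start or start >= other_end
               for other_start, other_end in taken):
            accepted.append((start, end))
            taken.append((start, end))
    return accepted


existing = [(5, 7)]              # e.g. an entity covering tokens 5 and 6
candidates = [(6, 8), (9, 11)]   # the first candidate overlaps the entity
print(filter_overlapping(existing, candidates))  # [(9, 11)]
```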
 								### Example: Custom sentence segmentation logic {#component-example1}
 								Let's say you want to implement custom logic to improve spaCy's sentence
 								boundary detection. Currently, sentence segmentation is based on the dependency
 								parse, which doesn't always produce ideal results. The custom logic should
 								therefore be applied **after** tokenization, but _before_ the dependency parsing
 								– this way, the parser can also take advantage of the sentence boundaries.
 								> #### ✏️ Things to try
 								>
 								> 1. Print `[token.dep_ for token in doc]` with and without the custom pipeline
 								>    component. You'll see that the predicted dependency parse changes to match
 								>    the sentence boundaries.
 								> 2. Remove the `else` block. All other tokens will now have `is_sent_start` set
 								>    to `None` (missing value), and the parser will assign sentence
 								>    boundaries in between.
 								```python
 								### {executable="true"}
 								import spacy
 								def custom_sentencizer(doc):
 								    for i, token in enumerate(doc[:-2]):
 								        # Define sentence start if pipe + titlecase token
 								        if token.text == "|" and doc[i+1].is_title:
 								            doc[i+1].is_sent_start = True
 								        else:
 								            # Explicitly set sentence start to False otherwise, to tell
 								            # the parser to leave those tokens alone
 								            doc[i+1].is_sent_start = False
 								    return doc
 								nlp = spacy.load("en_core_web_sm")
 								nlp.add_pipe(custom_sentencizer, before="parser")  # Insert before the parser
 								doc = nlp("This is. A sentence. | This is. Another sentence.")
 								for sent in doc.sents:
 								    print(sent.text)
 								```
 								### Example: Pipeline component for entity matching and tagging with custom attributes {#component-example2}
 								This example shows how to create a spaCy extension that takes a terminology list
 								(in this case, single- and multi-word company names), matches the occurrences in
 								a document, labels them as `ORG` entities, merges the tokens and sets custom
 								`is_tech_org` and `has_tech_org` attributes. For efficient matching, the example
 								uses the [`PhraseMatcher`](/api/phrasematcher) which accepts `Doc` objects as
 								match patterns and works well for large terminology lists. It also ensures your
 								patterns will always match, even when you customize spaCy's tokenization rules.
 								When you call `nlp` on a text, the custom pipeline component is applied to the
 								`Doc`.
 								```python
 								https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_entities.py
 								```
 								Wrapping this functionality in a pipeline component allows you to reuse the
 								module with different settings, and have all pre-processing taken care of when
 								you call `nlp` on your text and receive a `Doc` object.
 								### Adding factories {#custom-components-factories}
 								When spaCy loads a model via its `meta.json`, it will iterate over the
 								`"pipeline"` setting, look up every component name in the internal factories and
 								call [`nlp.create_pipe`](/api/language#create_pipe) to initialize the individual
 								components, like the tagger, parser or entity recognizer. If your model uses
 								custom components, this won't work – so you'll have to tell spaCy **where to
 								find your component**. You can do this by writing to `Language.factories`:
 								```python
 								from spacy.language import Language
 								Language.factories["entity_matcher"] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
 								```
 								You can also ship the above code and your custom component in your packaged
 								model's `__init__.py`, so it's executed when you load your model. The `**cfg`
 								config parameters are passed all the way down from
 								[`spacy.load`](/api/top-level#spacy.load), so you can load the model and its
 								components with custom settings:
 								```python
 								nlp = spacy.load("your_custom_model", terms=["tree kangaroo"], label="ANIMAL")
 								```
 								<Infobox title="Important note" variant="warning">
 								When you load a model via its shortcut or package name, like `en_core_web_sm`,
 								spaCy will import the package and then call its `load()` method. This means that
 								custom code in the model's `__init__.py` will be executed, too. This is **not
 								the case** if you're loading a model from a path containing the model data.
 								Here, spaCy will only read in the `meta.json`. If you want to use custom
 								factories with a model loaded from a path, you need to add them to
 								`Language.factories` _before_ you load the model.
 								</Infobox>
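The factory lookup described in this section can be sketched in plain Python. Everything here is a simplified assumption for illustration: `factories` is an ordinary dictionary standing in for `Language.factories`, `EntityMatcher` is a stub with default arguments, and `create_pipe` and `build_pipeline` are hypothetical functions that only mimic the lookup, not spaCy's loading code.

```python
factories = {}  # stand-in for Language.factories


class EntityMatcher:
    """Stub component that only stores its settings."""
    name = "entity_matcher"

    def __init__(self, nlp, terms=(), label="ENTITY"):
        self.terms = tuple(terms)
        self.label = label

    def __call__(self, doc):
        return doc


# Register the factory under the name used in the "pipeline" setting
factories["entity_matcher"] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)


def create_pipe(nlp, name, **cfg):
    # Look up the factory by name and call it with the config
    if name not in factories:
        raise KeyError(f"Can't find factory for '{name}'")
    return factories[name](nlp, **cfg)


def build_pipeline(nlp, meta, **cfg):
    # Initialize every component listed in the meta, in order
    return [(name, create_pipe(nlp, name, **cfg)) for name in meta["pipeline"]]


meta = {"pipeline": ["entity_matcher"]}
pipeline = build_pipeline(None, meta, terms=["tree kangaroo"], label="ANIMAL")
name, component = pipeline[0]
print(name, component.label, component.terms)
# entity_matcher ANIMAL ('tree kangaroo',)
```

If the name is missing from the dictionary, the lookup fails, which is why the factory has to be registered before the model is loaded.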
 								## Extension attributes {#custom-components-attributes new="2"}
 								As of v2.0, spaCy allows you to set any custom attributes and methods on the
 								`Doc`, `Span` and `Token`, which become available as `Doc._`, `Span._` and
 								`Token._` – for example, `Token._.my_attr`. This lets you store additional
 								information relevant to your application, add new features and functionality to
 								spaCy, and implement your own models trained with other machine learning
 								libraries. It also lets you take advantage of spaCy's data structures and the
 								`Doc` object as the "single source of truth".
 								<Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">
 								Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
 								separation and makes it easier to ensure backwards compatibility. For example,
 								if you've implemented your own `.coref` property and spaCy claims it one day,
 								it'll break your code. Similarly, just by looking at the code, you'll
 								immediately know what's built-in and what's custom – for example,
 								`doc.sentiment` is spaCy, while `doc._.sent_score` isn't.
 								</Accordion>
 								<Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">
 								Extension definitions – the defaults, methods, getters and setters you pass in
 								to `set_extension` – are stored in class attributes on the `Underscore` class.
 								If you write to an extension attribute, e.g. `doc._.hello = True`, the data is
 								stored within the [`Doc.user_data`](/api/doc#attributes) dictionary. To keep the
 								underscore data separate from your other dictionary entries, the string `"._."`
 								is placed before the name, in a tuple.
 								</Accordion>
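The storage scheme described above can be approximated in a few lines. This is a heavily simplified sketch, not spaCy's actual `Underscore` class: `FakeDoc` is a hypothetical stand-in for `Doc`, only default values are supported, and the real key layout may include more fields than the `("._.", name)` tuple assumed here.

```python
class Underscore:
    # Extension definitions live on the class, shared by all instances
    doc_extensions = {}

    def __init__(self, user_data):
        # Bypass our own __setattr__ to store the reference to user_data
        object.__setattr__(self, "_user_data", user_data)

    def __getattr__(self, name):
        if name not in Underscore.doc_extensions:
            raise AttributeError(f"No extension registered for '{name}'")
        default = Underscore.doc_extensions[name]
        # Fall back to the registered default if nothing was written yet
        return self._user_data.get(("._.", name), default)

    def __setattr__(self, name, value):
        if name not in Underscore.doc_extensions:
            raise AttributeError(f"No extension registered for '{name}'")
        # Values are stored in user_data under a tuple key
        self._user_data[("._.", name)] = value


class FakeDoc:
    def __init__(self):
        self.user_data = {}

    @property
    def _(self):
        return Underscore(self.user_data)


# Roughly what Doc.set_extension("hello", default=True) would register
Underscore.doc_extensions["hello"] = True
doc = FakeDoc()
print(doc._.hello)    # True
doc._.hello = False
print(doc.user_data)  # {('._.', 'hello'): False}
```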
 								---
 								There are three main types of extensions, which can be defined using the
 								[`Doc.set_extension`](/api/doc#set_extension),
 								[`Span.set_extension`](/api/span#set_extension) and
 								[`Token.set_extension`](/api/token#set_extension) methods.
 . **Attribute extensions.** Set a default value for an attribute, which can be
 								   overwritten manually at any time. Attribute extensions work like "normal"
 								   variables and are the quickest way to store arbitrary information on a `Doc`,
 								   `Span` or `Token`.
 								   ```python
 								   Doc.set_extension("hello", default=True)
 								    assert doc._.hello
 								    doc._.hello = False
 								   ```
 								2. **Property extensions.** Define a getter and an optional setter function. If
 								   no setter is provided, the extension is immutable. Since the getter and
 								   setter functions are only called when you _retrieve_ the attribute, you can
 								   also access values of previously added attribute extensions. For example, a
 								   `Doc` getter can average over `Token` attributes. For `Span` extensions,
 								   you'll almost always want to use a property – otherwise, you'd have to write
 								   to _every possible_ `Span` in the `Doc` to set up the values correctly.
 								   ```python
 								   Doc.set_extension("hello", getter=get_hello_value, setter=set_hello_value)
 								   assert doc._.hello
 								   doc._.hello = "Hi!"
 								   ```
 								3. **Method extensions.** Assign a function that becomes available as an object
 								   method. Method extensions are always immutable. For more details and
 								   implementation ideas, see
 								   [these examples](/usage/examples#custom-components-attr-methods).
 								   ```python
 								   Doc.set_extension("hello", method=lambda doc, name: f"Hi {name}!")
 								   assert doc._.hello("Bob") == "Hi Bob!"
 								   ```
 								Before you can access a custom extension, you need to register it using the
 								`set_extension` method on the object you want to add it to, e.g. the `Doc`. Keep
 								in mind that extensions are always **added globally** and not just on a
 								particular instance. If an attribute of the same name already exists, or if
 								you're trying to access an attribute that hasn't been registered, spaCy will
 								raise an `AttributeError`.
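As a small sketch of the global registration described above, the snippet below accesses an attribute before and after it's registered. The attribute name `greeting` is made up for illustration. Note that the `Doc` is created _before_ the extension is registered and can still read it afterwards.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Hello", "world"])

# "greeting" (an illustrative name) hasn't been registered yet
try:
    doc._.greeting
    access_failed = False
except AttributeError:
    access_failed = True

# Registration happens on the class, so the existing doc picks it up too
Doc.set_extension("greeting", default="Hi!")
registered_value = doc._.greeting
```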
 								```python
 								### Example
 								from spacy.tokens import Doc, Span, Token
 								fruits = ["apple", "pear", "banana", "orange", "strawberry"]
 								is_fruit_getter = lambda token: token.text in fruits
 								has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])
 								Token.set_extension("is_fruit", getter=is_fruit_getter)
 								Doc.set_extension("has_fruit", getter=has_fruit_getter)
 								Span.set_extension("has_fruit", getter=has_fruit_getter)
 								```
 								> #### Usage example
 								>
 								> ```python
 								> doc = nlp("I have an apple and a melon")
 								> assert doc[3]._.is_fruit      # get Token attributes
 								> assert not doc[0]._.is_fruit
 								> assert doc._.has_fruit        # get Doc attributes
 								> assert doc[1:4]._.has_fruit   # get Span attributes
 								> ```
 								Once you've registered your custom attribute, you can also use the built-in
 								`set`, `get` and `has` methods to modify and retrieve the attributes. This is
 								especially useful if you want to pass in a string instead of calling
 								`doc._.my_attr`.
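For instance, when the attribute name is only known at runtime, the string-based methods could be used like this. This is a minimal sketch: `my_attr` and the stored value are placeholders.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# "my_attr" is a placeholder name; force=True keeps the snippet re-runnable
Doc.set_extension("my_attr", default=None, force=True)
doc = Doc(Vocab(), words=["Hello", "world"])

attr_name = "my_attr"  # e.g. read from a config file
is_registered = doc._.has(attr_name)
doc._.set(attr_name, "some value")
value = doc._.get(attr_name)
```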
 								### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}
 								This example shows the implementation of a pipeline component that fetches
 								country meta data via the [REST Countries API](https://restcountries.eu), sets
 								entity annotations for countries, merges entities into one token and sets custom
 								attributes on the `Doc`, `Span` and `Token` – for example, the capital,
 								latitude/longitude coordinates and even the country flag.
 								```python
 								https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_countries_api.py
 								```
 								In this case, all data can be fetched on initialization in one request. However,
 								if you're working with text that contains incomplete country names, spelling
 								mistakes or foreign-language versions, you could also implement a
 								`like_country`-style getter function that makes a request to the search API
 								endpoint and returns the best-matching result.
 								### User hooks {#custom-components-user-hooks}
 								While it's generally recommended to use the `Doc._`, `Span._` and `Token._`
 								proxies to add your own custom attributes, spaCy offers a few exceptions to
 								allow **customizing the built-in methods** like
 								[`Doc.similarity`](/api/doc#similarity) or [`Doc.vector`](/api/doc#vector) with
 								your own hooks, which can rely on statistical models you train yourself. For
 								instance, you can provide your own on-the-fly sentence segmentation algorithm or
 								document similarity method.
 								Hooks let you customize some of the behaviors of the `Doc`, `Span` or `Token`
 								objects by adding a component to the pipeline. For instance, to customize the
 								[`Doc.similarity`](/api/doc#similarity) method, you can add a component that
 								sets a custom function to `doc.user_hooks['similarity']`. The built-in
 								`Doc.similarity` method will check the `user_hooks` dict, and delegate to your
 								function if you've set one. Similar results can be achieved by setting functions
 								to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
 								> #### Implementation note
 								>
 								> The hooks live on the `Doc` object because the `Span` and `Token` objects are
 								> created lazily, and don't own any data. They just proxy to their parent `Doc`.
 								> This turns out to be convenient here — we only have to worry about installing
 								> hooks in one place.
 								| Name               | Customizes                                                                                                                                                                                                              |
 								| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `user_hooks`       | [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents)                                                                      |
 								| `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) |
 								| `user_span_hooks`  | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root)                     |
 								```python
 								### Add custom similarity hooks
 								class SimilarityModel(object):
 								    def __init__(self, model):
 								        # Callable that takes a list of two vectors
 								        self._model = model
 								    def __call__(self, doc):
 								        doc.user_hooks["similarity"] = self.similarity
 								        doc.user_span_hooks["similarity"] = self.similarity
 								        doc.user_token_hooks["similarity"] = self.similarity
 								        # Pipeline components need to return the Doc
 								        return doc
 								    def similarity(self, obj1, obj2):
 								        y = self._model([obj1.vector, obj2.vector])
 								        return float(y[0])
 								```
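To see the delegation in action without a trained model, here's a minimal sketch that swaps in a simple word-overlap (Jaccard) score. The function names are hypothetical, and the component is called directly on the `Doc` instead of being added to a pipeline.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def jaccard_similarity(obj1, obj2):
    # Toy stand-in for a real model: overlap of lowercased token texts
    words1 = {token.text.lower() for token in obj1}
    words2 = {token.text.lower() for token in obj2}
    return len(words1 & words2) / len(words1 | words2)


def similarity_hook_component(doc):
    # Install the hook and return the Doc, like any pipeline component
    doc.user_hooks["similarity"] = jaccard_similarity
    return doc


vocab = Vocab()
doc1 = similarity_hook_component(Doc(vocab, words=["I", "like", "apples"]))
doc2 = Doc(vocab, words=["I", "like", "pears"])
# Doc.similarity finds the hook on doc1 and delegates to it
score = doc1.similarity(doc2)
```

The hook is looked up on the object whose `similarity` method is called, so it needs to be installed on `doc1` here.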
 								## Developing plugins and wrappers {#plugins}
 								We're very excited about all the new possibilities for community extensions and
 								plugins in spaCy v2.0, and we can't wait to see what you build with it! To get
 								you started, here are a few tips, tricks and best
 								practices. [See here](/universe/?category=pipeline) for examples of other spaCy
 								extensions.
 								### Usage ideas {#custom-components-usage-ideas}
 								- **Adding new features and hooking in models.** For example, a sentiment
 								  analysis model, or your preferred solution for lemmatization. spaCy's
 								  built-in tagger, parser and entity recognizer respect annotations that were
 								  already set on the `Doc` in a previous step of the pipeline.
 								- **Integrating other libraries and APIs.** For example, your pipeline component
 								  can write additional information and data directly to the `Doc` or `Token` as
 								  custom attributes, while making sure no information is lost in the process.
 								  This can be output generated by other libraries and models, or an external
 								  service with a REST API.
 								- **Debugging and logging.** For example, a component which stores and/or
 								  exports relevant information about the current state of the processed
 								  document, and which you can insert at any point of your pipeline.
 								### Best practices {#custom-components-best-practices}
 								Extensions can claim their own `._` namespace and exist as standalone packages.
 								If you're developing a tool or library and want to make it easy for others to
 								use it with spaCy and add it to their pipeline, all you have to do is expose a
 								function that takes a `Doc`, modifies it and returns it.
 								- Make sure to choose a **descriptive and specific name** for your pipeline
 								  component class, and set it as its `name` attribute. Avoid names that are too
 								  common or likely to clash with built-in or a user's other custom components.
 								  While it's fine to call your package `"spacy_my_extension"`, avoid component
 								  names including `"spacy"`, since this can easily lead to confusion.
 								  ```diff
 								  + name = "myapp_lemmatizer"
 								  - name = "lemmatizer"
 								  ```
 								- When writing to `Doc`, `Token` or `Span` objects, **use getter functions**
 								  wherever possible, and avoid setting values explicitly. Tokens and spans don't
 								  own any data themselves, and they're implemented as C extension classes – so
 								  you can't usually add new attributes to them like you could with most pure
 								  Python objects.
 								  ```diff
 								  + is_fruit = lambda token: token.text in ("apple", "orange")
 								  + Token.set_extension("is_fruit", getter=is_fruit)
 								  - token._.set_extension("is_fruit", default=False)
 								  - if token.text in ("apple", "orange"):
 								  -     token._.set("is_fruit", True)
 								  ```
 								- Always add your custom attributes to the **global** `Doc`, `Token` or `Span`
 								  objects, not a particular instance of them. Add the attributes **as early as
 								  possible**, e.g. in your extension's `__init__` method or in the global scope
 								  of your module. This means that in the case of namespace collisions, the user
 								  will see an error immediately, not just when they run their pipeline.
 								  ```diff
 								  + from spacy.tokens import Doc
 								  + def __init__(self, attr="my_attr"):
 								  +     Doc.set_extension(attr, getter=self.get_doc_attr)
 								  - def __call__(self, doc):
 								  -     doc.set_extension("my_attr", getter=self.get_doc_attr)
 								  ```
 								- If your extension is setting properties on the `Doc`, `Token` or `Span`,
 								  include an option to **let the user change those attribute names**. This
 								  makes it easier to avoid namespace collisions and accommodate users with
 								  different naming preferences. We recommend adding an `attrs` argument to the
 								  `__init__` method of your class so you can write the names to class attributes
 								  and reuse them across your component.
 								  ```diff
 								  + Doc.set_extension(self.doc_attr, default="some value")
 								  - Doc.set_extension("my_doc_attr", default="some value")
 								  ```
 								- Ideally, extensions should be **standalone packages** with spaCy and
 								  optionally, other packages specified as a dependency. They can freely assign
 								  to their own `._` namespace, but should stick to that. If your extension's
 								  only job is to provide a better `.similarity` implementation, and your docs
 								  state this explicitly, there's no problem with writing to the
 								  [`user_hooks`](#custom-components-user-hooks) and overwriting spaCy's built-in
 								  method. However, a third-party extension should **never silently overwrite
 								  built-ins**, or attributes set by other extensions.
 								- If you're looking to publish a model that depends on a custom pipeline
 								  component, you can either **require it** in the model package's dependencies,
 								  or – if the component is specific and lightweight – choose to **ship it with
 								  your model package** and add it to the `Language` instance returned by the
 								  model's `load()` method. For examples of this, check out the implementations
 								  of spaCy's
 								  [`load_model_from_init_py`](/api/top-level#util.load_model_from_init_py) and
 								  [`load_model_from_path`](/api/top-level#util.load_model_from_path) utility
 								  functions.
 								  ```diff
 								  + nlp.add_pipe(my_custom_component)
 								  + return nlp.from_disk(model_path)
 								  ```
 								- Once you're ready to share your extension with others, make sure to **add docs
 								  and installation instructions** (you can always link to this page for more
 								  info). Make it easy for others to install and use your extension, for example
 								  by uploading it to [PyPI](https://pypi.python.org). If you're sharing your
 								  code on GitHub, don't forget to tag it with
 								  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
 								  [`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
 								  to help people find it. If you post it on Twitter, feel free to tag
 								  [@spacy_io](https://twitter.com/spacy_io) so we can check it out.
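Several of the practices above can be combined in one component. The following is only a sketch: `FruitDetector`, its `attrs` argument and the attribute names are hypothetical. The extension is registered in `__init__` under a user-configurable name and backed by a getter.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


class FruitDetector:
    """Hypothetical component, used only to illustrate the pattern."""

    # Specific name that's unlikely to clash with other components
    name = "myapp_fruit_detector"

    def __init__(self, fruits, attrs=("has_fruit",)):
        self.fruits = set(fruits)
        # Keep the user-supplied attribute name so it can be reused
        (self.doc_attr,) = attrs
        # Register globally and early, so name clashes surface immediately
        Doc.set_extension(self.doc_attr, getter=self.has_fruit)

    def has_fruit(self, doc):
        return any(token.text in self.fruits for token in doc)

    def __call__(self, doc):
        # Nothing to compute here: the getter does the work on access
        return doc


detector = FruitDetector(["apple", "pear"], attrs=("myapp_has_fruit",))
doc = detector(Doc(Vocab(), words=["I", "have", "an", "apple"]))
result = doc._.get(detector.doc_attr)
```

Reading the attribute via `doc._.get(detector.doc_attr)` means the rest of the code doesn't need to hard-code the name the user picked.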
 								### Wrapping other models and libraries {#wrapping-models-libraries}
 								Let's say you have a custom entity recognizer that takes a list of strings and
 								returns their [BILUO tags](/api/annotation#biluo). Given an input like
 								`["A", "text", "about", "Facebook"]`, it will predict and return
 								`["O", "O", "O", "U-ORG"]`. To integrate it into your spaCy pipeline and make it
 								add those entities to the `doc.ents`, you can wrap it in a custom pipeline
 								component function and pass it the token texts from the `Doc` object received by
 								the component.
 								The [`gold.spans_from_biluo_tags`](/api/goldparse#spans_from_biluo_tags) function is very
 								helpful here, because it takes a `Doc` object and token-based BILUO tags and
 								returns a sequence of `Span` objects in the `Doc` with added labels. So all your
 								wrapper has to do is compute the entity spans and overwrite the `doc.ents`.
 								> #### How the doc.ents work
 								>
 								> When you add spans to the `doc.ents`, spaCy will automatically resolve them
 								> back to the underlying tokens and set the `Token.ent_type` and `Token.ent_iob`
 								> attributes. By definition, each token can only be part of one entity, so
 								> overlapping entity spans are not allowed.
 								```python
 								### {highlight="1,6-7"}
 								import your_custom_entity_recognizer
 								from spacy.gold import spans_from_biluo_tags
 								def custom_ner_wrapper(doc):
 								    words = [token.text for token in doc]
 								    custom_entities = your_custom_entity_recognizer(words)
 								    doc.ents = spans_from_biluo_tags(doc, custom_entities)
 								    return doc
 								```
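To make the conversion step more concrete, here's a pure-Python sketch of how token-based BILUO tags map to entity spans. It is _not_ spaCy's implementation. `biluo_to_token_spans` is a hypothetical helper that returns `(start, end, label)` token offsets and silently skips malformed sequences.

```python
def biluo_to_token_spans(tags):
    """Convert BILUO tags to (start, end, label) tuples with exclusive end."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":  # single-token entity
            spans.append((i, i + 1, label))
        elif prefix == "B":  # first token of a multi-token entity
            start = i
        elif prefix == "L" and start is not None:  # last token
            spans.append((start, i + 1, label))
            start = None
        # "I" tags continue the current entity, so there's nothing to do
    return spans


single = biluo_to_token_spans(["O", "O", "O", "U-ORG"])
multi = biluo_to_token_spans(["B-PER", "I-PER", "L-PER", "O", "U-GPE"])
print(single, multi)
```

The `(start, end)` offsets correspond to a slice like `doc[start:end]`, which is why a wrapper only needs the `Doc` and the tags to produce the spans.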
 								The `custom_ner_wrapper` can then be added to the pipeline of a blank model
 								using [`nlp.add_pipe`](/api/language#add_pipe). You can also replace the
 								existing entity recognizer of a pretrained model with
 								[`nlp.replace_pipe`](/api/language#replace_pipe).
 								Here's another example of a custom model, `your_custom_model`, that takes a list
 								of tokens and returns lists of fine-grained part-of-speech tags, coarse-grained
 								part-of-speech tags, dependency labels and head token indices. Here, we can use
 								the [`Doc.from_array`](/api/doc#from_array) method to create a new `Doc` object using
 								those values. To create a numpy array we need integers, so we can look up the
 								string labels in the [`StringStore`](/api/stringstore). The
 								[`doc.vocab.strings.add`](/api/stringstore#add) method comes in handy here,
 								because it returns the integer ID of the string _and_ makes sure it's added to
 								the vocab. This is especially important if the custom model uses a different
 								label scheme than spaCy's default models.
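The string-to-integer lookup can be tried on its own. This sketch uses a standalone `StringStore` and a made-up label, `my_custom_label`, that spaCy's default models wouldn't know about.

```python
from spacy.strings import StringStore

strings = StringStore()
# add() returns the integer ID and stores the string
label_id = strings.add("my_custom_label")

# The store can now be queried in both directions
label_text = strings[label_id]
same_id = strings["my_custom_label"]
```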
 								> #### Example: spacy-stanfordnlp
 								>
 								> For an example of an end-to-end wrapper for statistical tokenization, tagging
 								> and parsing, check out
 								> [`spacy-stanfordnlp`](https://github.com/explosion/spacy-stanfordnlp). It uses
 								> a very similar approach to the example in this section – the only difference
 								> is that it fully replaces the `nlp` object instead of providing a pipeline
 								> component, since it also needs to handle tokenization.
 								```python
 								### {highlight="1,9,15-17"}
 								import your_custom_model
 								from spacy.symbols import POS, TAG, DEP, HEAD
 								from spacy.tokens import Doc
 								import numpy
 								def custom_model_wrapper(doc):
 								    words = [token.text for token in doc]
 								    spaces = [bool(token.whitespace_) for token in doc]
 								    pos, tags, deps, heads = your_custom_model(words)
 								    # Convert the strings to integers and add them to the string store
 								    pos = [doc.vocab.strings.add(label) for label in pos]
 								    tags = [doc.vocab.strings.add(label) for label in tags]
 								    deps = [doc.vocab.strings.add(label) for label in deps]
 								    # Create a new Doc from a numpy array
 								    attrs = [POS, TAG, DEP, HEAD]
 								    arr = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
 								    new_doc = Doc(doc.vocab, words=words, spaces=spaces).from_array(attrs, arr)
 								    return new_doc
 								```
 								<Infobox title="Sentence boundaries and heads" variant="warning">
 								If you create a `Doc` object with dependencies and heads, spaCy is able to
 								resolve the sentence boundaries automatically. However, note that the `HEAD`
 								value used to construct a `Doc` is the token index **relative** to the current
 								token – e.g. `-1` for the previous token. The CoNLL format typically annotates
 								heads as `1`-indexed absolute indices with `0` indicating the root. If that's
 								the case in your annotations, you need to convert them first:
 								```python
 								heads = [2, 0, 4, 2, 2]
 								new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]
 								```
 								</Infobox>
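As a worked check of the head conversion above, the sketch below wraps the same expression in a function and adds the inverse for a round trip. Both function names are hypothetical.

```python
def absolute_to_relative_heads(heads):
    # heads: 1-indexed absolute indices, with 0 marking the root
    return [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]


def relative_to_absolute_heads(offsets):
    # Inverse mapping: an offset of 0 marks the root
    return [0 if offset == 0 else i + offset + 1 for i, offset in enumerate(offsets)]


heads = [2, 0, 4, 2, 2]
new_heads = absolute_to_relative_heads(heads)
print(new_heads)  # [1, 0, 1, -2, -3]
round_trip = relative_to_absolute_heads(new_heads)
```

Here, the first token points one token ahead (`1`), the second token is the root (`0`) and the last token points three tokens back (`-3`).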
 								<Infobox title="📖 Advanced usage, serialization and entry points">
 								For more details on how to write and package custom components, make them
 								available to spaCy via entry points and implement your own serialization
 								methods, check out the usage guide on
 								[saving and loading](/usage/saving-loading).
 								</Infobox>