💫 Update website (#3285)
<!--- Provide a general summary of your changes in the title. -->
## Description
The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.
This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 18:31:19 +00:00

---
title: What's New in v2.0
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['Summary', 'summary']
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v1.x', 'migrating']
---

We're very excited to finally introduce spaCy v2.0! On this page, you'll find a
summary of the new features, information on the backwards incompatibilities,
including a handy overview of what's been renamed or deprecated. To help you
make the most of v2.0, we also **re-wrote almost all of the usage guides and API
docs**, and added more [real-world examples](/usage/examples). If you're new to
spaCy, or just want to brush up on some NLP basics and the details of the
library, check out the [spaCy 101 guide](/usage/spacy-101) that explains the
most important concepts with examples and illustrations.

## Summary {#summary}

<Grid cols={2}>

<div>

This release features entirely new **deep learning-powered models** for spaCy's
tagger, parser and entity recognizer. The new models are **10× smaller**, **20%
more accurate** and **even cheaper to run** than the previous generation.

We've also made several usability improvements that are particularly helpful for
**production deployments**. spaCy v2 now fully supports the Pickle protocol,
making it easy to use spaCy with [Apache Spark](https://spark.apache.org/). The
string-to-integer mapping is **no longer stateful**, making it easy to reconcile
annotations made in different processes. Models are smaller and use less memory,
and the APIs for serialization are now much more consistent. Custom pipeline
components let you modify the `Doc` at any stage in the pipeline. You can now
also add your own custom attributes, properties and methods to the `Doc`,
`Token` and `Span`.

</div>

<Infobox title="Table of Contents" id="toc">

- [Summary](#summary)
- [New features](#features)
  - [Neural network models](#features-models)
  - [Improved processing pipelines](#features-pipelines)
  - [Text classification](#features-text-classification)
  - [Hash values as IDs](#features-hash-ids)
  - [Improved word vectors support](#features-vectors)
  - [Saving, loading and serialization](#features-serializer)
  - [displaCy visualizer](#features-displacy)
  - [Language data and lazy loading](#features-language)
  - [Revised matcher API and phrase matcher](#features-matcher)
- [Backwards incompatibilities](#incompat)
- [Migrating from spaCy v1.x](#migrating)

</Infobox>

</Grid>

The main usability improvements you'll notice in spaCy v2.0 are around
**defining, training and loading your own models** and components. The new
neural network models make it much easier to train a model from scratch, or
update an existing model with a few examples. In v1.x, the statistical models
depended on the state of the `Vocab`. If you taught the model a new word, you
would have to save and load a lot of data – otherwise the model wouldn't
correctly recall the features of your new example. That's no longer the case.

Due to some clever use of hashing, the statistical models **never change size**,
even as they learn new vocabulary items. The whole pipeline is also now fully
differentiable. Even if you don't have explicitly annotated data, you can update
spaCy using all the **latest deep learning tricks** like adversarial training,
noise contrastive estimation or reinforcement learning.

## New features {#features}

This section contains an overview of the most important **new features and
improvements**. The [API docs](/api) include additional deprecation notes. New
methods and functions that were introduced in this version are marked with the
tag <Tag variant="new">2</Tag>.

### Convolutional neural network models {#features-models}

> #### Example
>
> ```bash
> python -m spacy download en_core_web_sm
> python -m spacy download de_core_news_sm
> python -m spacy download xx_ent_wiki_sm
> ```

spaCy v2.0 features new neural models for tagging, parsing and entity
recognition. The models have been designed and implemented from scratch
specifically for spaCy, to give you an unmatched balance of speed, size and
accuracy. The new models are **10× smaller**, **20% more accurate**, and **even
cheaper to run** than the previous generation.

spaCy v2.0's new neural network models bring significant improvements in
accuracy, especially for English Named Entity Recognition. The new
[`en_core_web_lg`](/models/en#en_core_web_lg) model makes about **25% fewer
mistakes** than the corresponding v1.x model and is within **1% of the current
state-of-the-art**
([Strubell et al., 2017](https://arxiv.org/pdf/1702.02098.pdf)). The v2.0 models
are also cheaper to run at scale, as they require **under 1 GB of memory** per
process.

<Infobox>

**Usage:** [Models directory](/models)

</Infobox>

### Improved processing pipelines {#features-pipelines}

> #### Example
>
> ```python
> # Set custom attributes
> Doc.set_extension("my_attr", default=False)
> Token.set_extension("my_attr", getter=my_token_getter)
> # Custom attributes are available via the ._ namespace
> print(doc._.my_attr, token._.my_attr)
>
> # Add components to the pipeline
> my_component = lambda doc: doc
> nlp.add_pipe(my_component)
> ```

It's now much easier to **customize the pipeline** with your own components:
functions that receive a `Doc` object, modify and return it. Extensions let you
write any **attributes, properties and methods** to the `Doc`, `Token` and
`Span`. You can add data, implement new features, integrate other libraries with
spaCy or plug in your own machine learning models.
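
To make the idea of "functions that receive a `Doc`, modify and return it" more
concrete, here's a minimal, library-free sketch. `ToyDoc` and `ToyPipeline` are
hypothetical stand-ins invented for this illustration, not spaCy classes, and
the `first` argument only loosely mirrors the real `add_pipe` options.

```python
# Toy stand-in for a processing pipeline: an ordered list of (name, component)
# pairs. ToyDoc and ToyPipeline are hypothetical names, not part of spaCy.


class ToyDoc:
    def __init__(self, text):
        self.text = text
        self.words = text.split()
        self.user_data = {}


class ToyPipeline:
    def __init__(self):
        self.pipeline = []  # list of (name, callable) tuples

    def add_pipe(self, component, name=None, first=False):
        name = name or component.__name__
        if any(existing == name for existing, _ in self.pipeline):
            raise ValueError("Component name already in pipeline: " + name)
        entry = (name, component)
        if first:
            self.pipeline.insert(0, entry)
        else:
            self.pipeline.append(entry)

    def __call__(self, text):
        doc = ToyDoc(text)
        for _, component in self.pipeline:
            doc = component(doc)  # each component receives and returns the doc
        return doc


def count_words(doc):
    doc.user_data["n_words"] = len(doc.words)
    return doc


def lowercase(doc):
    doc.words = [word.lower() for word in doc.words]
    return doc


nlp = ToyPipeline()
nlp.add_pipe(count_words)
nlp.add_pipe(lowercase, first=True)  # runs before count_words
doc = nlp("I love Coffee")
```

Because every component has the same signature, the order of the list is the
only thing that determines what each component gets to see.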

![The processing pipeline](../images/pipeline.svg)

<Infobox>

**API:** [`Language`](/api/language),
[`Doc.set_extension`](/api/doc#set_extension),
[`Span.set_extension`](/api/span#set_extension),
[`Token.set_extension`](/api/token#set_extension) **Usage:**
[Processing pipelines](/usage/processing-pipelines) **Code:**
[Pipeline examples](/usage/examples#section-pipeline)

</Infobox>

### Text classification {#features-text-classification}

> #### Example
>
> ```python
> textcat = nlp.create_pipe("textcat")
> nlp.add_pipe(textcat, last=True)
> nlp.begin_training()
> for itn in range(100):
>     for doc, gold in train_data:
>         nlp.update([doc], [gold])
> doc = nlp(u"This is a text.")
> print(doc.cats)
> ```

spaCy v2.0 lets you add text categorization models to spaCy pipelines. The model
supports classification with multiple, non-mutually exclusive labels – so
multiple labels can apply at once. You can change the model architecture rather
easily, but by default, the `TextCategorizer` class uses a convolutional neural
network to assign position-sensitive vectors to each word in the document.
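
To illustrate what "non-mutually exclusive" means for the scores you get back,
here's a small sketch in plain Python. The labels and raw scores are made up,
and scoring each label with its own sigmoid is an assumption for illustration,
one common way to get independent label probabilities rather than a description
of the `TextCategorizer` internals.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    # Subtract the maximum for numerical stability
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["SPORTS", "POLITICS", "FINANCE"]
raw_scores = [2.0, 1.5, -3.0]  # made-up model outputs for one document

# Non-mutually exclusive: every label gets its own independent probability,
# so several labels can be "on" at the same time.
cats = {label: sigmoid(score) for label, score in zip(labels, raw_scores)}

# Mutually exclusive alternative: the probabilities compete and sum to 1.
exclusive = dict(zip(labels, softmax(raw_scores)))
```

With independent scores, both `SPORTS` and `POLITICS` end up above 0.5 here,
which is what you want if a text can belong to more than one category.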

<Infobox>

**API:** [`TextCategorizer`](/api/textcategorizer),
[`Doc.cats`](/api/doc#attributes), [`GoldParse.cats`](/api/goldparse#attributes)
**Usage:** [Training a text classification model](/usage/training#textcat)

</Infobox>

### Hash values instead of integer IDs {#features-hash-ids}

> #### Example
>
> ```python
> doc = nlp(u"I love coffee")
> assert doc.vocab.strings[u"coffee"] == 3197928453018144401
> assert doc.vocab.strings[3197928453018144401] == u"coffee"
>
> beer_hash = doc.vocab.strings.add(u"beer")
> assert doc.vocab.strings[u"beer"] == beer_hash
> assert doc.vocab.strings[beer_hash] == u"beer"
> ```

The [`StringStore`](/api/stringstore) now resolves all strings to hash values
instead of integer IDs. This means that the string-to-int mapping **no longer
depends on the vocabulary state**, making a lot of workflows much simpler,
especially during training. Unlike integer IDs in spaCy v1.x, hash values will
**always match** – even across models. Strings can now be added explicitly using
the new [`StringStore.add`](/api/stringstore#add) method. A token's hash is
available via `token.orth`.
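
The reason hash values "always match" is that they're computed from the string
alone, not from the order in which strings were seen. Here's a toy sketch of
that idea. `ToyStringStore` and `hash_string` are hypothetical names, and
`hashlib.blake2b` is only a stand-in chosen for illustration, so the values
won't match the hashes spaCy produces.

```python
import hashlib


def hash_string(string):
    """Deterministic 64-bit hash of a string (blake2b is an assumption here)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    def __init__(self):
        self._strings = {}  # hash -> string

    def add(self, string):
        key = hash_string(string)
        self._strings[key] = string
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # string -> hash needs no state at all
            return hash_string(key)
        # hash -> string only works if the string was added to this store
        return self._strings[key]


# Two stores that never exchanged any data still agree on the ID
store_a = ToyStringStore()
store_b = ToyStringStore()
store_b.add("tea")  # different history than store_a
coffee_hash = store_a.add("coffee")
```

Note the asymmetry: looking up the hash for a string always works, but going
from a hash back to the string requires that the string was added to that
particular store.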

<Infobox>

**API:** [`StringStore`](/api/stringstore) **Usage:**
[Vocab, hashes and lexemes 101](/usage/spacy-101#vocab)

</Infobox>

### Improved word vectors support {#features-vectors}

> #### Example
>
> ```python
> for word, vector in vector_data:
>     nlp.vocab.set_vector(word, vector)
> nlp.vocab.vectors.from_glove("/path/to/vectors")
> # Keep 10000 unique vectors and remap the rest
> nlp.vocab.prune_vectors(10000)
> nlp.to_disk("/model")
> ```

The new [`Vectors`](/api/vectors) class helps the `Vocab` manage the vectors
assigned to strings, and lets you assign vectors individually, or
[load in GloVe vectors](/usage/vectors-similarity#custom-loading-glove) from a
directory. To help you strike a good balance between coverage and memory usage,
the `Vectors` class lets you map **multiple keys** to the **same row** of the
table. If you're using the [`spacy init-model`](/api/cli#init-model) command to
create a vocabulary, pruning the vectors will be taken care of automatically if
you set the `--prune-vectors` flag. Otherwise, you can use the new
[`Vocab.prune_vectors`](/api/vocab#prune_vectors) method.
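
Here's a rough sketch of how mapping multiple keys to the same row trades
coverage against memory. `ToyVectors` is a hypothetical class written for this
example. It assumes the rows are ordered by importance, so pruning keeps the
first `n_rows` rows and points every other key at the most similar remaining
row, measured by cosine similarity.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectors:
    def __init__(self):
        self.data = []     # one vector per row
        self.key2row = {}  # several keys may point at the same row

    def add(self, key, vector=None, row=None):
        if row is None:
            self.data.append(list(vector))
            row = len(self.data) - 1
        self.key2row[key] = row
        return row

    def __getitem__(self, key):
        return self.data[self.key2row[key]]

    def prune(self, n_rows):
        """Keep the first n_rows rows and remap all other keys."""
        kept = self.data[:n_rows]
        for key, row in list(self.key2row.items()):
            if row >= n_rows:
                old = self.data[row]
                best = max(range(len(kept)), key=lambda i: cosine(kept[i], old))
                self.key2row[key] = best
        self.data = kept


vectors = ToyVectors()
vectors.add("coffee", [1.0, 0.0])
vectors.add("tea", [0.0, 1.0])
vectors.add("espresso", [0.9, 0.1])
# Share the existing row instead of storing a second copy
vectors.add("Coffee", row=vectors.key2row["coffee"])
vectors.prune(2)
```

After pruning, `"espresso"` no longer has its own row, but it still resolves to
a vector: the one for `"coffee"`, which is its closest neighbor in the table.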

<Infobox>

**API:** [`Vectors`](/api/vectors), [`Vocab`](/api/vocab) **Usage:**
[Word vectors and semantic similarity](/usage/vectors-similarity)

</Infobox>

### Saving, loading and serialization {#features-serializer}

> #### Example
>
> ```python
> nlp = spacy.load("en") # shortcut link
> nlp = spacy.load("en_core_web_sm") # package
> nlp = spacy.load("/path/to/en") # unicode path
> nlp = spacy.load(Path("/path/to/en")) # pathlib Path
>
> nlp.to_disk("/path/to/nlp")
> nlp = English().from_disk("/path/to/nlp")
> ```

spaCy's serialization API has been made consistent across classes and objects.
All container classes, i.e. `Language`, `Doc`, `Vocab` and `StringStore`, now
have `to_bytes()`, `from_bytes()`, `to_disk()` and `from_disk()` methods, and
support the Pickle protocol.

The improved `spacy.load` makes loading models easier and more transparent. You
can load a model by supplying its [shortcut link](/usage/models#usage), the name
of an installed [model package](/models) or a path. The `Language` class to
initialize will be determined based on the model's settings. For a blank
language, you can import the class directly, e.g.
`from spacy.lang.en import English` or use
[`spacy.blank()`](/api/top-level#spacy.blank).
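
The four-method pattern is easy to picture with a toy class. The sketch below
is not how spaCy serializes its objects: `ToyContainer` is a hypothetical
name, and JSON is used as the byte format purely to keep the example short.
What it shows is the shape of the API, where the disk methods reuse the bytes
methods and the `from_*` methods return the object so calls can be chained.

```python
import json
import tempfile
from pathlib import Path


class ToyContainer:
    def __init__(self, data=None):
        self.data = dict(data or {})

    def to_bytes(self):
        return json.dumps(self.data, sort_keys=True).encode("utf8")

    def from_bytes(self, bytes_data):
        self.data = json.loads(bytes_data.decode("utf8"))
        return self  # allows ToyContainer().from_bytes(...)

    def to_disk(self, path):
        Path(path).write_bytes(self.to_bytes())

    def from_disk(self, path):
        return self.from_bytes(Path(path).read_bytes())


original = ToyContainer({"lang": "en", "pipeline": ["tagger", "parser"]})

# Round trip via bytes
from_bytes = ToyContainer().from_bytes(original.to_bytes())

# Round trip via a file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "container.json"
    original.to_disk(path)
    from_disk = ToyContainer().from_disk(path)
```

Loading always happens into an existing object, which is why the examples on
this page create a blank instance first and then call `from_disk` on it.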

<Infobox>

**API:** [`spacy.load`](/api/top-level#spacy.load),
[`Language.to_disk`](/api/language#to_disk) **Usage:**
[Models](/usage/models#usage),
[Saving and loading](/usage/saving-loading#models)

</Infobox>

### displaCy visualizer with Jupyter support {#features-displacy}

> #### Example
>
> ```python
> from spacy import displacy
> doc = nlp(u"This is a sentence about Facebook.")
> displacy.serve(doc, style="dep") # run the web server
> html = displacy.render(doc, style="ent") # generate HTML
> ```

Our popular dependency and named entity visualizers are now an official part of
the spaCy library. displaCy can run a simple web server, or generate raw HTML
markup or SVG files to be exported. You can pass in one or more docs, and
customize the style. displaCy also auto-detects whether you're running
[Jupyter](https://jupyter.org) and will render the visualizations in your
notebook.

<Infobox>

**API:** [`displacy`](/api/top-level#displacy) **Usage:**
[Visualizing spaCy](/usage/visualizers)

</Infobox>

### Improved language data and lazy loading {#features-language}

Language-specific data now lives in its own submodule, `spacy.lang`. Languages
are lazy-loaded, i.e. only loaded when you import a `Language` class, or load a
model that initializes one. This allows languages to contain more custom data,
e.g. lemmatizer lookup tables, or complex regular expressions. The language data
has also been tidied up and simplified. spaCy now also supports simple
lookup-based lemmatization – and **many new languages**!

<Infobox>

**API:** [`Language`](/api/language) **Code:**
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang)
**Usage:** [Adding languages](/usage/adding-languages)

</Infobox>

### Revised matcher API and phrase matcher {#features-matcher}

> #### Example
>
> ```python
> from spacy.matcher import Matcher, PhraseMatcher
>
> matcher = Matcher(nlp.vocab)
> matcher.add("HEARTS", None, [{"ORTH": "❤️", "OP": "+"}])
>
> phrasematcher = PhraseMatcher(nlp.vocab)
> phrasematcher.add("OBAMA", None, nlp(u"Barack Obama"))
> ```

Patterns can now be added to the matcher by calling
[`matcher.add()`](/api/matcher#add) with a match ID, an optional callback
function to be invoked on each match, and one or more patterns. This allows you
to write powerful, pattern-specific logic using only one matcher. For example,
you might only want to merge some entity types, and set custom flags for other
matched patterns. The new [`PhraseMatcher`](/api/phrasematcher) lets you
efficiently match very large terminology lists using `Doc` objects as match
patterns.
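
To show how one matcher can carry pattern-specific logic, here's a toy version
of the "match ID, optional callback, one or more patterns" calling convention.
`ToyMatcher` is a hypothetical class that only does exact matching on lists of
strings, so it says nothing about how the real matcher works internally. The
callback receives the matcher, the tokens, the index of the current match and
the list of all matches.

```python
class ToyMatcher:
    def __init__(self):
        self._patterns = {}   # match ID -> list of token-sequence patterns
        self._callbacks = {}  # match ID -> optional callback

    def add(self, key, on_match, *patterns):
        self._patterns.setdefault(key, []).extend(patterns)
        self._callbacks[key] = on_match

    def __call__(self, tokens):
        matches = []
        for key, patterns in self._patterns.items():
            for pattern in patterns:
                size = len(pattern)
                for start in range(len(tokens) - size + 1):
                    if tokens[start:start + size] == pattern:
                        matches.append((key, start, start + size))
        # Invoke the callback registered for each match, if there is one
        for i, (key, _, _) in enumerate(matches):
            on_match = self._callbacks[key]
            if on_match is not None:
                on_match(self, tokens, i, matches)
        return matches


merged = []


def merge_entity(matcher, tokens, i, matches):
    key, start, end = matches[i]
    merged.append(" ".join(tokens[start:end]))


matcher = ToyMatcher()
# Only PERSON matches trigger the callback; GREETING matches are just returned
matcher.add("PERSON", merge_entity, ["Barack", "Obama"], ["Angela", "Merkel"])
matcher.add("GREETING", None, ["hello"])
tokens = "hello Barack Obama and Angela Merkel".split()
matches = matcher(tokens)
```

Registering the callback per match ID is what lets a single pass over the text
do different things for different kinds of matches.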

<Infobox>

**API:** [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher)
**Usage:** [Rule-based matching](/usage/rule-based-matching)

</Infobox>

## Backwards incompatibilities {#incompat}

The following modules, classes and methods have changed between v1.x and v2.0.

| Old | New |
| --- | --- |
| `spacy.download.en`, `spacy.download.de` | [`cli.download`](/api/cli#download) |
| `spacy.en` etc. | `spacy.lang.en` etc. |
| `spacy.en.word_sets` | `spacy.lang.en.stop_words` |
| `spacy.orth` | `spacy.lang.xx.lex_attrs` |
| `spacy.syntax.iterators` | `spacy.lang.xx.syntax_iterators` |
| `spacy.tagger.Tagger` | `spacy.pipeline.Tagger` |
| `spacy.cli.model` | [`spacy.cli.vocab`](/api/cli#vocab) |
| `Language.save_to_directory` | [`Language.to_disk`](/api/language#to_disk) |
| `Language.end_training` | [`Language.begin_training`](/api/language#begin_training) |
| `Language.create_make_doc` | [`Language.tokenizer`](/api/language#attributes) |
| `Vocab.resize_vectors` | [`Vectors.resize`](/api/vectors#resize) |
| `Vocab.load` `Vocab.load_lexemes` | [`Vocab.from_disk`](/api/vocab#from_disk) [`Vocab.from_bytes`](/api/vocab#from_bytes) |
| `Vocab.dump` | [`Vocab.to_disk`](/api/vocab#to_disk) [`Vocab.to_bytes`](/api/vocab#to_bytes) |
| `Vocab.load_vectors` `Vocab.load_vectors_from_bin_loc` | [`Vectors.from_disk`](/api/vectors#from_disk) [`Vectors.from_bytes`](/api/vectors#from_bytes) [`Vectors.from_glove`](/api/vectors#from_glove) |
| `Vocab.dump_vectors` | [`Vectors.to_disk`](/api/vectors#to_disk) [`Vectors.to_bytes`](/api/vectors#to_bytes) |
| `StringStore.load` | [`StringStore.from_disk`](/api/stringstore#from_disk) [`StringStore.from_bytes`](/api/stringstore#from_bytes) |
| `StringStore.dump` | [`StringStore.to_disk`](/api/stringstore#to_disk) [`StringStore.to_bytes`](/api/stringstore#to_bytes) |
| `Tokenizer.load` | [`Tokenizer.from_disk`](/api/tokenizer#from_disk) [`Tokenizer.from_bytes`](/api/tokenizer#from_bytes) |
| `Tagger.load` | [`Tagger.from_disk`](/api/tagger#from_disk) [`Tagger.from_bytes`](/api/tagger#from_bytes) |
| `Tagger.tag_names` | `Tagger.labels` |
| `DependencyParser.load` | [`DependencyParser.from_disk`](/api/dependencyparser#from_disk) [`DependencyParser.from_bytes`](/api/dependencyparser#from_bytes) |
| `EntityRecognizer.load` | [`EntityRecognizer.from_disk`](/api/entityrecognizer#from_disk) [`EntityRecognizer.from_bytes`](/api/entityrecognizer#from_bytes) |
| `Matcher.load` | - |
| `Matcher.add_pattern` `Matcher.add_entity` | [`Matcher.add`](/api/matcher#add) [`PhraseMatcher.add`](/api/phrasematcher#add) |
| `Matcher.get_entity` | [`Matcher.get`](/api/matcher#get) |
| `Matcher.has_entity` | [`Matcher.has_key`](/api/matcher#has_key) |
| `Doc.read_bytes` | [`Doc.to_bytes`](/api/doc#to_bytes) [`Doc.from_bytes`](/api/doc#from_bytes) [`Doc.to_disk`](/api/doc#to_disk) [`Doc.from_disk`](/api/doc#from_disk) |
| `Token.is_ancestor_of` | [`Token.is_ancestor`](/api/token#is_ancestor) |

### Deprecated {#deprecated}

The following methods are deprecated. They can still be used, but should be
replaced.

| Old | New |
| --- | --- |
| `Tokenizer.tokens_from_list` | [`Doc`](/api/doc) |
| `Span.sent_start` | [`Span.is_sent_start`](/api/span#is_sent_start) |

## Migrating from spaCy v1.x {#migrating}

Because we've made so many architectural changes to the library, we've tried to
**keep breaking changes to a minimum**. A lot of projects follow the philosophy
that if you're going to break anything, you may as well break everything. We
think migration is easier if there's a logic to what has changed. We've
therefore followed a policy of avoiding breaking changes to the `Doc`, `Span`
and `Token` objects. This way, you can focus on only migrating the code that
does training, loading and serialization – in other words, code that works with
the `nlp` object directly. Code that uses the annotations should continue to
work.

<Infobox title="Important note" variant="warning">

If you've trained your own models, keep in mind that your training and runtime
inputs must match. This means you'll have to **retrain your models** with spaCy
v2.0.

</Infobox>

### Document processing {#migrating-document-processing}

The [`Language.pipe`](/api/language#pipe) method allows spaCy to batch
documents, which brings a **significant performance advantage** in v2.0. The new
neural networks introduce some overhead per batch, so if you're processing a
number of documents in a row, you should use `nlp.pipe` and process the texts as
a stream.

```diff
- docs = (nlp(text) for text in texts)

+ docs = nlp.pipe(texts)
```

To make usage easier, there's now a boolean `as_tuples` keyword argument that
lets you pass in an iterator of `(text, context)` pairs, so you can get back an
iterator of `(doc, context)` tuples.
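
If it's not obvious why batching and `as_tuples` go together, the following
sketch may help. It's a simplified stand-in, not the real `Language.pipe`:
`toy_pipe`, `toy_process_batch` and `minibatch` are hypothetical helpers, and
"processing" here just splits the text on whitespace. The point is that the
texts are handled one batch at a time, while each context object is carried
along untouched and paired back up with its result.

```python
from itertools import islice


def minibatch(items, size):
    """Yield lists of up to `size` items from any iterable."""
    items = iter(items)
    while True:
        batch = list(islice(items, size))
        if not batch:
            return
        yield batch


def toy_process_batch(texts):
    # Stand-in for the model: one call handles the whole batch
    return [text.split() for text in texts]


def toy_pipe(texts, as_tuples=False, batch_size=2):
    for batch in minibatch(texts, batch_size):
        if as_tuples:
            batch_texts = [text for text, _ in batch]
            contexts = [context for _, context in batch]
            for doc, context in zip(toy_process_batch(batch_texts), contexts):
                yield doc, context
        else:
            for doc in toy_process_batch(batch):
                yield doc


data = [
    ("I love coffee", {"id": 1}),
    ("Tea is fine", {"id": 2}),
    ("Water", {"id": 3}),
]
results = list(toy_pipe(data, as_tuples=True))
plain = list(toy_pipe(["a b", "c"]))
```

Since the function is a generator, nothing is processed until you iterate over
the result, and only one batch needs to be held in memory at a time.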

### Saving, loading and serialization {#migrating-saving-loading}

Double-check all calls to `spacy.load()` and make sure they don't use the `path`
keyword argument. If you're only loading in binary data and not a model package
that can construct its own `Language` class and pipeline, you should now use the
[`Language.from_disk`](/api/language#from_disk) method.

```diff
- nlp = spacy.load("en", path="/model")

+ nlp = spacy.load("/model")
+ nlp = spacy.blank("en").from_disk("/model/data")
```

Review all other code that writes state to disk or bytes. All containers now
share the same, consistent API for saving and loading. Replace saving with
`to_disk()` or `to_bytes()`, and loading with `from_disk()` or `from_bytes()`.

```diff
- nlp.save_to_directory("/model")
- nlp.vocab.dump("/vocab")

+ nlp.to_disk("/model")
+ nlp.vocab.to_disk("/vocab")
```

If you've trained models with input from v1.x, you'll need to **retrain them**
with spaCy v2.0. All previous models will not be compatible with the new
version.
|
|
|
|
|
|
|
|
|
|
### Processing pipelines and language data {#migrating-languages}
|
|
|
|
|
|
|
|
|
|
If you're importing language data or `Language` classes, make sure to change
|
|
|
|
|
your import statements to import from `spacy.lang`. If you've added your own
|
|
|
|
|
custom language, it needs to be moved to `spacy/lang/xx` and adjusted
|
|
|
|
|
accordingly.
|
|
|
|
|
|
|
|
|
|
```diff
|
|
|
|
|
- from spacy.en import English
|
|
|
|
|
|
|
|
|
|
+ from spacy.lang.en import English
|
|
|
|
|
```

If you've been using custom pipeline components, check out the new guide on
[processing pipelines](/usage/processing-pipelines). Pipeline components are now
`(name, func)` tuples. Appending them to the pipeline still works – but the
[`add_pipe`](/api/language#add_pipe) method now makes this much more convenient.
Methods for removing, renaming, replacing and retrieving components have been
added as well. Components can now be disabled by passing a list of their names
to the `disable` keyword argument on load, or by using
[`disable_pipes`](/api/language#disable_pipes) as a method or context manager:

```diff
- nlp = spacy.load("en", tagger=False, entity=False)
- doc = nlp(u"I don't want parsed", parse=False)

+ nlp = spacy.load("en", disable=["tagger", "ner"])
+ with nlp.disable_pipes("parser"):
+     doc = nlp(u"I don't want parsed")
```
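
To make the `(name, func)` representation concrete, here is a toy pipeline in
plain Python. It is not spaCy's implementation: `ToyPipeline`, its `disabled`
helper and the dict standing in for a `Doc` are all hypothetical, and only
illustrate how components stored as named tuples can be inserted and
temporarily switched off.

```python
from contextlib import contextmanager


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: a list of (name, func) tuples."""

    def __init__(self):
        self.pipeline = []

    def add_pipe(self, func, name, first=False):
        # Insert at the front or append to the end of the pipeline.
        if first:
            self.pipeline.insert(0, (name, func))
        else:
            self.pipeline.append((name, func))

    @contextmanager
    def disabled(self, *names):
        # Drop the named components, then restore the full list on exit.
        saved = list(self.pipeline)
        self.pipeline = [(n, f) for n, f in saved if n not in names]
        try:
            yield self
        finally:
            self.pipeline = saved

    def __call__(self, text):
        doc = {"text": text, "applied": []}  # toy stand-in for a Doc
        for name, func in self.pipeline:
            doc = func(doc)
            doc["applied"].append(name)
        return doc


def passthrough(doc):
    # A component receives a doc and returns it, possibly modified.
    return doc


nlp = ToyPipeline()
nlp.add_pipe(passthrough, name="parser")
nlp.add_pipe(passthrough, name="tagger", first=True)

full = nlp("I want this parsed")
with nlp.disabled("parser"):
    partial = nlp("I don't want parsed")
```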

To add spaCy's built-in pipeline components to your pipeline, you can still
import and instantiate them directly – but it's more convenient to use the new
[`create_pipe`](/api/language#create_pipe) method with the component name, i.e.
`'tagger'`, `'parser'`, `'ner'` or `'textcat'`.

```diff
- from spacy.pipeline import Tagger
- tagger = Tagger(nlp.vocab)
- nlp.pipeline.insert(0, tagger)

+ tagger = nlp.create_pipe("tagger")
+ nlp.add_pipe(tagger, first=True)
```

### Training {#migrating-training}

All built-in pipeline components are now subclasses of [`Pipe`](/api/pipe),
fully trainable and serializable, and follow the same API. Instead of updating
the model and telling spaCy when to _stop_, you can now explicitly call
[`begin_training`](/api/language#begin_training), which returns an optimizer you
can pass into the [`update`](/api/language#update) function. While `update`
still accepts sequences of `Doc` and `GoldParse` objects, you can now also pass
in a list of strings and dictionaries describing the annotations. We call this
the ["simple training style"](/usage/training#training-simple-style). This is
also the recommended usage, as it removes one layer of abstraction from the
training.

```diff
- for itn in range(1000):
-     for text, entities in train_data:
-         doc = Doc(text)
-         gold = GoldParse(doc, entities=entities)
-         nlp.update(doc, gold)
- nlp.end_training()
- nlp.save_to_directory("/model")

+ nlp.begin_training()
+ for itn in range(1000):
+     for texts, annotations in train_data:
+         nlp.update(texts, annotations)
+ nlp.to_disk("/model")
```
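
In the simple training style, each example pairs a raw text with a dictionary
of annotations. The snippet below is a small, library-free sketch of that data
shape for entity annotations: the example texts and labels are illustrative,
and `entity_texts` is a hypothetical helper for checking that the character
offsets point at the intended spans before any training happens.

```python
# Each example pairs a raw text with a dict of annotations. Entities are
# (start_char, end_char, label) tuples; the end offset is exclusive.
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def entity_texts(text, annotations):
    """Return the substring each entity annotation points at."""
    return [text[start:end] for start, end, _label in annotations["entities"]]


# Check the offsets up front: a misaligned span is easy to miss otherwise.
spans = [entity_texts(text, annots) for text, annots in TRAIN_DATA]

# Split the pairs into two parallel sequences of texts and annotations.
texts, annotations = zip(*TRAIN_DATA)
```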

### Attaching custom data to the Doc {#migrating-doc}

Previously, you had to create a new container in order to attach custom data to
a `Doc` object. This often required converting the `Doc` objects to and from
arrays. In spaCy v2.0, you can set your own attributes, properties and methods
on the `Doc`, `Token` and `Span` via
[custom extensions](/usage/processing-pipelines#custom-components-attributes).
This means that your application can – and should – only pass around `Doc`
objects and refer to them as the single source of truth.

```diff
- doc = nlp(u"This is a regular doc")
- doc_array = doc.to_array(["ORTH", "POS"])
- doc_with_meta = {"doc_array": doc_array, "meta": get_doc_meta(doc_array)}

+ Doc.set_extension("meta", getter=get_doc_meta)
+ doc_with_meta = nlp(u"This is a doc with meta data")
+ meta = doc_with_meta._.meta
```

If you wrap your extension attributes in a
[custom pipeline component](/usage/processing-pipelines#custom-components), they
will be assigned automatically when you call `nlp` on a text. If your
application assigns custom data to spaCy's container objects, or includes other
utilities that interact with the pipeline, consider moving this logic into its
own extension module.

```diff
- doc = nlp(u"Doc with a standard pipeline")
- meta = get_meta(doc)

+ nlp.add_pipe(meta_component)
+ doc = nlp(u"Doc with a custom pipeline that assigns meta")
+ meta = doc._.meta
```

### Strings and hash values {#migrating-strings}

The change from integer IDs to hash values may not actually affect your code
very much. However, if you're adding strings to the vocab manually, you now need
to call [`StringStore.add`](/api/stringstore#add) explicitly. You can also now
be sure that the string-to-hash mapping will always match across vocabularies.

```diff
- nlp.vocab.strings[u"coffee"]       # 3672
- other_nlp.vocab.strings[u"coffee"] # 40259

+ nlp.vocab.strings.add(u"coffee")
+ nlp.vocab.strings[u"coffee"]       # 3197928453018144401
+ other_nlp.vocab.strings[u"coffee"] # 3197928453018144401
```
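
Why hashes match across vocabularies, while sequential integer IDs did not, can
be shown with a toy string store. This is a sketch, not spaCy's `StringStore`:
the `ToyStringStore` class is hypothetical, and it uses a 64-bit BLAKE2b digest
from the standard library in place of spaCy's own hash function, so the numbers
it produces will differ from the ones above.

```python
import hashlib


def hash_string(string):
    # The ID depends only on the string itself, not on insertion order.
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Hypothetical two-way mapping between strings and their hashes."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        key = hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return hash_string(item)  # string -> hash always works
        return self._by_hash[item]  # hash -> string needs a prior add()


strings = ToyStringStore()
other_strings = ToyStringStore()
other_strings.add("tea")  # a different history does not shift any IDs
coffee_hash = strings.add("coffee")
```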

### Adding patterns and callbacks to the matcher {#migrating-matcher}

If you're using the matcher, you can now add patterns in one step. This should
be easy to update – simply merge the ID, callback and patterns into one call to
[`Matcher.add()`](/api/matcher#add). The matcher now also supports string keys,
which saves you an extra import. If you've been using **acceptor functions**,
you'll need to move this logic into the
[`on_match` callbacks](/usage/linguistic-features#on_match). The callback
function is invoked on every match and gives you access to the doc, the index
of the current match and the list of all matches. This lets you accept or
reject the match, and define the actions to be triggered.

```diff
- matcher.add_entity("GoogleNow", on_match=merge_phrases)
- matcher.add_pattern("GoogleNow", [{ORTH: "Google"}, {ORTH: "Now"}])

+ matcher.add("GoogleNow", merge_phrases, [{"ORTH": "Google"}, {"ORTH": "Now"}])
```
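
As a rough illustration of moving acceptor logic into a callback, the sketch
below follows the `(matcher, doc, i, matches)` argument order. Everything else
is a simplification: a list of strings stands in for the `Doc`, string IDs
stand in for match IDs, the manual loop plays the matcher's role, and
`accept_capitalized` is a hypothetical callback name.

```python
accepted = []


def accept_capitalized(matcher, doc, i, matches):
    """Act on match `i` only if every matched token is capitalized."""
    match_id, start, end = matches[i]
    tokens = doc[start:end]
    # The check an acceptor function used to perform now lives in the callback.
    if not all(token[:1].isupper() for token in tokens):
        return  # reject: take no action for this match
    # Accept: trigger whatever action the match should cause.
    accepted.append((match_id, " ".join(tokens)))


# Toy stand-ins: a list of strings for the Doc, (id, start, end) for matches.
doc = ["I", "use", "Google", "Now", "not", "google", "now"]
matches = [("GoogleNow", 2, 4), ("GoogleNow", 5, 7)]
for i in range(len(matches)):
    accept_capitalized(None, doc, i, matches)
```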

If you need to match large terminology lists, you can now also use the
[`PhraseMatcher`](/api/phrasematcher), which accepts `Doc` objects as match
patterns and is more efficient than the regular, rule-based matcher.

```diff
- matcher = Matcher(nlp.vocab)
- matcher.add_entity("PRODUCT")
- for text in large_terminology_list:
-     matcher.add_pattern("PRODUCT", [{ORTH: text}])

+ from spacy.matcher import PhraseMatcher
+ matcher = PhraseMatcher(nlp.vocab)
+ patterns = [nlp.make_doc(text) for text in large_terminology_list]
+ matcher.add("PRODUCT", None, *patterns)
```