diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index faa6dc850..99612a6bb 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -2,11 +2,11 @@
title: Linguistic Features
next: /usage/rule-based-matching
menu:
- - ['Tokenization', 'tokenization']
- ['POS Tagging', 'pos-tagging']
- ['Dependency Parse', 'dependency-parse']
- ['Named Entities', 'named-entities']
- ['Entity Linking', 'entity-linking']
+ - ['Tokenization', 'tokenization']
- ['Merging & Splitting', 'retokenization']
- ['Sentence Segmentation', 'sbd']
- ['Language data', 'language-data']
@@ -31,8 +31,8 @@ import PosDeps101 from 'usage/101/\_pos-deps.md'
For a list of the fine-grained and coarse-grained part-of-speech tags assigned
-by spaCy's models across different languages, see the
-[POS tag scheme documentation](/api/annotation#pos-tagging).
+by spaCy's models across different languages, see the label schemes documented
+in the [models directory](/models).
@@ -290,8 +290,8 @@ for token in doc:
For a list of the syntactic dependency labels assigned by spaCy's models across
-different languages, see the
-[dependency label scheme documentation](/api/annotation#dependency-parsing).
+different languages, see the label schemes documented in the
+[models directory](/models).
@@ -354,7 +354,7 @@ import NER101 from 'usage/101/\_named-entities.md'
-### Accessing entity annotations {#accessing}
+### Accessing entity annotations and labels {#accessing-ner}
The standard way to access entity annotations is the [`doc.ents`](/api/doc#ents)
property, which produces a sequence of [`Span`](/api/span) objects. The entity
@@ -371,9 +371,17 @@ on a token, it will return an empty string.
> #### IOB Scheme
>
-> - `I` – Token is inside an entity.
-> - `O` – Token is outside an entity.
-> - `B` – Token is the beginning of an entity.
+> - `I` – Token is **inside** an entity.
+> - `O` – Token is **outside** an entity.
+> - `B` – Token is the **beginning** of an entity.
+>
+> #### BILUO Scheme
+>
+> - `B` – Token is the **beginning** of an entity.
+> - `I` – Token is **inside** a multi-token entity.
+> - `L` – Token is the **last** token of a multi-token entity.
+> - `U` – Token is a single-token **unit** entity.
+> - `O` – Token is **outside** an entity.
```python
### {executable="true"}
@@ -492,38 +500,8 @@ responsibility for ensuring that the data is left in a consistent state.
For details on the entity types available in spaCy's pretrained models, see the
-[NER annotation scheme](/api/annotation#named-entities).
-
-
-
-### Training and updating {#updating}
-
-To provide training examples to the entity recognizer, you'll first need to
-create an instance of the [`GoldParse`](/api/goldparse) class. You can specify
-your annotations in a stand-off format or as token tags. If a character offset
-in your entity annotations doesn't fall on a token boundary, the `GoldParse`
-class will treat that annotation as a missing value. This allows for more
-realistic training, because the entity recognizer is allowed to learn from
-examples that may feature tokenizer errors.
-
-```python
-train_data = [
- ("Who is Chaka Khan?", [(7, 17, "PERSON")]),
- ("I like London and Berlin.", [(7, 13, "LOC"), (18, 24, "LOC")]),
-]
-```
-
-```python
-doc = Doc(nlp.vocab, ["rats", "make", "good", "pets"])
-gold = GoldParse(doc, entities=["U-ANIMAL", "O", "O", "O"])
-```
-
-
-
-For more details on **training and updating** the named entity recognizer, see
-the usage guides on [training](/usage/training) or check out the runnable
-[training script](https://github.com/explosion/spaCy/tree/master/examples/training/train_ner.py)
-on GitHub.
+"label scheme" sections of the individual models in the
+[models directory](/models).
@@ -1103,7 +1081,7 @@ In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
-apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align) helper
+apply them to spaCy tokens. spaCy's [`gold.align`](/api/top-level#align) helper
returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the number
of misaligned tokens, the one-to-one mappings of token indices in both
directions and the indices where multiple tokens align to one single token.
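+
+For example, here's a minimal sketch of aligning two tokenizations of the same
+text (the token lists are made up for illustration):
+
+```python
+from spacy.gold import align
+
+other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
+spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
+cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
+print(cost)  # number of misaligned tokens
+print(a2b)   # for each token in other_tokens, the index in spacy_tokens
+```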
diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 6b32dc422..32d6bf7a2 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -1,6 +1,6 @@
---
title: Language Processing Pipelines
-next: vectors-similarity
+next: /usage/vectors-embeddings
menu:
- ['Processing Text', 'processing']
- ['How Pipelines Work', 'pipelines']
@@ -818,14 +818,14 @@ function that takes a `Doc`, modifies it and returns it.
### Wrapping other models and libraries {#wrapping-models-libraries}
Let's say you have a custom entity recognizer that takes a list of strings and
-returns their [BILUO tags](/api/annotation#biluo). Given an input like
-`["A", "text", "about", "Facebook"]`, it will predict and return
+returns their [BILUO tags](/usage/linguistic-features#accessing-ner). Given an
+input like `["A", "text", "about", "Facebook"]`, it will predict and return
`["O", "O", "O", "U-ORG"]`. To integrate it into your spaCy pipeline and make it
add those entities to the `doc.ents`, you can wrap it in a custom pipeline
component function and pass it the token texts from the `Doc` object received by
the component.
-The [`gold.spans_from_biluo_tags`](/api/goldparse#spans_from_biluo_tags) is very
+The [`gold.spans_from_biluo_tags`](/api/top-level#spans_from_biluo_tags) function is very
helpful here, because it takes a `Doc` object and token-based BILUO tags and
returns a sequence of `Span` objects in the `Doc` with added labels. So all your
wrapper has to do is compute the entity spans and overwrite the `doc.ents`.
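+
+A sketch of such a wrapper might look like this (assuming a hypothetical
+`your_custom_model` function that returns one BILUO tag per token text):
+
+```python
+from spacy.gold import spans_from_biluo_tags
+
+def custom_ner_wrapper(doc):
+    words = [token.text for token in doc]
+    custom_entities = your_custom_model(words)  # e.g. ["O", "O", "O", "U-ORG"]
+    doc.ents = spans_from_biluo_tags(doc, custom_entities)
+    return doc
+```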
diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index d0ee44e49..e89e41586 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -1289,10 +1289,9 @@ print([(ent.text, ent.label_, ent._.person_title) for ent in doc.ents])
>
> This example makes extensive use of part-of-speech tag and dependency
> attributes and related `Doc`, `Token` and `Span` methods. For an introduction
-> on this, see the guide on
-> [linguistic features](http://localhost:8000/usage/linguistic-features/). Also
-> see the [annotation specs](/api/annotation#pos-tagging) for details on the
-> label schemes.
+> on this, see the guide on [linguistic features](/usage/linguistic-features/).
+> Also see the label schemes in the [models directory](/models) for details on
+> the labels.
Let's say you want to parse professional biographies and extract the person
names and company names, and whether it's a company they're _currently_ working
diff --git a/website/docs/usage/spacy-101.md b/website/docs/usage/spacy-101.md
index aa8aa59af..3c4e85a7d 100644
--- a/website/docs/usage/spacy-101.md
+++ b/website/docs/usage/spacy-101.md
@@ -249,7 +249,7 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'
To learn more about word vectors, how to **customize them** and how to load
**your own vectors** into spaCy, see the usage guide on
-[using word vectors and semantic similarities](/usage/vectors-similarity).
+[using word vectors and semantic similarities](/usage/vectors-embeddings).
@@ -712,7 +712,7 @@ not available in the live demo).
-**Usage:** [Word vectors and similarity](/usage/vectors-similarity)
+**Usage:** [Word vectors and similarity](/usage/vectors-embeddings)
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index 6fa0b3d8e..53b713f98 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -10,9 +10,7 @@ menu:
- ['Internal API', 'api']
---
-
-
-## Introduction to training models {#basics}
+## Introduction to training models {#basics hidden="true"}
import Training101 from 'usage/101/\_training.md'
@@ -33,10 +31,13 @@ ready-to-use spaCy models.
## Training CLI & config {#cli-config}
+
+
The recommended way to train your spaCy models is via the
[`spacy train`](/api/cli#train) command on the command line.
-1. The **training data** in spaCy's binary format created using
+1. The **training data** in spaCy's
+ [binary format](/api/data-formats#binary-training) created using
[`spacy convert`](/api/cli#convert).
2. A `config.cfg` **configuration file** with all settings and hyperparameters.
3. An optional **Python file** to register
@@ -44,9 +45,13 @@ The recommended way to train your spaCy models is via the
-### Training data format {#data-format}
+
-
+Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
+sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
+mattis pretium.
+
+
> #### Tip: Debug your data
>
@@ -158,6 +163,14 @@ dropout = null
+
+
+Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
+sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
+mattis pretium.
+
+
+
### Training with custom code
@@ -168,6 +181,14 @@ dropout = null
+
+
+Try out a BERT-based model pipeline using this project template: swap in your
+data, edit the settings and hyperparameters and train, evaluate, package and
+visualize your model.
+
+
+
### Pretraining with spaCy {#pretraining}
@@ -176,6 +197,14 @@ dropout = null
+
+
+Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus interdum
+sodales lectus, ut sodales orci ullamcorper id. Sed condimentum neque ut erat
+mattis pretium.
+
+
+
## Internal training API {#api}
@@ -259,5 +288,5 @@ The [`nlp.update`](/api/language#update) method takes the following arguments:
Instead of writing your own training loop, you can also use the built-in
[`train`](/api/cli#train) command, which expects data in spaCy's
-[JSON format](/api/annotation#json-input). On each epoch, a model will be saved
-out to the directory.
+[JSON format](/api/data-formats#json-input). On each epoch, a model will be
+saved out to the directory.
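+
+For reference, a compressed sketch of such a loop using the v2.x-style
+`nlp.update` API (with made-up example data; see the API docs for the current
+signature):
+
+```python
+import random
+import spacy
+
+nlp = spacy.blank("en")
+ner = nlp.create_pipe("ner")
+nlp.add_pipe(ner)
+ner.add_label("PERSON")
+train_data = [("Who is Chaka Khan?", {"entities": [(7, 17, "PERSON")]})]
+
+optimizer = nlp.begin_training()
+for epoch in range(10):
+    random.shuffle(train_data)
+    losses = {}
+    for text, annotations in train_data:
+        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
+    print(losses)
+```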
diff --git a/website/docs/usage/v2-2.md b/website/docs/usage/v2-2.md
index 19a0434fb..dd7325a9c 100644
--- a/website/docs/usage/v2-2.md
+++ b/website/docs/usage/v2-2.md
@@ -351,7 +351,7 @@ check if all of your models are up to date, you can run the
automatically to prevent spaCy from being downloaded and installed again from
pip.
- The built-in
- [`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) converter
+ [`biluo_tags_from_offsets`](/api/top-level#biluo_tags_from_offsets) converter
is now stricter and will raise an error if entities are overlapping (instead
of silently skipping them). If your data contains invalid entity annotations,
make sure to clean it and resolve conflicts. You can now also use the new
@@ -430,7 +430,7 @@ lemma_rules = {"verb": [["ing", ""]]}
#### Converting entity offsets to BILUO tags
If you've been using the
-[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) helper to
+[`biluo_tags_from_offsets`](/api/top-level#biluo_tags_from_offsets) helper to
convert character offsets into token-based BILUO tags, you may now see an error
if the offsets contain overlapping tokens and make it impossible to create a
valid BILUO sequence. This is helpful, because it lets you spot potential
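+
+For example, a quick way to sanity-check offsets against the tokenization (a
+minimal sketch, assuming an `nlp` object is already loaded):
+
+```python
+from spacy.gold import biluo_tags_from_offsets
+
+doc = nlp("I like London.")
+tags = biluo_tags_from_offsets(doc, [(7, 13, "LOC")])
+print(tags)  # ['O', 'O', 'U-LOC', 'O']
+```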
diff --git a/website/docs/usage/v2.md b/website/docs/usage/v2.md
index a2322c3be..59a842968 100644
--- a/website/docs/usage/v2.md
+++ b/website/docs/usage/v2.md
@@ -169,8 +169,8 @@ network to assign position-sensitive vectors to each word in the document.
**API:** [`TextCategorizer`](/api/textcategorizer),
-[`Doc.cats`](/api/doc#attributes), [`GoldParse.cats`](/api/goldparse#attributes)
-**Usage:** [Training a text classification model](/usage/training#textcat)
+[`Doc.cats`](/api/doc#attributes), `GoldParse.cats` **Usage:**
+[Training a text classification model](/usage/training#textcat)
@@ -218,7 +218,7 @@ available via `token.orth`.
The new [`Vectors`](/api/vectors) class helps the `Vocab` manage the vectors
assigned to strings, and lets you assign vectors individually, or
-[load in GloVe vectors](/usage/vectors-similarity#custom-loading-glove) from a
+[load in GloVe vectors](/usage/vectors-embeddings#custom-loading-glove) from a
directory. To help you strike a good balance between coverage and memory usage,
the `Vectors` class lets you map **multiple keys** to the **same row** of the
table. If you're using the [`spacy init-model`](/api/cli#init-model) command to
diff --git a/website/docs/usage/vectors-similarity.md b/website/docs/usage/vectors-embeddings.md
similarity index 95%
rename from website/docs/usage/vectors-similarity.md
rename to website/docs/usage/vectors-embeddings.md
index 9b65bb80a..49b651d9e 100644
--- a/website/docs/usage/vectors-similarity.md
+++ b/website/docs/usage/vectors-embeddings.md
@@ -1,12 +1,13 @@
---
-title: Word Vectors and Semantic Similarity
+title: Word Vectors and Embeddings
menu:
- - ['Basics', 'basics']
- - ['Custom Vectors', 'custom']
- - ['GPU Usage', 'gpu']
+ - ['Word Vectors', 'vectors']
+ - ['Other Embeddings', 'embeddings']
---
-## Basics {#basics hidden="true"}
+
+
+## Word vectors and similarity {#vectors}
> #### Training word vectors
>
@@ -21,7 +22,7 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'
-## Customizing word vectors {#custom}
+### Customizing word vectors {#custom}
Word vectors let you import knowledge from raw text into your model. The
knowledge is represented as a table of numbers, with one row per term in your
@@ -193,7 +194,7 @@ For more details on **adding hooks** and **overwriting** the built-in `Doc`,
-## Storing vectors on a GPU {#gpu}
+### Storing vectors on a GPU {#gpu}
If you're using a GPU, it's much more efficient to keep the word vectors on the
device. You can do that by setting the [`Vectors.data`](/api/vectors#attributes)
@@ -224,3 +225,7 @@ vector_table = numpy.zeros((3, 300), dtype="f")
vectors = Vectors(["dog", "cat", "orange"], vector_table)
vectors.data = torch.Tensor(vectors.data).cuda(0)
```
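+
+The same idea works with CuPy arrays (a sketch, assuming CuPy is installed and
+a GPU is available):
+
+```python
+import cupy
+
+vectors.data = cupy.asarray(vectors.data)
+```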
+
+## Other embeddings {#embeddings}
+
+
diff --git a/website/docs/usage/visualizers.md b/website/docs/usage/visualizers.md
index df4987a62..6b533b739 100644
--- a/website/docs/usage/visualizers.md
+++ b/website/docs/usage/visualizers.md
@@ -130,10 +130,9 @@ If you specify a list of `ents`, only those entity types will be rendered – fo
example, you can choose to display `PERSON` entities. Internally, the visualizer
knows nothing about available entity types and will render whichever spans and
labels it receives. This makes it especially easy to work with custom entity
-types. By default, displaCy comes with colors for all
-[entity types supported by spaCy](/api/annotation#named-entities). If you're
-using custom entity types, you can use the `colors` setting to add your own
-colors for them.
+types. By default, displaCy comes with colors for all entity types used by
+[spaCy models](/models). If you're using custom entity types, you can use the
+`colors` setting to add your own colors for them.
> #### Options example
>
diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json
index 9a0d0fb05..18b14751e 100644
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -18,7 +18,7 @@
{ "text": "Linguistic Features", "url": "/usage/linguistic-features" },
{ "text": "Rule-based Matching", "url": "/usage/rule-based-matching" },
{ "text": "Processing Pipelines", "url": "/usage/processing-pipelines" },
- { "text": "Vectors & Similarity", "url": "/usage/vectors-similarity" },
+ { "text": "Vectors & Embeddings", "url": "/usage/vectors-embeddings" },
{ "text": "Training Models", "url": "/usage/training", "tag": "new" },
{ "text": "spaCy Projects", "url": "/usage/projects", "tag": "new" },
{ "text": "Saving & Loading", "url": "/usage/saving-loading" },
@@ -26,8 +26,10 @@
]
},
{
- "label": "In-depth",
- "items": [{ "text": "Code Examples", "url": "/usage/examples" }]
+ "label": "Resources",
+ "items": [
+ { "text": "Project Templates", "url": "https://github.com/explosion/projects" }
+ ]
}
]
},
@@ -56,7 +58,7 @@
"items": [
{ "text": "Library Architecture", "url": "/api" },
{ "text": "Model Architectures", "url": "/api/architectures" },
- { "text": "Annotation Specs", "url": "/api/annotation" },
+ { "text": "Data Formats", "url": "/api/data-formats" },
{ "text": "Command Line", "url": "/api/cli" },
{ "text": "Functions", "url": "/api/top-level" }
]
diff --git a/website/src/components/copy.js b/website/src/components/copy.js
new file mode 100644
index 000000000..4392273e2
--- /dev/null
+++ b/website/src/components/copy.js
@@ -0,0 +1,48 @@
+import React, { useState, useRef } from 'react'
+
+import Icon from './icon'
+import classes from '../styles/copy.module.sass'
+
+const CopyInput = ({ text, prefix }) => {
+    const isClient = typeof window !== 'undefined'
+    const supportsCopy = isClient && document.queryCommandSupported('copy')
+    const textareaRef = useRef()
+    const [copySuccess, setCopySuccess] = useState(false)
+
+    function copyToClipboard() {
+        if (textareaRef.current && isClient) {
+            textareaRef.current.select()
+            document.execCommand('copy')
+            setCopySuccess(true)
+            textareaRef.current.blur()
+            setTimeout(() => setCopySuccess(false), 1000)
+        }
+    }
+
+    function selectText() {
+        if (textareaRef.current && isClient) {
+            textareaRef.current.select()
+        }
+    }
+
+    return (
+        <div className={classes.root}>
+            {prefix && <span className={classes.prefix}>{prefix}</span>}
+            <textarea
+                ref={textareaRef}
+                readOnly
+                className={classes.textarea}
+                defaultValue={text}
+                rows={1}
+                onClick={selectText}
+            />
+            {supportsCopy && (
+                <button title={copySuccess ? 'Copied!' : 'Copy to clipboard'} onClick={copyToClipboard}>
+                    <Icon width={16} name="clipboard" />
+                </button>
+            )}
+        </div>
+    )
+}
+
+export default CopyInput
diff --git a/website/src/components/icon.js b/website/src/components/icon.js
index 6e9f81d51..e58b047af 100644
--- a/website/src/components/icon.js
+++ b/website/src/components/icon.js
@@ -20,6 +20,7 @@ import { ReactComponent as NeutralIcon } from '../images/icons/neutral.svg'
import { ReactComponent as OfflineIcon } from '../images/icons/offline.svg'
import { ReactComponent as SearchIcon } from '../images/icons/search.svg'
import { ReactComponent as MoonIcon } from '../images/icons/moon.svg'
+import { ReactComponent as ClipboardIcon } from '../images/icons/clipboard.svg'
import classes from '../styles/icon.module.sass'
@@ -43,6 +44,7 @@ const icons = {
offline: OfflineIcon,
search: SearchIcon,
moon: MoonIcon,
+ clipboard: ClipboardIcon,
}
const Icon = ({ name, width, height, inline, variant, className }) => {
diff --git a/website/src/images/icons/clipboard.svg b/website/src/images/icons/clipboard.svg
new file mode 100644
index 000000000..c281f4b13
--- /dev/null
+++ b/website/src/images/icons/clipboard.svg
@@ -0,0 +1,4 @@
+
diff --git a/website/src/styles/copy.module.sass b/website/src/styles/copy.module.sass
new file mode 100644
index 000000000..c6d2f68cb
--- /dev/null
+++ b/website/src/styles/copy.module.sass
@@ -0,0 +1,21 @@
+.root
+ background: var(--color-back)
+ border-radius: 2em
+ border: 1px solid var(--color-subtle)
+ width: 100%
+ padding: 0.25em 1em
+ display: inline-flex
+ margin: var(--spacing-xs) 0
+ font: var(--font-size-code)/var(--line-height-code) var(--font-code)
+ -webkit-font-smoothing: subpixel-antialiased
+ -moz-osx-font-smoothing: auto
+
+.textarea
+ flex: 100%
+ background: transparent
+ resize: none
+ font: inherit
+
+.prefix
+ margin-right: 0.75em
+ color: var(--color-subtle-dark)
diff --git a/website/src/styles/infobox.module.sass b/website/src/styles/infobox.module.sass
index 5fd0ff0d3..2be59f33b 100644
--- a/website/src/styles/infobox.module.sass
+++ b/website/src/styles/infobox.module.sass
@@ -5,7 +5,7 @@
padding: 1.5rem 2rem 0.75rem 1.5rem
border-radius: var(--border-radius)
color: var(--color-dark)
- border-left: 0.75rem solid var(--color-theme)
+ border-left: 0.75rem solid var(--color-subtle-light)
p, pre, ul, ol
margin-bottom: var(--spacing-xs)
@@ -21,6 +21,10 @@
margin-bottom: var(--spacing-xs)
font-size: var(--font-size-md)
+ code
+ font-weight: normal
+ color: inherit
+
.icon
color: var(--color-theme)
vertical-align: baseline
diff --git a/website/src/templates/index.js b/website/src/templates/index.js
index b40f03fee..7f9314d9d 100644
--- a/website/src/templates/index.js
+++ b/website/src/templates/index.js
@@ -32,6 +32,7 @@ import Grid from '../components/grid'
import { YouTube, SoundCloud, Iframe, Image } from '../components/embed'
import Alert from '../components/alert'
import Search from '../components/search'
+import Project from '../widgets/project'
const mdxComponents = {
a: Link,
@@ -73,6 +74,7 @@ const scopeComponents = {
Accordion,
Grid,
InlineCode,
+ Project,
}
const AlertSpace = ({ nightly }) => {
diff --git a/website/src/widgets/project.js b/website/src/widgets/project.js
new file mode 100644
index 000000000..f1c18cf7a
--- /dev/null
+++ b/website/src/widgets/project.js
@@ -0,0 +1,32 @@
+import React from 'react'
+
+import CopyInput from '../components/copy'
+import Infobox from '../components/infobox'
+import Link from '../components/link'
+import { InlineCode } from '../components/code'
+
+// TODO: move to meta?
+const DEFAULT_REPO = 'https://github.com/explosion/projects'
+const COMMAND = 'python -m spacy project clone'
+
+const Project = ({ id, repo, children }) => {
+    const repoArg = repo ? ` --repo ${repo}` : ''
+    const text = `${COMMAND} ${id}${repoArg}`
+    const url = `${repo || DEFAULT_REPO}/${id}`
+    const title = (
+        <>
+            🪐 Get started with a project template:{' '}
+            <Link to={url}>
+                <InlineCode>{id}</InlineCode>
+            </Link>
+        </>
+    )
+    return (
+        <Infobox title={title}>
+            {children}
+            <CopyInput text={text} prefix="$" />
+        </Infobox>
+    )
+}
+
+export default Project