From 9391998c7779081226b65888a08b51798f94b5d2 Mon Sep 17 00:00:00 2001
From: Paul O'Leary McCann <polm@dampfkraft.com>
Date: Tue, 17 Aug 2021 00:37:21 +0900
Subject: [PATCH] Add notes on preparing training data to docs (#8964)

* Add training data section

Not entirely sure this is in the right location on the page - maybe it
should be after quickstart?

* Add pointer from binary format to training data section

* Minor cleanup

* Add to ToC, fix filename

* Update website/docs/usage/training.md

Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move the training data section further down the page

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/usage/training.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Run prettier

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
---
 website/docs/api/data-formats.md |  4 +++
 website/docs/usage/training.md   | 54 ++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)
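
Note for anyone adapting the `preprocess.py` example in this patch:
`Doc.char_span` returns `None` in its default strict mode when character
offsets do not line up with token boundaries, and `doc.ents` rejects
overlapping spans, so it can help to check raw annotations before conversion.
The following is only a sketch in plain Python. `check_annotations` is a
hypothetical helper, not part of spaCy, and its whitespace test is a rough
stand-in for the tokenizer's real boundaries.

```python
def check_annotations(text, annotations):
    """Return a list of problems found in (start, end, label) character offsets."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(annotations):
        # Offsets must describe a non-empty slice inside the text.
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: offsets ({start}, {end}) are out of range")
            continue
        snippet = text[start:end]
        # Whitespace at either edge usually means the span cuts across a
        # token boundary (assumption: a heuristic, not spaCy's tokenizer).
        if snippet != snippet.strip():
            problems.append(f"{label}: {snippet!r} has leading or trailing whitespace")
        # Entity spans must not overlap each other.
        if start < previous_end:
            problems.append(f"{label}: ({start}, {end}) overlaps an earlier span")
        previous_end = max(previous_end, end)
    return problems


training_data = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
for text, annotations in training_data:
    for problem in check_annotations(text, annotations):
        print(problem)
```

On the sample sentence from the docs this prints nothing, since `(0, 11)`
covers exactly `Tokyo Tower`.
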
diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md
index 1bdeb509a..001455f33 100644
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -283,6 +283,10 @@ CLI [`train`](/api/cli#train) command. The built-in
 of the `.conllu` format used by the
 [Universal Dependencies corpora](https://github.com/UniversalDependencies).
 
+Note that while this is the format used to save training data, you do not have
+to understand the internal details to use it or create training data. See the
+section on [preparing training data](/usage/training#training-data).
+
 ### JSON training format {#json-input tag="deprecated"}
 
 <Infobox variant="warning" title="Changed in v3.0">
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index 6deba3761..0fe34f2a2 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -6,6 +6,7 @@ menu:
   - ['Introduction', 'basics']
   - ['Quickstart', 'quickstart']
   - ['Config System', 'config']
+  - ['Training Data', 'training-data']
   - ['Custom Training', 'config-custom']
   - ['Custom Functions', 'custom-functions']
   - ['Initialization', 'initialization']
@@ -355,6 +356,59 @@ that reference this variable.
 
 </Infobox>
 
+## Preparing Training Data {#training-data}
+
+Training data for NLP projects comes in many different formats. For some common
+formats such as CoNLL, spaCy provides [converters](/api/cli#convert) you can use
+from the command line. In other cases you'll have to prepare the training data
+yourself.
+
+When converting training data for use in spaCy, the main thing is to create
+[`Doc`](/api/doc) objects that carry the annotations you want the pipeline to
+predict. For example, if you're creating an NER pipeline, loading your
+annotations and setting them as the `.ents` property on a `Doc` is all you need
+to worry about. On disk the annotations will be saved as a
+[`DocBin`](/api/docbin) in the
+[`.spacy` format](/api/data-formats#binary-training), but the details of that
+are handled automatically.
+
+Here's an example of creating a `.spacy` file from some NER annotations.
+
+```python
+### preprocess.py
+import spacy
+from spacy.tokens import DocBin
+
+nlp = spacy.blank("en")
+training_data = [
+    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
+]
+# the DocBin will store the example documents
+db = DocBin()
+for text, annotations in training_data:
+    doc = nlp(text)
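+    spans = [doc.char_span(s, e, label=label) for s, e, label in annotations]
+    # char_span() returns None when the offsets don't align with token
+    # boundaries; doc.ents can't hold None, so keep only the aligned spans
+    # (here misaligned annotations are skipped; you may prefer to log them)
+    doc.ents = [span for span in spans if span is not None]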
+    db.add(doc)
+db.to_disk("./train.spacy")
+```
+
+For more examples of how to convert training data from a wide variety of formats
+for use with spaCy, look at the preprocessing steps in the
+[tutorial projects](https://github.com/explosion/projects/tree/v3/tutorials).
+
+<Accordion title="What about the spaCy JSON format?" id="json-annotations" spaced>
+
+In spaCy v2, the recommended way to store training data was in
+[a particular JSON format](/api/data-formats#json-input), but in v3 this format
+is deprecated. It's fine as a readable storage format, but there's no need to
+convert your data to JSON before creating a `.spacy` file.
+
+</Accordion>
+
 ## Customizing the pipeline and training {#config-custom}
 
 ### Defining pipeline components {#config-components}