add docbin explanation and example

svlandeg 2020-08-06 15:29:44 +02:00
parent 5d417d3b19
commit 881e3f8fd0
2 changed files with 27 additions and 7 deletions


@@ -20,7 +20,7 @@ def create_docbin_reader(
 class Corpus:
     """Iterate Example objects from a file or directory of DocBin (.spacy)
-    formated data files.
+    formatted data files.
     path (Path): The directory or filename to read from.
     gold_preproc (bool): Whether to set up the Example object with gold-standard


@@ -17,12 +17,32 @@ label schemes used in its components, depending on the data it was trained on.
 ### Binary training format {#binary-training new="3"}

-The built-in [`convert`](/api/cli#convert) command helps you convert the
-`.conllu` format used by the
-[Universal Dependencies corpora](https://github.com/UniversalDependencies) as
-well as spaCy's previous [JSON format](#json-input).
+> #### Example
+>
+> ```python
+> from pathlib import Path
+> from spacy.tokens import DocBin
+> from spacy.gold import Corpus
+> output_file = Path(dir) / "output.spacy"
+> data = DocBin(docs=docs).to_bytes()
+> with output_file.open("wb") as file_:
+>     file_.write(data)
+> reader = Corpus(output_file)
+> ```

-<!-- TODO: document DocBin format -->
+The main data format used in spaCy v3 is a binary format created by serializing
+a [`DocBin`](/api/docbin) object, which represents a collection of `Doc`
+objects. Typically, the extension for these binary files is `.spacy`, and they
+are used as input format for specifying a [training corpus](/api/corpus) and for
+spaCy's CLI [`train`](/api/cli#train) command.
+
+This binary format is extremely efficient in storage, especially when packing
+multiple documents together.
+
+The built-in [`convert`](/api/cli#convert) command helps you convert spaCy's
+previous [JSON format](#json-input) to this new `DocBin` format. It also
+supports conversion of the `.conllu` format used by the
+[Universal Dependencies corpora](https://github.com/UniversalDependencies).
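
The `DocBin` round trip described in the documentation added by this commit can be sketched as follows. This is a minimal illustration and not part of the commit: the blank English pipeline, the sample texts, and the temporary output directory are assumptions chosen so that the snippet runs without a trained model.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")

# Assumption: two sample texts standing in for real training documents.
texts = ["This is a sentence.", "Here is another one."]
docs = [nlp(text) for text in texts]

# Pack the Doc objects into a DocBin and serialize the collection to bytes.
doc_bin = DocBin(docs=docs)
data = doc_bin.to_bytes()

# Write the bytes to a file with the conventional .spacy extension.
output_dir = Path(tempfile.mkdtemp())  # hypothetical output location
output_file = output_dir / "output.spacy"
output_file.write_bytes(data)

# Read the file back and rebuild the Doc objects against the same vocab.
restored = DocBin().from_bytes(output_file.read_bytes())
restored_docs = list(restored.get_docs(nlp.vocab))
restored_texts = [doc.text for doc in restored_docs]
```

Writing the `to_bytes()` output to a `.spacy` file mirrors the example added in the commit. The read-back step is included only to show that the serialized file restores the original texts.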
 ### JSON training format {#json-input tag="deprecated"}
@@ -30,7 +50,7 @@ well as spaCy's previous [JSON format](#json-input).
 As of v3.0, the JSON input format is deprecated and is replaced by the
 [binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
-objects to JSON, you can now now serialize them directly using the
+objects to JSON, you can now serialize them directly using the
 [`DocBin`](/api/docbin) container and then use them as input data.
 [`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`