mirror of https://github.com/explosion/spaCy.git
add docbin explanation and example
parent 5d417d3b19
commit 881e3f8fd0
@@ -20,7 +20,7 @@ def create_docbin_reader(
 
 class Corpus:
     """Iterate Example objects from a file or directory of DocBin (.spacy)
-    formated data files.
+    formatted data files.
 
     path (Path): The directory or filename to read from.
     gold_preproc (bool): Whether to set up the Example object with gold-standard
@@ -17,12 +17,32 @@ label schemes used in its components, depending on the data it was trained on.
 
 ### Binary training format {#binary-training new="3"}
 
-The built-in [`convert`](/api/cli#convert) command helps you convert the
-`.conllu` format used by the
-[Universal Dependencies corpora](https://github.com/UniversalDependencies) as
-well as spaCy's previous [JSON format](#json-input).
+> #### Example
+>
+> ```python
+> from pathlib import Path
+> from spacy.tokens import DocBin
+> from spacy.gold import Corpus
+> output_file = Path(dir) / "output.spacy"
+> data = DocBin(docs=docs).to_bytes()
+> with output_file.open("wb") as file_:
+>     file_.write(data)
+> reader = Corpus(output_file)
+> ```
 
-<!-- TODO: document DocBin format -->
+The main data format used in spaCy v3 is a binary format created by serializing
+a [`DocBin`](/api/docbin) object, which represents a collection of `Doc`
+objects. Typically, the extension for these binary files is `.spacy`, and they
+are used as input format for specifying a [training corpus](/api/corpus) and for
+spaCy's CLI [`train`](/api/cli#train) command.
+
+This binary format is extremely efficient in storage, especially when packing
+multiple documents together.
+
+The built-in [`convert`](/api/cli#convert) command helps you convert spaCy's
+previous [JSON format](#json-input) to this new `DocBin` format. It also
+supports conversion of the `.conllu` format used by the
+[Universal Dependencies corpora](https://github.com/UniversalDependencies).
 
 ### JSON training format {#json-input tag="deprecated"}
 
@@ -30,7 +50,7 @@ well as spaCy's previous [JSON format](#json-input).
 
 As of v3.0, the JSON input format is deprecated and is replaced by the
 [binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
-objects to JSON, you can now now serialize them directly using the
+objects to JSON, you can now serialize them directly using the
 [`DocBin`](/api/docbin) container and then use them as input data.
 
 [`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`
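
The docs example added in this commit uses `docs` and `dir` without defining them, so a self-contained version of the same round trip may be useful for review. This is a minimal sketch, assuming a spaCy v3 installation: `roundtrip` and the sample texts are hypothetical names introduced here, and a blank English pipeline stands in for whatever pipeline produced `docs`. It only exercises `DocBin` and leaves out `Corpus`, whose import path (`spacy.gold` in this commit) may differ in other versions.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def roundtrip(texts):
    """Write texts to a .spacy file as a DocBin and read their text back.

    Hypothetical helper for illustration; not part of spaCy.
    """
    # A blank pipeline only tokenizes, so no trained model is required.
    nlp = spacy.blank("en")
    docs = [nlp(text) for text in texts]

    with tempfile.TemporaryDirectory() as tmp_dir:
        output_file = Path(tmp_dir) / "output.spacy"

        # Serialize the collection of Doc objects, as in the docs example.
        data = DocBin(docs=docs).to_bytes()
        with output_file.open("wb") as file_:
            file_.write(data)

        # Deserialize the file and rebuild Doc objects against the same vocab.
        loaded = DocBin().from_bytes(output_file.read_bytes())
        return [doc.text for doc in loaded.get_docs(nlp.vocab)]


if __name__ == "__main__":
    print(roundtrip(["Hello world.", "Binary formats are compact."]))
```

Writing to a temporary directory keeps the sketch free of side effects; in a real project the `.spacy` file would be written to a stable path and passed to the training corpus reader.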