spaCy/corpus.md at d92954ac1d8aee45a666d2b79f86e846f76991c8

title	teaser	tag	source	new
Corpus	An annotated corpus	class	spacy/gold/corpus.py	3

This class manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. To customize the data loading during training, you can register your own data readers and batchers.

Corpus.__init__

Create a Corpus for iterating Example objects from a file or directory of .spacy data files. The gold_preproc setting lets you specify whether to set up the Example object with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due to train/test skew.

Example

from spacy.gold import Corpus

# With a single file
corpus = Corpus("./data/train.spacy")

# With a directory
corpus = Corpus("./data", limit=10)

Name	Type	Description
`path`	str / `Path`	The directory or filename to read from.
keyword-only
`gold_preproc`	bool	Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`.
`max_length`	int	Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit.
`limit`	int	Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit.
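
To make the `path` argument more concrete, here is a minimal sketch of how a file-or-directory path could be resolved to a list of `.spacy` files. It is not spaCy's implementation: `find_spacy_files` is a hypothetical helper, and searching subdirectories recursively is an assumption made for illustration.

```python
from pathlib import Path
from typing import List, Union


def find_spacy_files(path: Union[str, Path]) -> List[Path]:
    """Hypothetical helper: resolve a file or directory to .spacy files."""
    path = Path(path)
    if path.is_file():
        # A single file is used as-is.
        return [path]
    # Assumption: directories are searched recursively, and only files
    # with the .spacy extension are kept. Sorting gives a stable order.
    return sorted(p for p in path.rglob("*.spacy") if p.is_file())
```

A call such as `find_spacy_files("./data")` would then return every `.spacy` file below `./data`, while `find_spacy_files("./data/train.spacy")` would return just that one file.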

Corpus.__call__

Yield examples from the data.

Example

from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy")
nlp = spacy.blank("en")
train_data = corpus(nlp)

Name	Type	Description
`nlp`	`Language`	The current `nlp` object.
YIELDS	`Example`	The examples.
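
As a rough mental model of how the `max_length` and `limit` settings from `Corpus.__init__` might shape what gets yielded, the sketch below uses plain Python lists in place of `Doc` and `Example` objects. The function `iter_examples` is hypothetical. Two details are assumptions rather than documented behavior: the length check is treated as inclusive, and sentences that are still too long after splitting are skipped.

```python
from itertools import islice
from typing import Iterable, Iterator, List

Sentence = List[str]       # a sentence is a list of tokens
Document = List[Sentence]  # a document is a list of sentences


def iter_examples(
    docs: Iterable[Document], max_length: int = 0, limit: int = 0
) -> Iterator[List[str]]:
    """Hypothetical stand-in for iterating a corpus: yields token lists."""

    def pieces() -> Iterator[List[str]]:
        for sentences in docs:
            tokens = [tok for sent in sentences for tok in sent]
            # 0 means "no limit"; otherwise short documents pass through whole.
            if max_length == 0 or len(tokens) <= max_length:
                yield tokens
            else:
                # Longer documents are split into their sentences.
                for sent in sentences:
                    # Assumption: sentences that are still too long are skipped.
                    if len(sent) <= max_length:
                        yield sent

    examples = pieces()
    if limit > 0:
        # 0 means "no limit"; otherwise stop after `limit` examples.
        examples = islice(examples, limit)
    yield from examples


docs = [[["a", "b"], ["c", "d", "e"]], [["f"]]]
print(list(iter_examples(docs, max_length=3, limit=2)))
# prints [['a', 'b'], ['c', 'd', 'e']]
```

Because the result is a generator, it is consumed once. Iterating the data again means calling the function again, in the same way that `corpus(nlp)` is called to obtain a fresh stream of examples.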
