spaCy/corpus.md at 9f69afdd1e1a059ed855a7830318091bb9ab5271

title	teaser	tag	source	new
Corpus	An annotated corpus	class	spacy/gold/corpus.py	3

This class manages annotated corpora and can read training and development datasets in the DocBin (.spacy) format.

Corpus.__init__

Create a Corpus. The input data can be a file or a directory of files.

Example

from spacy.gold import Corpus

corpus = Corpus("./train.spacy", "./dev.spacy")

Name	Type	Description
`train`	str / `Path`	Training data (`.spacy` file or directory of `.spacy` files).
`dev`	str / `Path`	Development data (`.spacy` file or directory of `.spacy` files).
`limit`	int	Maximum number of examples returned. `0` for no limit (default).
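
The "file or directory" and `limit` semantics in the table above can be illustrated with a minimal standard-library sketch. This is not spaCy's implementation: `iter_spacy_files` and `apply_limit` are hypothetical helpers, and the recursive directory search is an assumption about how a directory of `.spacy` files might be collected.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List, TypeVar, Union

T = TypeVar("T")


def iter_spacy_files(path: Union[str, Path]) -> List[Path]:
    """Hypothetical helper: resolve a file or a directory to a list of .spacy files."""
    path = Path(path)
    if path.is_dir():
        # Assumption: directories are searched recursively, in sorted order.
        return sorted(p for p in path.rglob("*.spacy") if p.is_file())
    # A single file is returned as-is.
    return [path]


def apply_limit(items: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Hypothetical helper: yield at most `limit` items; 0 means no limit."""
    if limit < 0:
        raise ValueError("limit must be >= 0")
    return islice(items, limit) if limit else iter(items)
```

With these helpers, `apply_limit(examples, 0)` passes every example through, while `apply_limit(examples, 100)` stops after the first 100.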

Corpus.train_dataset

Yield examples from the training data.

Example

from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
train_data = corpus.train_dataset(nlp)

Name	Type	Description
`nlp`	`Language`	The current `nlp` object.
keyword-only
`shuffle`	bool	Whether to shuffle the examples. Defaults to `True`.
`gold_preproc`	bool	Whether to train on gold-standard sentences and tokens. Defaults to `False`.
`max_length`	int	Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. `0` for no limit (default).
YIELDS	`Example`	The examples.
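
The `max_length` rule described above can be sketched in plain Python, with documents modelled as lists of token strings. `split_long_doc` is a hypothetical helper, not part of spaCy. Two details are assumptions because the table does not state them: a document of exactly `max_length` tokens is kept whole, and a long document without sentence boundaries is passed through unchanged.

```python
from typing import Iterator, List, Optional


def split_long_doc(
    tokens: List[str],
    sentences: Optional[List[List[str]]] = None,
    max_length: int = 0,
) -> Iterator[List[str]]:
    """Hypothetical helper illustrating the documented `max_length` behaviour."""
    if max_length == 0 or len(tokens) <= max_length:
        # No limit set, or the document is short enough: keep it whole.
        yield tokens
    elif sentences:
        # Too long, and sentence boundaries are available: yield each sentence.
        for sent in sentences:
            yield sent
    else:
        # Too long, but no boundaries. Assumption: pass through unchanged.
        yield tokens


doc = ["I", "like", "cats", ".", "You", "like", "dogs", "."]
sents = [doc[:4], doc[4:]]
pieces = list(split_long_doc(doc, sents, max_length=5))
print(len(pieces))  # 2
```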

Corpus.dev_dataset

Yield examples from the development data.

Example

from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
dev_data = corpus.dev_dataset(nlp)

Name	Type	Description
`nlp`	`Language`	The current `nlp` object.
keyword-only
`gold_preproc`	bool	Whether to train on gold-standard sentences and tokens. Defaults to `False`.
YIELDS	`Example`	The examples.

Corpus.count_train

Get the word count of all training examples.

Example

from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
word_count = corpus.count_train(nlp)

Name	Type	Description
`nlp`	`Language`	The current `nlp` object.
RETURNS	int	The word count.
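
As a rough illustration of what a corpus-wide word count means, the sketch below sums token counts over documents modelled as lists of token strings. `count_words` is a hypothetical stand-in and makes no claim about how `Corpus.count_train` is implemented.

```python
from typing import Iterable, Sequence


def count_words(docs: Iterable[Sequence[str]]) -> int:
    """Hypothetical helper: total number of tokens across all documents."""
    return sum(len(doc) for doc in docs)


docs = [["This", "is", "a", "sentence", "."], ["Another", "one", "."]]
print(count_words(docs))  # 8
```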
