mirror of https://github.com/explosion/spaCy.git
title | teaser | tag | source | new |
---|---|---|---|---|
Corpus | An annotated corpus | class | spacy/gold/corpus.py | 3 |
This class manages annotated corpora and can read training and development
datasets in the DocBin (.spacy) format.
Corpus.__init__
Create a Corpus. The input data can be a file or a directory of files.
Example
from spacy.gold import Corpus

corpus = Corpus("./train.spacy", "./dev.spacy")
Name | Type | Description |
---|---|---|
train | str / Path | Training data (.spacy file or directory of .spacy files). |
dev | str / Path | Development data (.spacy file or directory of .spacy files). |
limit | int | Maximum number of examples returned. 0 for no limit (default). |
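The limit convention in the table above (0 meaning "no limit") can be illustrated without spaCy. The following is a minimal sketch only: `take_limited` is a hypothetical helper written for this illustration, not part of the Corpus API, and it says nothing about how Corpus applies the limit internally.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_limited(examples: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Yield at most `limit` items; a limit of 0 means no limit.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    if limit < 0:
        raise ValueError("limit must be >= 0")
    # islice(..., None) passes every item through unchanged.
    return islice(examples, limit or None)


examples = ["doc-a", "doc-b", "doc-c"]
first_two = list(take_limited(examples, limit=2))
everything = list(take_limited(examples))
```

Using 0 rather than None as the "unlimited" value keeps the parameter a plain int, which matches the type listed in the table.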
Corpus.train_dataset
Yield examples from the training data.
Example
from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
train_data = corpus.train_dataset(nlp)
Name | Type | Description |
---|---|---|
nlp | Language | The current nlp object. |
keyword-only | | |
shuffle | bool | Whether to shuffle the examples. Defaults to True. |
gold_preproc | bool | Whether to train on gold-standard sentences and tokens. Defaults to False. |
max_length | int | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. 0 for no limit (default). |
YIELDS | Example | The examples. |
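The max_length behavior described in the table can be sketched in plain Python. This is an illustration under stated assumptions, not spaCy's implementation: `split_long_docs` is a hypothetical helper, documents are modeled as lists of sentences, and a document with a single sentence stands in for one without sentence boundaries. The table does not say what happens to over-length documents that cannot be split, or to sentences that are themselves too long; the sketch simply skips both, which is an assumption.

```python
from typing import Iterable, Iterator, List

Sentence = List[str]


def split_long_docs(
    docs: Iterable[List[Sentence]], max_length: int = 0
) -> Iterator[List[str]]:
    """Yield token lists, splitting over-length documents into sentences.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    for sentences in docs:
        tokens = [tok for sent in sentences for tok in sent]
        if max_length == 0 or len(tokens) <= max_length:
            # No limit, or short enough: keep the document whole.
            yield tokens
        elif len(sentences) > 1:
            # Too long, but boundaries exist: emit sentence by sentence.
            # Assumption: sentences that are still too long are skipped.
            for sent in sentences:
                if len(sent) <= max_length:
                    yield sent
        # Assumption: over-length documents without boundaries are skipped.


docs = [
    [["Short", "doc", "."]],
    [["This", "is", "one", "."], ["Here", "is", "another", "sentence", "."]],
]
pieces = list(split_long_docs(docs, max_length=5))
unlimited = list(split_long_docs(docs))
```

With max_length=5 the nine-token second document is emitted as its two sentences, so `pieces` holds three token lists; with the default of 0 both documents pass through whole.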
Corpus.dev_dataset
Yield examples from the development data.
Example
from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
dev_data = corpus.dev_dataset(nlp)
Name | Type | Description |
---|---|---|
nlp | Language | The current nlp object. |
keyword-only | | |
gold_preproc | bool | Whether to train on gold-standard sentences and tokens. Defaults to False. |
YIELDS | Example | The examples. |
Corpus.count_train
Get the word count of all training examples.
Example
from spacy.gold import Corpus
import spacy

corpus = Corpus("./train.spacy", "./dev.spacy")
nlp = spacy.blank("en")
word_count = corpus.count_train(nlp)
Name | Type | Description |
---|---|---|
nlp | Language | The current nlp object. |
RETURNS | int | The word count. |