---
title: Corpus
teaser: An annotated corpus
tag: class
source: spacy/training/corpus.py
new: 3
---

This class manages annotated corpora and can be used for training and
development datasets in the [`DocBin`](/api/docbin) (`.spacy`) format. To
customize the data loading during training, you can register your own
[data readers and batchers](/usage/training#custom-code-readers-batchers). Also
see the usage guide on [data utilities](/usage/training#data) for more details
and examples.

## Config and implementation {#config}

`spacy.Corpus.v1` is a registered function that creates a `Corpus` of training
or evaluation data. It takes the same arguments as the `Corpus` class and
returns a callable that yields [`Example`](/api/example) objects. You can
replace it with your own registered function in the
[`@readers` registry](/api/top-level#registry) to customize the data loading and
streaming.

> #### Example config
>
> ```ini
> [paths]
> train = "corpus/train.spacy"
>
> [corpora.train]
> @readers = "spacy.Corpus.v1"
> path = ${paths.train}
> gold_preproc = false
> max_length = 0
> limit = 0
> augmenter = null
> ```

| Name           | Description                                                                                                                                                                                                                                                                              |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`         | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Path~~                                                                                                                                                    |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~                                                                                                                                 |
| `max_length`   | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~                                                                                                                                      |
| `limit`        | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                                                                                                                                                                          |
| `augmenter`    | Optional callback that applies simple data augmentation by replacing tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that contain no smart quotes, or only smart quotes. Defaults to `None`. ~~Optional[Callable]~~ |

```python
%%GITHUB_SPACY/spacy/training/corpus.py
```
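
As a rough illustration of replacing the built-in reader, the sketch below registers a function in the `@readers` registry. Like `spacy.Corpus.v1`, it returns a callable that takes the `nlp` object and yields `Example` objects. The registry name `my_texts_reader.v1`, the function `create_texts_reader` and its `texts` argument are hypothetical and only serve this example.

```python
from typing import Callable, Iterable, List

import spacy
from spacy.language import Language
from spacy.training import Example


# Hypothetical reader: the name "my_texts_reader.v1" and the "texts" argument
# are made up for this sketch and are not part of spaCy.
@spacy.registry.readers("my_texts_reader.v1")
def create_texts_reader(texts: List[str]) -> Callable[[Language], Iterable[Example]]:
    def read_examples(nlp: Language) -> Iterable[Example]:
        for text in texts:
            # Tokenize only; the empty dict means there are no gold annotations
            doc = nlp.make_doc(text)
            yield Example.from_dict(doc, {})

    return read_examples


nlp = spacy.blank("en")
# Look up the registered function by name and create the reader
reader = spacy.registry.readers.get("my_texts_reader.v1")(texts=["Hello world"])
examples = list(reader(nlp))
print(len(examples), examples[0].reference.text)
```

In a config, such a reader would be referenced as `@readers = "my_texts_reader.v1"` in a `[corpora.*]` block, with its arguments listed as settings of that block.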

## Corpus.\_\_init\_\_ {#init tag="method"}

Create a `Corpus` for iterating [`Example`](/api/example) objects from a file or
directory of [`.spacy` data files](/api/data-formats#binary-training). The
`gold_preproc` setting lets you specify whether to set up the `Example` object
with gold-standard sentences and tokens for the predictions. Gold preprocessing
helps the annotations align to the tokenization, and may result in sequences of
more consistent length. However, it may reduce runtime accuracy due to
train/test skew.

> #### Example
>
> ```python
> from spacy.training import Corpus
>
> # With a single file
> corpus = Corpus("./data/train.spacy")
>
> # With a directory
> corpus = Corpus("./data", limit=10)
> ```

| Name           | Description                                                                                                                                         |
| -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`         | The directory or filename to read from. ~~Union[str, Path]~~                                                                                        |
| _keyword-only_ |                                                                                                                                                     |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~                     |
| `max_length`   | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit`        | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                                     |
| `augmenter`    | Optional data augmentation callback. Defaults to `None`. ~~Callable[[Language, Example], Iterable[Example]]~~                                       |
| `shuffle`      | Whether to shuffle the examples. Defaults to `False`. ~~bool~~                                                                                      |
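
The effect of `gold_preproc` can be seen in a small sketch, assuming a blank English pipeline and a temporary `.spacy` file created just for this illustration. The gold-standard document keeps "can't" as one token, while the English tokenizer splits it.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")

# Gold-standard tokenization that keeps "can't" as a single token, whereas the
# English tokenizer splits it into "ca" and "n't".
gold_doc = Doc(nlp.vocab, words=["I", "can't", "stop"], spaces=[True, True, False])
doc_bin = DocBin()
doc_bin.add(gold_doc)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    doc_bin.to_disk(path)

    # Default: the predicted doc is produced by the pipeline's own tokenizer
    raw_example = next(iter(Corpus(path)(nlp)))
    # gold_preproc=True: the predicted doc reuses the gold-standard tokens
    gold_example = next(iter(Corpus(path, gold_preproc=True)(nlp)))

raw_tokens = [token.text for token in raw_example.predicted]
gold_tokens = [token.text for token in gold_example.predicted]
print(raw_tokens)   # ['I', 'ca', "n't", 'stop']
print(gold_tokens)  # ['I', "can't", 'stop']
```

With the default setting, the model sees the same tokenization it will encounter at runtime, which is why gold preprocessing can introduce train/test skew.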

## Corpus.\_\_call\_\_ {#call tag="method"}

Yield examples from the data.

> #### Example
>
> ```python
> from spacy.training import Corpus
> import spacy
>
> corpus = Corpus("./train.spacy")
> nlp = spacy.blank("en")
> train_data = corpus(nlp)
> ```

| Name       | Description                            |
| ---------- | -------------------------------------- |
| `nlp`      | The current `nlp` object. ~~Language~~ |
| **YIELDS** | The examples. ~~Example~~              |
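
To make the yielded objects more concrete, here is a minimal sketch that stores one annotated document in a temporary `.spacy` file and reads it back. The sample sentence and entity label are illustrative assumptions. Each `Example` pairs an unannotated `predicted` document with the annotated `reference` document.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span
from spacy.training import Corpus

nlp = spacy.blank("en")
doc = nlp("Apple is based in Cupertino.")
# Add a gold-standard entity annotation to the document
doc.ents = [Span(doc, 0, 1, label="ORG")]

doc_bin = DocBin()
doc_bin.add(doc)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    doc_bin.to_disk(path)
    corpus = Corpus(path)
    # The corpus can be called more than once, e.g. once per epoch
    first_pass = list(corpus(nlp))
    second_pass = list(corpus(nlp))

example = first_pass[0]
# The reference doc carries the annotations, the predicted doc does not
reference_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
predicted_ents = [(ent.text, ent.label_) for ent in example.predicted.ents]
print(reference_ents, predicted_ents)
```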

## JsonlCorpus {#jsonlcorpus tag="class"}

Iterate over the raw texts in a file or directory of JSONL (newline-delimited
JSON) files and yield them as [`Example`](/api/example) objects. Can be used to
read the raw text corpus for language model
[pretraining](/usage/embeddings-transformers#pretraining).

> #### Tip: Writing JSONL
>
> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a
> handy `write_jsonl` helper that takes a file path and list of dictionaries and
> writes out JSONL-formatted data.
>
> ```python
> import srsly
> data = [{"text": "Some text"}, {"text": "More..."}]
> srsly.write_jsonl("/path/to/text.jsonl", data)
> ```

```json
### Example
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
```

### JsonlCorpus.\_\_init\_\_ {#jsonlcorpus tag="method"}

Initialize the reader.

> #### Example
>
> ```python
> from spacy.training import JsonlCorpus
>
> corpus = JsonlCorpus("./data/texts.jsonl")
> ```
>
> ```ini
> ### Example config
> [corpora.pretrain]
> @readers = "spacy.JsonlCorpus.v1"
> path = "corpus/raw_text.jsonl"
> min_length = 0
> max_length = 0
> limit = 0
> ```

| Name           | Description                                                                                                                      |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `path`         | The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. ~~Union[str, Path]~~ |
| _keyword-only_ |                                                                                                                                  |
| `min_length`   | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~       |
| `max_length`   | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~        |
| `limit`        | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                  |
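
The length settings can be tried out with a small sketch, assuming a blank English pipeline and a temporary JSONL file written with `srsly`. The sample texts and the thresholds are arbitrary choices for illustration.

```python
import tempfile
from pathlib import Path

import spacy
import srsly
# Imported from the module listed as this page's source
from spacy.training.corpus import JsonlCorpus

nlp = spacy.blank("en")
records = [
    {"text": "Hi"},  # 1 token: shorter than min_length
    {"text": "This is a sentence"},  # 4 tokens: kept
    # 12 tokens: longer than max_length
    {"text": "one two three four five six seven eight nine ten eleven twelve"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "texts.jsonl"
    srsly.write_jsonl(path, records)
    corpus = JsonlCorpus(path, min_length=2, max_length=10)
    # Consume the generator while the temporary file still exists
    kept_texts = [example.reference.text for example in corpus(nlp)]

print(kept_texts)  # ['This is a sentence']
```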

### JsonlCorpus.\_\_call\_\_ {#jsonlcorpus-call tag="method"}

Yield examples from the data.

> #### Example
>
> ```python
> from spacy.training import JsonlCorpus
> import spacy
>
> corpus = JsonlCorpus("./texts.jsonl")
> nlp = spacy.blank("en")
> data = corpus(nlp)
> ```

| Name       | Description                            |
| ---------- | -------------------------------------- |
| `nlp`      | The current `nlp` object. ~~Language~~ |
| **YIELDS** | The examples. ~~Example~~              |
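
When `path` is a directory, the records of all `.jsonl` files inside it are read. The following sketch assumes two small files, `part1.jsonl` and `part2.jsonl`, whose names and contents are made up for this example. It also checks that the predicted and reference documents of each yielded `Example` contain the same tokens.

```python
import tempfile
from pathlib import Path

import spacy
import srsly
# Imported from the module listed as this page's source
from spacy.training.corpus import JsonlCorpus

nlp = spacy.blank("en")

with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = Path(tmp_dir)
    srsly.write_jsonl(data_dir / "part1.jsonl", [{"text": "First text"}])
    srsly.write_jsonl(data_dir / "part2.jsonl", [{"text": "Second text"}])
    corpus = JsonlCorpus(data_dir)
    # Consume the generator while the temporary files still exist
    examples = list(corpus(nlp))

# Sort the texts so the result does not depend on the file order
texts = sorted(example.predicted.text for example in examples)
same_tokens = all(
    [t.text for t in eg.predicted] == [t.text for t in eg.reference]
    for eg in examples
)
print(texts, same_tokens)
```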