spaCy

Daniël de Kok 8d69874afb Add `spacy.PlainTextCorpusReader.v1` (#12122)	2023-01-26 11:33:22 +01:00

* Add `spacy.PlainTextCorpusReader.v1`

  This is a corpus reader that reads plain text corpora with the following format:

  - UTF-8 encoding
  - One line per document.
  - Blank lines are ignored.

  It is useful for applications where we deal with very large corpora, such as distillation, and don't want to deal with the space overhead of serialized formats. Additionally, many large corpora already use such a text format, keeping the necessary preprocessing to a minimum.

* Update spacy/training/corpus.py

  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`
* Add docstring to registry function
* Add plain text corpus tests
* Only strip newline/carriage return
* Add return type _string_to_tmp_file helper
* Use a temporary directory in place of file name

  Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)
* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
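
The file format described in the commit message above (UTF-8, one document per line, blank lines ignored, only newline/carriage return stripped) can be illustrated with a minimal standard-library sketch. This is not spaCy's implementation: `read_plain_text_corpus` is a hypothetical helper name, and the choice to keep whitespace-only lines as documents is an assumption inferred from the "Only strip newline/carriage return" note.

```python
from pathlib import Path
from typing import Iterator, Union


def read_plain_text_corpus(path: Union[str, Path]) -> Iterator[str]:
    """Yield one document text per non-blank line of a UTF-8 file.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            # Strip only the line terminator, so leading/trailing
            # spaces inside a document are preserved (assumption).
            text = line.rstrip("\r\n")
            # Lines that are empty after stripping are skipped.
            if text:
                yield text
```

Because the function is a generator, a very large corpus can be streamed line by line without loading the whole file into memory, which matches the motivation given in the commit message.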
converters	Auto-format code with black (#10377)	2022-02-25 10:00:21 +01:00
__init__.pxd	…
__init__.py	Add `spacy.PlainTextCorpusReader.v1` (#12122)	2023-01-26 11:33:22 +01:00
align.pyx	…
alignment.py	Alignment: use a simplified ragged type for performance (#10319)	2022-04-01 09:02:06 +02:00
alignment_array.pxd	Alignment: use a simplified ragged type for performance (#10319)	2022-04-01 09:02:06 +02:00
alignment_array.pyx	Backport parser/alignment optimizations from `feature/refactor-parser` (#10952)	2022-06-24 13:39:52 +02:00
augment.py	Preserve missing entity annotation in augmenters (#11540)	2022-09-27 10:16:51 +02:00
batchers.py	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)	2021-10-14 15:21:40 +02:00
callbacks.py	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)	2021-10-14 15:21:40 +02:00
corpus.py	Add `spacy.PlainTextCorpusReader.v1` (#12122)	2023-01-26 11:33:22 +01:00
example.pxd	…
example.pyx	Cast to uint64 for all array-based doc representations (#11933)	2022-12-12 08:45:35 +01:00
gold_io.pyx	…
initialize.py	Clean up warnings in the test suite (#11331)	2022-08-22 12:04:30 +02:00
iob_utils.py	Preserve missing entity annotation in augmenters (#11540)	2022-09-27 10:16:51 +02:00
loggers.py	New console logger with expanded progress tracking (#11972)	2022-12-23 15:21:44 +01:00
loop.py	Add `training.before_update` callback (#11739)	2022-11-23 17:54:58 +01:00
pretrain.py	Clarify how to fill in init_tok2vec after pretraining (#9639)	2021-11-18 15:38:30 +01:00