spaCy/spacy
Daniël de Kok 8d69874afb
Add `spacy.PlainTextCorpusReader.v1` (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding.
- One line per document.
- Blank lines are ignored.

It is useful for applications that work with very large corpora, such as
distillation, where the space overhead of serialized formats is
undesirable. Additionally, many large corpora already use such a text
format, which keeps the necessary preprocessing to a minimum.
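
The format described above can be illustrated with a minimal sketch in
plain Python. This is not spaCy's implementation: the function name
`iter_plain_text_docs` is hypothetical, and the sketch only yields the raw
text of each document instead of building `Doc` or `Example` objects. It
assumes that a line counts as blank when nothing remains after removing
the newline/carriage return.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_plain_text_docs(path: Path) -> Iterator[str]:
    """Yield one document text per non-blank line of a UTF-8 file.

    Hypothetical helper for illustration; not part of spaCy.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Strip only the line terminator, keeping other whitespace.
            text = line.rstrip("\r\n")
            # Assumption: a line is blank if it is empty after stripping.
            if text:
                yield text


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        corpus = Path(tmp_dir) / "corpus.txt"
        corpus.write_bytes("First document.\n\nSecond document.\n".encode("utf-8"))
        for text in iter_plain_text_docs(corpus):
            print(text)
```

Because the file is read lazily line by line, memory use stays
independent of corpus size, which matches the motivation given above.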

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type _string_to_tmp_file helper

* Use a temporary directory in place of file name

Auto-delete and file-sharing semantics for temporary files differ
between operating systems and are just wonky.
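
A rough sketch of the idea, assuming a test helper that writes a string
to a file inside a temporary directory and hands back its path. The name
`string_to_tmp_file` and the file name `corpus.txt` are hypothetical; the
actual `_string_to_tmp_file` helper in the test suite may differ.

```python
import contextlib
import tempfile
from pathlib import Path
from typing import Iterator


@contextlib.contextmanager
def string_to_tmp_file(text: str) -> Iterator[Path]:
    """Write `text` to a file in a fresh temporary directory.

    Hypothetical helper for illustration. The file is created, written
    and closed before its path is yielded, so the code under test can
    reopen it by name on any OS. The directory and its contents are
    removed when the context exits.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "corpus.txt"
        path.write_bytes(text.encode("utf-8"))
        yield path


if __name__ == "__main__":
    with string_to_tmp_file("One document.\nAnother document.\n") as path:
        print(path.read_text(encoding="utf-8"))
```

Owning a directory rather than an open named temporary file avoids
relying on whether a second open of the same file is permitted while the
first handle is still held.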

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
cli Add a `spacy benchmark speed` subcommand (#11902) 2023-01-12 11:55:21 +01:00
displacy Auto-format code with black (#12100) 2023-01-13 10:12:10 +01:00
kb API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128) 2023-01-19 13:29:17 +01:00
lang Update stop_words.py (#11997) 2022-12-19 16:17:49 +01:00
matcher Fix comments and examples for levenshtein_compare (#12113) 2023-01-18 08:02:33 +01:00
ml Handle Docs with no entities in EntityLinker (#11640) 2022-10-28 10:25:34 +02:00
pipeline Fix speed problem with `top_k>1` on CPU in edit tree lemmatizer (#12017) 2023-01-20 19:34:11 +01:00
tests Add `spacy.PlainTextCorpusReader.v1` (#12122) 2023-01-26 11:33:22 +01:00
tokens Fix `SpanGroup` and `Span` typing (#12009) 2022-12-21 18:54:27 +01:00
training Add `spacy.PlainTextCorpusReader.v1` (#12122) 2023-01-26 11:33:22 +01:00
__init__.pxd
__init__.py Simplify and clarify enable/disable behavior of spacy.load() (#11459) 2022-09-27 14:22:36 +02:00
__main__.py
about.py Set version to v3.5.0 2022-11-25 12:05:25 +01:00
attrs.pxd
attrs.pyx Intify IOB (#9738) 2022-01-20 13:19:38 +01:00
compat.py Custom component types in spacy.ty (#9469) 2021-10-21 15:31:06 +02:00
default_config.cfg Add `training.before_update` callback (#11739) 2022-11-23 17:54:58 +01:00
default_config_pretraining.cfg Add new parameter for saving every n epoch in pretraining (#8912) 2021-08-12 11:14:48 +02:00
errors.py Clean up displacy port-related error messages, docs (#12089) 2023-01-12 14:54:09 +09:00
glossary.py Add glossary entry for root (#10821) 2022-05-20 09:56:32 +02:00
language.py Replace Pipe type with Callable in Language (#11803) 2022-11-29 13:20:08 +01:00
lexeme.pxd
lexeme.pyi fix type of lexeme.rank (#9979) 2022-01-04 13:15:25 +01:00
lexeme.pyx Bugfix for similarity return types (#10051) 2022-01-20 11:40:46 +01:00
lookups.py Fix issues for Mypy 0.950 and Pydantic 1.9.0 (#10786) 2022-05-25 09:33:54 +02:00
morphology.pxd
morphology.pyx
parts_of_speech.pxd
parts_of_speech.pyx
pipe_analysis.py 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
py.typed
schemas.py Auto-format code with black (#12100) 2023-01-13 10:12:10 +01:00
scorer.py Restore v2 token_acc score implementation (#12073) 2023-01-11 08:01:47 +01:00
strings.pxd `StringStore`-related optimizations (#10938) 2022-07-04 15:04:03 +02:00
strings.pyi Fix StringStore.__getitem__ return type depending on parameter types (#10741) 2022-05-03 17:57:07 +02:00
strings.pyx `StringStore`-related optimizations (#10938) 2022-07-04 15:04:03 +02:00
structs.pxd
symbols.pxd
symbols.pyx
tokenizer.pxd Add tokenizer option to allow Matcher handling for all rules (#10452) 2022-03-24 13:21:32 +01:00
tokenizer.pyx Add tokenizer option to allow Matcher handling for all rules (#10452) 2022-03-24 13:21:32 +01:00
ty.py Custom component types in spacy.ty (#9469) 2021-10-21 15:31:06 +02:00
typedefs.pxd
typedefs.pyx
util.py improve ux for displacy when the serve port is in use (#11948) 2023-01-10 15:52:57 +09:00
vectors.pyx Add equality definition for vectors (#11806) 2022-11-16 09:44:42 +01:00
vocab.pxd Add support for floret vectors (#8909) 2021-10-27 14:08:31 +02:00
vocab.pyi Add vector deduplication (#10551) 2022-03-30 08:54:23 +02:00
vocab.pyx fix comparison of constants (#11834) 2022-11-21 08:12:03 +01:00