# spaCy/spacy/tests/doc/test_creation.py

import pytest
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy import util


@pytest.fixture
def vocab():
    return Vocab()


def test_empty_doc(vocab):
    doc = Doc(vocab)
    assert len(doc) == 0


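def test_single_word(vocab):
    # without explicit spaces, each word is assumed to be followed by a space
    doc = Doc(vocab, words=["a"])
    assert doc.text == "a "
    # spaces=[False] suppresses the trailing space
    doc = Doc(vocab, words=["a"], spaces=[False])
    assert doc.text == "a"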


def test_create_from_words_and_text(vocab):
    # no whitespace in words
    words = ["'", "dogs", "'", "run"]
    text = "  'dogs'\n\nrun  "
    (words, spaces) = util.get_words_and_spaces(words, text)
    doc = Doc(vocab, words=words, spaces=spaces)
    assert [t.text for t in doc] == ["  ", "'", "dogs", "'", "\n\n", "run", " "]
    assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
    assert doc.text == text
    assert [t.text for t in doc if not t.text.isspace()] == [
        word for word in words if not word.isspace()
    ]

    # partial whitespace in words
    words = ["  ", "'", "dogs", "'", "\n\n", "run", " "]
    text = "  'dogs'\n\nrun  "
    (words, spaces) = util.get_words_and_spaces(words, text)
    doc = Doc(vocab, words=words, spaces=spaces)
    assert [t.text for t in doc] == ["  ", "'", "dogs", "'", "\n\n", "run", " "]
    assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
    assert doc.text == text
    assert [t.text for t in doc if not t.text.isspace()] == [
        word for word in words if not word.isspace()
    ]

    # non-standard whitespace tokens
    words = [" ", " ", "'", "dogs", "'", "\n\n", "run"]
    text = "  'dogs'\n\nrun  "
    (words, spaces) = util.get_words_and_spaces(words, text)
    doc = Doc(vocab, words=words, spaces=spaces)
    assert [t.text for t in doc] == ["  ", "'", "dogs", "'", "\n\n", "run", " "]
    assert [t.whitespace_ for t in doc] == ["", "", "", "", "", " ", ""]
    assert doc.text == text
    assert [t.text for t in doc if not t.text.isspace()] == [
        word for word in words if not word.isspace()
    ]

    # mismatch between words and text
    words = [" ", " ", "'", "dogs", "'", "\n\n", "run"]
    text = "  'dogs'\n\nrun  "
    with pytest.raises(ValueError):
        # the extra word "away" does not occur in the text
        util.get_words_and_spaces(words + ["away"], text)