//- 💫 DOCS > USAGE > ADDING LANGUAGES > TESTING

p
    | Before using the new language or submitting a
    | #[+a(gh("spaCy") + "/pulls") pull request] to spaCy, you should make sure
    | it works as expected. This is especially important if you've added custom
    | regular expressions for token matching or punctuation, as you don't want
    | to cause regressions.
+infobox("spaCy's test suite")
| spaCy uses the #[+a("https://docs.pytest.org/en/latest/") pytest framework]
| for testing. For more details on how the tests are structured and best
| practices for writing your own tests, see our
| #[+a(gh("spaCy", "spacy/tests")) tests documentation].
+h(3, "testing-custom") Writing language-specific tests
p
| It's recommended to always add at least some tests with examples specific
| to the language. Language tests should be located in
| #[+src(gh("spaCy", "spacy/tests/lang")) #[code tests/lang]] in a
| directory named after the language ID. You'll also need to create a
| fixture for your tokenizer in the
| #[+src(gh("spaCy", "spacy/tests/conftest.py")) #[code conftest.py]].
| Always use the #[+api("util#get_lang_class") #[code get_lang_class()]]
| helper function within the fixture, instead of importing the class at the
| top of the file. This will load the language data only when it's needed.
| (Otherwise, #[em all data] would be loaded every time you run a test.)

+code.
    import pytest

    from spacy import util

    @pytest.fixture
    def en_tokenizer():
        return util.get_lang_class('en').Defaults.create_tokenizer()
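
p
    | The fixture for your own language follows the same pattern, just with
    | your language's ID. As a sketch, using the hypothetical ID #[code yy],
    | the fixture in #[code conftest.py] would look like this:

+code.
    @pytest.fixture
    def yy_tokenizer():
        # 'yy' is a placeholder, substitute your language's ID
        return util.get_lang_class('yy').Defaults.create_tokenizer()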

p
    | When adding test cases, always
    | #[+a(gh("spaCy", "spacy/tests#parameters")) #[code parametrize]] them.
    | This will make it easier for others to add more test cases without
    | having to modify the test itself. You can also add parameter tuples, for
    | example, a test sentence and its expected length, or a list of expected
    | tokens. Here's an example of an English tokenizer test for combinations
    | of punctuation and abbreviations:
+code("Example test").
@pytest.mark.parametrize('text,length', [
("The U.S. Army likes Shock and Awe.", 8),
("U.N. regulations are not a part of their concern.", 10),
("“Isn't it?”", 6)])
def test_en_tokenizer_handles_punct_abbrev(en_tokenizer, text, length):
tokens = en_tokenizer(text)
assert len(tokens) == length
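
p
    | The same pattern works for checking the tokens themselves rather than
    | just their number. The test below is a sketch: the example sentence and
    | its expected split are illustrative, not taken from spaCy's test suite.

+code.
    @pytest.mark.parametrize('text,expected_tokens', [
        ("I can't wait.", ["I", "ca", "n't", "wait", "."])])
    def test_en_tokenizer_handles_contraction(en_tokenizer, text, expected_tokens):
        tokens = en_tokenizer(text)
        assert [token.text for token in tokens] == expected_tokens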