spaCy/spacy/tests/doc/test_to_json.py

# coding: utf-8
from __future__ import unicode_literals

import pytest
from spacy.cli._schemas import TRAINING_SCHEMA
from spacy.util import get_json_validator, validate_json
from spacy.tokens import Doc
from ..util import get_doc


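@pytest.fixture()
def doc(en_vocab):
    """Three-token Doc with POS tags, a dependency parse and one entity."""
    words = ["c", "d", "e"]
    pos = ["VERB", "NOUN", "NOUN"]
    tags = ["VBP", "NN", "NN"]
    # Heads are relative offsets: "c" is the root, "d" and "e" both attach to "c".
    heads = [0, -1, -2]
    deps = ["ROOT", "dobj", "dobj"]
    # Entity spans are given as token indices (start, end, label): token 1 is "d".
    ents = [(1, 2, "ORG")]
    return get_doc(
        en_vocab, words=words, pos=pos, tags=tags, heads=heads, deps=deps, ents=ents
    )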


def test_doc_to_json(doc):
    """Test that Doc.to_json() exports text, token annotations and entities."""
    json_doc = doc.to_json()
    # Every token is followed by a space by default, hence the trailing space.
    assert json_doc["text"] == "c d e "
    assert len(json_doc["tokens"]) == 3
    assert json_doc["tokens"][0]["pos"] == "VERB"
    assert json_doc["tokens"][0]["tag"] == "VBP"
    assert json_doc["tokens"][0]["dep"] == "ROOT"
    assert len(json_doc["ents"]) == 1
    # Entity boundaries are character offsets into the text, not token
    # indices: "d" spans characters 2-3 of "c d e ".
    assert json_doc["ents"][0]["start"] == 2
    assert json_doc["ents"][0]["end"] == 3
    assert json_doc["ents"][0]["label"] == "ORG"


def test_doc_to_json_underscore(doc):
    """Test that Doc.to_json() exports the selected custom attributes under
    the "_" key."""
    Doc.set_extension("json_test1", default=False)
    Doc.set_extension("json_test2", default=False)
    doc._.json_test1 = "hello world"
    doc._.json_test2 = [1, 2, 3]
    json_doc = doc.to_json(underscore=["json_test1", "json_test2"])
    assert "_" in json_doc
    assert json_doc["_"]["json_test1"] == "hello world"
    assert json_doc["_"]["json_test2"] == [1, 2, 3]


def test_doc_to_json_underscore_error_attr(doc):
    """Test that Doc.to_json() raises an error if a custom attribute doesn't
    exist in the ._ space."""
    with pytest.raises(ValueError):
        doc.to_json(underscore=["json_test3"])


def test_doc_to_json_underscore_error_serialize(doc):
    """Test that Doc.to_json() raises an error if a custom attribute value
    isn't JSON-serializable."""
    Doc.set_extension("json_test4", method=lambda doc: doc.text)
    with pytest.raises(ValueError):
        doc.to_json(underscore=["json_test4"])


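def test_doc_to_json_valid_training(doc):
    """Test that the output of Doc.to_json() validates against the training
    data JSON schema."""
    json_doc = doc.to_json()
    validator = get_json_validator(TRAINING_SCHEMA)
    # The schema describes a list of documents, so wrap the single doc in a list.
    errors = validate_json([json_doc], validator)
    assert not errors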