# spaCy/spacy/tests/lang/ko/test_tokenizer.py

import pytest

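# Cases for the MeCab-based Korean tokenizer (natto-py), used via `ko_tokenizer`.
# Expected values are space-separated and split on whitespace in each test.
# fmt: off
TOKENIZER_TESTS = [("서울 타워 근처에 살고 있습니다.", "서울 타워 근처 에 살 고 있 습니다 ."),
                   ("영등포구에 있는 맛집 좀 알려주세요.", "영등포구 에 있 는 맛집 좀 알려 주 세요 ."),
                   # Regression case: text containing a regexp special character (#4022).
                   ("10$ 할인코드를 적용할까요?", "10 $ 할인 코드 를 적용 할까요 ?")]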

TAG_TESTS = [("서울 타워 근처에 살고 있습니다.",
              "NNP NNG NNG JKB VV EC VX EF SF"),
             ("영등포구에 있는 맛집 좀 알려주세요.",
              "NNP JKB VV ETM NNG MAG VV VX EP SF")]

FULL_TAG_TESTS = [("영등포구에 있는 맛집 좀 알려주세요.",
                   "NNP JKB VV ETM NNG MAG VV+EC VX EP+EF SF")]

POS_TESTS = [("서울 타워 근처에 살고 있습니다.",
              "PROPN NOUN NOUN ADP VERB X AUX X PUNCT"),
             ("영등포구에 있는 맛집 좀 알려주세요.",
              "PROPN ADP VERB X NOUN ADV VERB AUX X PUNCT")]
# fmt: on


@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS)
def test_ko_tokenizer(ko_tokenizer, text, expected_tokens):
    tokens = [token.text for token in ko_tokenizer(text)]
    assert tokens == expected_tokens.split()


@pytest.mark.parametrize("text,expected_tags", TAG_TESTS)
def test_ko_tokenizer_tags(ko_tokenizer, text, expected_tags):
    tags = [token.tag_ for token in ko_tokenizer(text)]
    assert tags == expected_tags.split()


@pytest.mark.parametrize("text,expected_tags", FULL_TAG_TESTS)
def test_ko_tokenizer_full_tags(ko_tokenizer, text, expected_tags):
    # Compound tags such as "VV+EC" are kept in full in `user_data["full_tags"]`.
    tags = ko_tokenizer(text).user_data["full_tags"]
    assert tags == expected_tags.split()


@pytest.mark.parametrize("text,expected_pos", POS_TESTS)
def test_ko_tokenizer_pos(ko_tokenizer, text, expected_pos):
    pos = [token.pos_ for token in ko_tokenizer(text)]
    assert pos == expected_pos.split()


def test_ko_empty_doc(ko_tokenizer):
    # Regression test: empty input used to raise a ValueError (#4245).
    tokens = ko_tokenizer("")
    assert len(tokens) == 0


@pytest.mark.issue(10535)
def test_ko_tokenizer_unknown_tag(ko_tokenizer):
    # Unknown tags in the tag map are expected to yield the POS "X" (#10536).
    tokens = ko_tokenizer("미닛 리피터")
    assert tokens[1].pos_ == "X"


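# Cases for spaCy's `Tokenizer` with the Korean defaults, which follow
# UD Korean Kaist tokenization (#10322).
# fmt: off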
SPACY_TOKENIZER_TESTS = [
    ("있다.", "있다 ."),
    ("'예'는", "' 예 ' 는"),
    ("부 (富) 는", "부 ( 富 ) 는"),
    ("부(富)는", "부 ( 富 ) 는"),
    ("1982~1983.", "1982 ~ 1983 ."),
    ("사과·배·복숭아·수박은 모두 과일이다.", "사과 · 배 · 복숭아 · 수박은 모두 과일이다 ."),
    ("그렇구나~", "그렇구나~"),
    ("『9시 반의 당구』,", "『 9시 반의 당구 』 ,"),
]
# fmt: on


@pytest.mark.parametrize("text,expected_tokens", SPACY_TOKENIZER_TESTS)
def test_ko_spacy_tokenizer(ko_tokenizer_tokenizer, text, expected_tokens):
    tokens = [token.text for token in ko_tokenizer_tokenizer(text)]
    assert tokens == expected_tokens.split()