# coding: utf8
from __future__ import unicode_literals
from ...language import Language
from ...tokens import Doc


class Chinese(Language):
    lang = 'zh'
    def make_doc(self, text):
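        """Tokenize `text` with Jieba and return a spaCy Doc.

        All space flags are set to False, since written Chinese does not
        separate words with whitespace.
        """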
        try:
            import jieba
        except ImportError:
            raise ImportError("The Chinese tokenizer requires the Jieba library: "
"https://github.com/fxsjy/jieba")
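        # Full mode (cut_all=True) yields every word Jieba can find in the
        # text, including overlapping segments.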
        words = list(jieba.cut(text, cut_all=True))
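        # Full mode can yield empty strings; drop them so every token
        # maps to real text.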
        words = [x for x in words if x]
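        # No whitespace between Chinese tokens, so every space flag is False.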
        return Doc(self.vocab, words=words, spaces=[False]*len(words))


__all__ = ['Chinese']
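
# Minimal usage sketch (assumes this package lives at spacy/lang/zh and that
# jieba is installed; the sample sentence is illustrative):
#
#     from spacy.lang.zh import Chinese
#     nlp = Chinese()
#     doc = nlp.make_doc('我爱北京天安门')
#     print([t.text for t in doc])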