spaCy/spacy/lang/zh/__init__.py

# coding: utf8
from __future__ import unicode_literals

from ...language import Language
from ...tokens import Doc


class Chinese(Language):
    lang = 'zh'

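    def make_doc(self, text):
        """Segment `text` with Jieba and return a Doc with one token per word."""
        # Import inside the method so Jieba is only needed when Chinese is used.
        try:
            import jieba
        except ImportError:
            raise ImportError("The Chinese tokenizer requires the Jieba library: "
                              "https://github.com/fxsjy/jieba")
        # cut_all=False selects Jieba's accurate mode rather than full mode.
        words = list(jieba.cut(text, cut_all=False))
        # Drop any empty segments before building the Doc.
        words = [x for x in words if x]
        # Mark every token as having no trailing space.
        return Doc(self.vocab, words=words, spaces=[False]*len(words))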


__all__ = ['Chinese']

Commit history (from the blame view):

* Add initial stuff for Chinese parsing 2016-04-24 16:44:24 +00:00
* Work on Chinese support 2016-05-05 09:39:12 +00:00
Add draft Jieba tokenizer for Chinese 2016-11-02 18:57:38 +00:00
Lazy imports language 2017-05-03 09:01:42 +00:00
Reorganise Chinese language data 2017-05-08 13:54:36 +00:00
Fix relative imports 2017-05-08 20:29:04 +00:00
Port over change from #1323 and tidy up 2017-09-14 17:23:13 +00:00
fixed SyntaxError while checking for jieba 2017-10-16 13:21:33 +00:00
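
The post-processing that `make_doc` applies to the segmenter output can be sketched on its own, without spaCy or Jieba installed. This is only an illustration: `segments_to_tokens` and `example_segments` are hypothetical names that do not exist in spaCy, and the segment list is hand-written rather than real `jieba.cut` output.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def segments_to_tokens(segments):
    """Return the (words, spaces) pair that make_doc would hand to Doc.

    Hypothetical helper for illustration; not part of spaCy.
    """
    # Drop empty segments, mirroring `[x for x in words if x]`.
    words = [s for s in segments if s]
    # One flag per word; False means no trailing space after that token.
    spaces = [False] * len(words)
    return words, spaces


# Hand-written stand-in for the output of jieba.cut(...) (an assumption).
example_segments = ["我", "", "爱", "北京"]
words, spaces = segments_to_tokens(example_segments)
```

Because every entry in `spaces` is `False`, joining the words with no separator gives back the text the segments came from, provided the input contained no whitespace.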