mirror of https://github.com/explosion/spaCy.git
Update Chinese usage docs
parent 3908fff899
commit 7670df04dd
@@ -85,7 +85,8 @@ import the `MultiLanguage` class directly, or call

### Chinese language support {#chinese new=2.3}

The Chinese language class supports three word segmentation options:
The Chinese language class supports three word segmentation options, `char`,
`jieba` and `pkuseg`:

> ```python
> from spacy.lang.zh import Chinese

@@ -95,11 +96,12 @@ The Chinese language class supports three word segmentation options:
>
> # Jieba
> cfg = {"segmenter": "jieba"}
> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
>
> # PKUSeg with "default" model provided by pkuseg
> cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
> cfg = {"segmenter": "pkuseg"}
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
> nlp.tokenizer.initialize(pkuseg_model="default")
> ```

1. **Character segmentation:** Character segmentation is the default
@@ -116,41 +118,34 @@ The Chinese language class supports three word segmentation options:
<Infobox variant="warning">

In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to
character segmentation. Also note that
[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
install it from our fork and compile it locally:

```bash
$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
```
character segmentation.

</Infobox>
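
As a rough illustration of what character segmentation means, the sketch below splits a string into one token per character in plain Python. The helper `char_segment` is a hypothetical name and is not spaCy's implementation; in particular, dropping whitespace is a simplifying assumption made here, and spaCy's tokenizer may treat whitespace differently.

```python
from typing import List


def char_segment(text: str) -> List[str]:
    """Return one token per non-whitespace character (conceptual sketch only).

    Assumption: whitespace is simply dropped. This is NOT how spaCy's
    Chinese tokenizer is implemented; it only illustrates the idea.
    """
    return [ch for ch in text if not ch.isspace()]


tokens = char_segment("我爱 北京")
print(tokens)
```

Compared with a dictionary-based segmenter such as Jieba, this kind of splitting needs no extra dependency or model, which is presumably part of why it makes a convenient default.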

<Accordion title="Details on spaCy's Chinese API">

The `meta` argument of the `Chinese` language class supports the following
tokenizer config settings:
The `initialize` method for the Chinese tokenizer class supports the following
config settings for loading pkuseg models:

| Name               | Description                                                                                                     |
| ------------------ | --------------------------------------------------------------------------------------------------------------- |
| `segmenter`        | Word segmenter: `char`, `jieba` or `pkuseg`. Defaults to `char`. ~~str~~                                        |
| `pkuseg_model`     | **Required for `pkuseg`:** Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. ~~str~~    |

| Name               | Description                                                                                                                           |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| `pkuseg_model`     | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~                                                  |
| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |

```python
### Examples
# Initialize the pkuseg tokenizer
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

# Load "default" model
cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
nlp = Chinese(config={"tokenizer": {"config": cfg}})
nlp.tokenizer.initialize(pkuseg_model="default")

# Load local model
cfg = {"segmenter": "pkuseg", "pkuseg_model": "/path/to/pkuseg_model"}
nlp = Chinese(config={"tokenizer": {"config": cfg}})
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")

# Override the user dictionary
cfg = {"segmenter": "pkuseg", "pkuseg_model": "default", "pkuseg_user_dict": "/path"}
nlp = Chinese(config={"tokenizer": {"config": cfg}})
nlp.tokenizer.initialize(pkuseg_model="default", pkuseg_user_dict="/path/to/user_dict")
```

You can also modify the user dictionary on-the-fly:

@@ -185,8 +180,11 @@ from spacy.lang.zh import Chinese
# Train pkuseg model
pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")

# Load pkuseg model in spaCy Chinese tokenizer
nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}}})
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
```

</Accordion>
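
The pattern that recurs in the updated examples above, picking a segmenter and wrapping it in the nested config passed to `Chinese.from_config`, could be sketched as a small helper. `build_zh_config` and `SEGMENTERS` are hypothetical names introduced here for illustration; only the three segmenter names and the `{"nlp": {"tokenizer": ...}}` shape are taken from the docs in this diff. The spaCy calls are left as comments so the sketch runs without spaCy installed.

```python
# Segmenter names listed in the docs changed by this commit.
SEGMENTERS = ("char", "jieba", "pkuseg")


def build_zh_config(segmenter: str = "char") -> dict:
    """Build the nested config dict shown in the docs (hypothetical helper)."""
    if segmenter not in SEGMENTERS:
        raise ValueError(
            f"Unknown segmenter {segmenter!r}; expected one of {SEGMENTERS}"
        )
    return {"nlp": {"tokenizer": {"segmenter": segmenter}}}


config = build_zh_config("pkuseg")
print(config)

# With spaCy installed, the docs in this diff suggest using it like this:
#   from spacy.lang.zh import Chinese
#   nlp = Chinese.from_config(config)
#   nlp.tokenizer.initialize(pkuseg_model="default")
```

Validating the segmenter name up front is a design choice of this sketch; it surfaces a typo before any model loading is attempted.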