Merge pull request #6180 from adrianeboyd/docs/minor-v3-2 [ci skip]

Ines Montani 2020-10-02 11:37:25 +02:00 committed by GitHub
commit 6d8df081bd
3 changed files with 29 additions and 43 deletions

View File

@@ -84,7 +84,7 @@ cuda102 =
     cupy-cuda102>=5.0.0b4,<9.0.0
 # Language tokenizers with external dependencies
 ja =
-    sudachipy>=0.4.5
+    sudachipy>=0.4.9
     sudachidict_core>=20200330
 ko =
     natto-py==0.9.0

View File

@@ -85,7 +85,8 @@ import the `MultiLanguage` class directly, or call

 ### Chinese language support {#chinese new=2.3}

-The Chinese language class supports three word segmentation options:
+The Chinese language class supports three word segmentation options, `char`,
+`jieba` and `pkuseg`:

 > ```python
 > from spacy.lang.zh import Chinese
@@ -95,11 +96,12 @@ The Chinese language class supports three word segmentation options:
 >
 > # Jieba
 > cfg = {"segmenter": "jieba"}
-> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
 >
 > # PKUSeg with "default" model provided by pkuseg
-> cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
-> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+> cfg = {"segmenter": "pkuseg"}
+> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+> nlp.tokenizer.initialize(pkuseg_model="default")
 > ```

 1. **Character segmentation:** Character segmentation is the default
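
For context on the three options, here is a minimal sketch of the default `char` segmenter next to a configured one, following the `from_config` pattern above (the sample sentence is illustrative, and `jieba` must be installed for the non-default mode):

```python
from spacy.lang.zh import Chinese

# Default: character segmentation, no tokenizer config required
nlp = Chinese()
print([t.text for t in nlp("这是一个句子")])  # one token per character

# Jieba word segmentation (requires the jieba package)
cfg = {"segmenter": "jieba"}
nlp_jieba = Chinese.from_config({"nlp": {"tokenizer": cfg}})
print([t.text for t in nlp_jieba("这是一个句子")])
```
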
@@ -116,41 +118,34 @@ The Chinese language class supports three word segmentation options:
 <Infobox variant="warning">

 In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to
-character segmentation. Also note that
-[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
-pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
-install it from our fork and compile it locally:
-
-```bash
-$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
-```
+character segmentation.

 </Infobox>

 <Accordion title="Details on spaCy's Chinese API">

-The `meta` argument of the `Chinese` language class supports the following
-following tokenizer config settings:
+The `initialize` method for the Chinese tokenizer class supports the following
+config settings for loading pkuseg models:

 | Name               | Description |
 | ------------------ | ----------- |
-| `segmenter`        | Word segmenter: `char`, `jieba` or `pkuseg`. Defaults to `char`. ~~str~~ |
-| `pkuseg_model`     | **Required for `pkuseg`:** Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
-| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. ~~str~~ |
+| `pkuseg_model`     | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
+| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |

 ```python
 ### Examples
+# Initialize the pkuseg tokenizer
+cfg = {"segmenter": "pkuseg"}
+nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+
 # Load "default" model
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="default")

 # Load local model
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "/path/to/pkuseg_model"}
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")

 # Override the user directory
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "default", "pkuseg_user_dict": "/path"}
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="default", pkuseg_user_dict="/path/to/user_dict")
 ```

 You can also modify the user dictionary on-the-fly:
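
A sketch of what such an on-the-fly update could look like, assuming the `pkuseg_update_user_dict` helper on the Chinese tokenizer (the word lists are illustrative):

```python
# Append words to the pkuseg user dictionary
nlp.tokenizer.pkuseg_update_user_dict(["中文", "词语"])

# Reset the user dictionary and replace it with new words
nlp.tokenizer.pkuseg_update_user_dict(["中文"], reset=True)
```
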
@@ -185,8 +180,11 @@ from spacy.lang.zh import Chinese
 # Train pkuseg model
 pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")

 # Load pkuseg model in spaCy Chinese tokenizer
-nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}}})
+cfg = {"segmenter": "pkuseg"}
+nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 ```

 </Accordion>
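
A quick way to sanity-check a freshly loaded custom model is to run a sentence through the pipeline; a short sketch under the same assumptions (the path and sentence are placeholders):

```python
from spacy.lang.zh import Chinese

cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")

# Tokens should now reflect the custom pkuseg segmentation
print([t.text for t in nlp("今天天气很好")])
```
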
@@ -201,20 +199,19 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo
 >
 > # Load SudachiPy with split mode B
 > cfg = {"split_mode": "B"}
-> nlp = Japanese(meta={"tokenizer": {"config": cfg}})
+> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
 > ```

 The Japanese language class uses
 [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
 segmentation and part-of-speech tagging. The default Japanese language class and
-the provided Japanese pipelines use SudachiPy split mode `A`. The `meta`
-argument of the `Japanese` language class can be used to configure the split
-mode to `A`, `B` or `C`.
+the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
+config can be used to configure the split mode to `A`, `B` or `C`.

 <Infobox variant="warning">

 If you run into errors related to `sudachipy`, which is currently under active
-development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
+development, we suggest downgrading to `sudachipy==0.4.9`, which is the version
 used for training the current [Japanese pipelines](/models/ja).

 </Infobox>
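
To see what the split modes change in practice, here is a rough sketch comparing the default mode `A` with mode `B` (requires `sudachipy` and `sudachidict_core`; the compound word is illustrative):

```python
from spacy.lang.ja import Japanese

# Default split mode A (shortest units)
nlp_a = Japanese()

# Split mode B via the tokenizer config
cfg = {"split_mode": "B"}
nlp_b = Japanese.from_config({"nlp": {"tokenizer": cfg}})

text = "選挙管理委員会"
print([t.text for t in nlp_a(text)])  # shorter units
print([t.text for t in nlp_b(text)])  # longer units
```
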

View File

@@ -1124,17 +1124,6 @@ a dictionary with keyword arguments specifying the annotations, like `tags` or
 annotations, the model can be updated to learn a sentence of three words with
 their assigned part-of-speech tags.

-> #### About the tag map
->
-> The tag map is part of the vocabulary and defines the annotation scheme. If
-> you're training a new pipeline, this will let you map the tags present in the
-> treebank you train on to spaCy's tag scheme:
->
-> ```python
-> tag_map = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}}
-> vocab = Vocab(tag_map=tag_map)
-> ```
-
 ```python
 words = ["I", "like", "stuff"]
 tags = ["NOUN", "VERB", "NOUN"]
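
A rough sketch of how such a `words`/`tags` pair can be wrapped into a training example under the v3 API (pipeline setup is omitted; the names are illustrative):

```python
from spacy.lang.en import English
from spacy.tokens import Doc
from spacy.training import Example

nlp = English()
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]

# Build a Doc from the words and attach the gold tags via Example
doc = Doc(nlp.vocab, words=words)
example = Example.from_dict(doc, {"tags": tags})
print([t.tag_ for t in example.reference])  # ['NOUN', 'VERB', 'NOUN']
```
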