Update Japanese docs and pin for sudachipy

Adriane Boyd 2020-10-02 10:12:44 +02:00
parent 7670df04dd
commit 351f352cdc
2 changed files with 5 additions and 6 deletions

@@ -84,7 +84,7 @@ cuda102 =
cupy-cuda102>=5.0.0b4,<9.0.0
# Language tokenizers with external dependencies
ja =
-sudachipy>=0.4.5
+sudachipy>=0.4.9
sudachidict_core>=20200330
ko =
natto-py==0.9.0

@@ -199,20 +199,19 @@ nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
>
> # Load SudachiPy with split mode B
> cfg = {"split_mode": "B"}
-> nlp = Japanese(meta={"tokenizer": {"config": cfg}})
+> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
> ```
The Japanese language class uses
[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. The default Japanese language class and
-the provided Japanese pipelines use SudachiPy split mode `A`. The `meta`
-argument of the `Japanese` language class can be used to configure the split
-mode to `A`, `B` or `C`.
+the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
+config can be used to configure the split mode to `A`, `B` or `C`.
<Infobox variant="warning">
If you run into errors related to `sudachipy`, which is currently under active
-development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
+development, we suggest downgrading to `sudachipy==0.4.9`, which is the version
used for training the current [Japanese pipelines](/models/ja).
</Infobox>
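
As a minimal sketch of the tokenizer-config approach documented in this change, the helper below builds the nested config that is passed to `Japanese.from_config` and rejects any split mode other than `A`, `B` or `C`. The helper name `japanese_config` and the constant `VALID_SPLIT_MODES` are hypothetical and not part of spaCy; the commented-out usage assumes an installation with `sudachipy` and `sudachidict_core` available.

```python
# Hypothetical helper (not part of spaCy): builds the config override
# shown in the updated docs, validating the SudachiPy split mode first.
VALID_SPLIT_MODES = ("A", "B", "C")


def japanese_config(split_mode="A"):
    """Return a config dict of the form {"nlp": {"tokenizer": {...}}}."""
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(
            f"split_mode must be one of {VALID_SPLIT_MODES}, got {split_mode!r}"
        )
    return {"nlp": {"tokenizer": {"split_mode": split_mode}}}


config = japanese_config("B")

# Assumed usage, requires spaCy plus sudachipy and sudachidict_core:
# from spacy.lang.ja import Japanese
# nlp = Japanese.from_config(config)
```

Validating the mode before building the pipeline is only a convenience for this sketch; it surfaces a typo in the split mode as a plain `ValueError` rather than relying on whatever error the tokenizer itself would raise.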