mirror of https://github.com/explosion/spaCy.git
Update Japanese docs and pin for sudachipy
parent 7670df04dd
commit 351f352cdc
@@ -84,7 +84,7 @@ cuda102 =
     cupy-cuda102>=5.0.0b4,<9.0.0
 # Language tokenizers with external dependencies
 ja =
-    sudachipy>=0.4.5
+    sudachipy>=0.4.9
     sudachidict_core>=20200330
 ko =
     natto-py==0.9.0
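The hunk above raises the lower bound for `sudachipy` in the `ja` extra from 0.4.5 to 0.4.9. As a rough illustration of what that bound means for an existing environment, the sketch below checks an installed version against it. The helper names (`release_tuple`, `meets_minimum`, `installed_sudachipy_ok`) are hypothetical and not part of spaCy or SudachiPy, and the comparison looks only at the numeric release components rather than implementing full PEP 440 ordering.

```python
import re
from importlib import metadata

# Lower bound taken from the `ja` extra in this commit.
MIN_SUDACHIPY = "0.4.9"


def release_tuple(version):
    """Return the leading numeric release parts, e.g. '0.4.9' -> (0, 4, 9)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum=MIN_SUDACHIPY):
    """Simplified check: compares numeric release parts only (not full PEP 440)."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_sudachipy_ok():
    """Return True/False for the installed sudachipy, or None if it is absent."""
    try:
        return meets_minimum(metadata.version("sudachipy"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("sudachipy satisfies the pin:", installed_sudachipy_ok())
```

For real dependency resolution, a full version parser such as the one in the `packaging` library would be the safer choice; the tuple comparison here is only meant to make the effect of the pin concrete.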
@@ -199,20 +199,19 @@ nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 >
 > # Load SudachiPy with split mode B
 > cfg = {"split_mode": "B"}
-> nlp = Japanese(meta={"tokenizer": {"config": cfg}})
+> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
 > ```
 
 The Japanese language class uses
 [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
 segmentation and part-of-speech tagging. The default Japanese language class and
-the provided Japanese pipelines use SudachiPy split mode `A`. The `meta`
-argument of the `Japanese` language class can be used to configure the split
-mode to `A`, `B` or `C`.
+the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
+config can be used to configure the split mode to `A`, `B` or `C`.
 
 <Infobox variant="warning">
 
 If you run into errors related to `sudachipy`, which is currently under active
-development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
+development, we suggest downgrading to `sudachipy==0.4.9`, which is the version
 used for training the current [Japanese pipelines](/models/ja).
 
 </Infobox>
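The documentation change above moves the split-mode setting from the `meta` argument to the tokenizer config passed through `Japanese.from_config`. A minimal sketch of building that config for any of the three modes might look like the following. The helper `japanese_tokenizer_config` and the constant `VALID_SPLIT_MODES` are hypothetical names introduced for illustration; only the nested dict shape and the `Japanese.from_config` call come from the diff itself. The spaCy call is guarded so the snippet still runs where spaCy or SudachiPy is not installed.

```python
# Split modes named in the updated docs.
VALID_SPLIT_MODES = ("A", "B", "C")


def japanese_tokenizer_config(split_mode="A"):
    """Build the nested config dict in the shape used by the updated docs."""
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(
            f"split_mode must be one of {VALID_SPLIT_MODES}, got {split_mode!r}"
        )
    return {"nlp": {"tokenizer": {"split_mode": split_mode}}}


if __name__ == "__main__":
    config = japanese_tokenizer_config("B")
    print(config)
    try:
        # Requires spaCy with the `ja` extra (sudachipy, sudachidict_core).
        from spacy.lang.ja import Japanese

        nlp = Japanese.from_config(config)
        print([token.text for token in nlp("選挙管理委員会")])
    except ImportError:
        print("spaCy or SudachiPy is not installed; skipping tokenizer demo")
```

Validating the mode before handing the dict to spaCy is a design choice of this sketch, intended to surface a typo such as `"D"` early rather than relying on whatever error the tokenizer raises.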