From 351f352cdc7ffe2d6c41675e45c1d75ec84180c8 Mon Sep 17 00:00:00 2001
From: Adriane Boyd
Date: Fri, 2 Oct 2020 10:12:44 +0200
Subject: [PATCH] Update Japanese docs and pin for sudachipy

---
 setup.cfg                    | 2 +-
 website/docs/usage/models.md | 9 ++++-----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/setup.cfg b/setup.cfg
index 36ab64bd9..babe5fe8b 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -84,7 +84,7 @@ cuda102 =
     cupy-cuda102>=5.0.0b4,<9.0.0
 # Language tokenizers with external dependencies
 ja =
-    sudachipy>=0.4.5
+    sudachipy>=0.4.9
     sudachidict_core>=20200330
 ko =
     natto-py==0.9.0
diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md
index 5e9bd688f..6792f691c 100644
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@@ -199,20 +199,19 @@ nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 >
 > # Load SudachiPy with split mode B
 > cfg = {"split_mode": "B"}
-> nlp = Japanese(meta={"tokenizer": {"config": cfg}})
+> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
 > ```

 The Japanese language class uses
 [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
 segmentation and part-of-speech tagging. The default Japanese language class and
-the provided Japanese pipelines use SudachiPy split mode `A`. The `meta`
-argument of the `Japanese` language class can be used to configure the split
-mode to `A`, `B` or `C`.
+the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
+config can be used to configure the split mode to `A`, `B` or `C`.

 If you run into errors related to `sudachipy`, which is currently under active
-development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
+development, we suggest downgrading to `sudachipy==0.4.9`, which is the version
 used for training the current [Japanese pipelines](/models/ja).
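
Reviewer note: the `ja` extra now requires `sudachipy>=0.4.9`. As a rough illustration of what that lower bound means for an existing environment, the sketch below compares a version string against it using only the standard library. The helper names (`parse_release`, `satisfies_pin`, `installed_sudachipy_version`) are hypothetical and not part of spaCy or SudachiPy, and the comparison is a simplification that only looks at the leading numeric release segment, so it is not a full PEP 440 check.

```python
import re
from importlib import metadata

# Lower bound taken from the `ja` extra in this patch (sudachipy>=0.4.9).
MIN_SUDACHIPY = (0, 4, 9)


def parse_release(version):
    """Return the leading numeric release of a version string as a tuple of ints.

    Simplification: suffixes such as 'rc1' or '.dev1' are ignored, so a
    pre-release compares equal to its final release.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def satisfies_pin(version, minimum=MIN_SUDACHIPY):
    """Check whether a version string meets the minimum release tuple."""
    return parse_release(version) >= minimum


def installed_sudachipy_version():
    """Return the installed sudachipy version string, or None if it is absent."""
    try:
        return metadata.version("sudachipy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_sudachipy_version()
    if found is None:
        print("sudachipy is not installed")
    elif satisfies_pin(found):
        print(f"sudachipy {found} meets the pin")
    else:
        print(f"sudachipy {found} is older than the pin")
```

For anything beyond a quick local check, a proper version parser would be a safer choice than this tuple comparison.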