mirror of https://github.com/explosion/spaCy.git
Merge pull request #6180 from adrianeboyd/docs/minor-v3-2 [ci skip]
commit 6d8df081bd
@@ -84,7 +84,7 @@ cuda102 =
     cupy-cuda102>=5.0.0b4,<9.0.0
 # Language tokenizers with external dependencies
 ja =
-    sudachipy>=0.4.5
+    sudachipy>=0.4.9
     sudachidict_core>=20200330
 ko =
     natto-py==0.9.0
@@ -85,7 +85,8 @@ import the `MultiLanguage` class directly, or call
 
 ### Chinese language support {#chinese new=2.3}
 
-The Chinese language class supports three word segmentation options:
+The Chinese language class supports three word segmentation options, `char`,
+`jieba` and `pkuseg`:
 
 > ```python
 > from spacy.lang.zh import Chinese
@@ -95,11 +96,12 @@ The Chinese language class supports three word segmentation options:
 >
 > # Jieba
 > cfg = {"segmenter": "jieba"}
-> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
 >
 > # PKUSeg with "default" model provided by pkuseg
-> cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
-> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+> cfg = {"segmenter": "pkuseg"}
+> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+> nlp.tokenizer.initialize(pkuseg_model="default")
 > ```
 
 1. **Character segmentation:** Character segmentation is the default
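
Read together, the added lines form a complete quickstart. A minimal runnable sketch of the new-side API, assuming spaCy v3 as of this commit with the `pkuseg` package installed (the sample sentence and the print are illustrative):

```python
from spacy.lang.zh import Chinese

# Character segmentation: the v3 default, no extra dependencies needed
nlp = Chinese()

# PKUSeg: pick the segmenter in the tokenizer config, then load model
# weights separately via initialize(), as in the new lines above
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="default")

doc = nlp("这是一个简单的句子")  # "This is a simple sentence."
print([token.text for token in doc])
```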
@@ -116,41 +118,34 @@ The Chinese language class supports three word segmentation options:
 <Infobox variant="warning">
 
 In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to
-character segmentation. Also note that
-[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
-pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
-install it from our fork and compile it locally:
-
-```bash
-$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
-```
+character segmentation.
 
 </Infobox>
 
 <Accordion title="Details on spaCy's Chinese API">
 
-The `meta` argument of the `Chinese` language class supports the following
-following tokenizer config settings:
+The `initialize` method for the Chinese tokenizer class supports the following
+config settings for loading pkuseg models:
 
 | Name | Description |
-| ------------------ | --------------------------------------------------------------------------------------------------------------- |
-| `segmenter` | Word segmenter: `char`, `jieba` or `pkuseg`. Defaults to `char`. ~~str~~ |
-| `pkuseg_model` | **Required for `pkuseg`:** Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
-| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. ~~str~~ |
+| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `pkuseg_model` | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
+| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |
 
 ```python
 ### Examples
+# Initialize the pkuseg tokenizer
+cfg = {"segmenter": "pkuseg"}
+nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+
 # Load "default" model
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "default"}
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="default")
 
 # Load local model
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "/path/to/pkuseg_model"}
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 
 # Override the user directory
-cfg = {"segmenter": "pkuseg", "pkuseg_model": "default", "pkuseg_user_dict": "/path"}
-nlp = Chinese(config={"tokenizer": {"config": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="default", pkuseg_user_dict="/path/to/user_dict")
 ```
 
 You can also modify the user dictionary on-the-fly:
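
As a sketch of the `pkuseg_user_dict` setting from the new table (the file name and its contents are hypothetical; the one-word-per-line format comes from the table's description):

```python
from spacy.lang.zh import Chinese

# Hypothetical user dictionary: one word per line, per the table above
with open("user_dict.txt", "w", encoding="utf8") as f:
    f.write("自然语言处理\n")

cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="default", pkuseg_user_dict="user_dict.txt")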
@@ -185,8 +180,11 @@ from spacy.lang.zh import Chinese
 
 # Train pkuseg model
 pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")
 
 # Load pkuseg model in spaCy Chinese tokenizer
-nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}}})
+cfg = {"segmenter": "pkuseg"}
+nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
+nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 ```
 
 </Accordion>
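
The new-side flow, assembled end to end. A sketch, assuming `pkuseg` is installed, that `train.utf8`/`test.utf8` are pre-segmented corpora in the format `pkuseg.train` expects, and that saving the pipeline also serializes the loaded pkuseg data:

```python
import pkuseg
from spacy.lang.zh import Chinese

# Train a pkuseg model on pre-segmented corpora (paths as in the diff)
pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")

# Load the trained model into the spaCy Chinese tokenizer
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")

# Assumption: persisting the pipeline keeps the pkuseg model with it
nlp.to_disk("/path/to/zh_pipeline")
```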
@@ -201,20 +199,19 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_mo
 >
 > # Load SudachiPy with split mode B
 > cfg = {"split_mode": "B"}
-> nlp = Japanese(meta={"tokenizer": {"config": cfg}})
+> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
 > ```
 
 The Japanese language class uses
 [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
 segmentation and part-of-speech tagging. The default Japanese language class and
-the provided Japanese pipelines use SudachiPy split mode `A`. The `meta`
-argument of the `Japanese` language class can be used to configure the split
-mode to `A`, `B` or `C`.
+the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
+config can be used to configure the split mode to `A`, `B` or `C`.
 
 <Infobox variant="warning">
 
 If you run into errors related to `sudachipy`, which is currently under active
-development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
+development, we suggest downgrading to `sudachipy==0.4.9`, which is the version
 used for training the current [Japanese pipelines](/models/ja).
 
 </Infobox>
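
A runnable version of the new-side Japanese snippet; a sketch assuming the `sudachipy`/`sudachidict_core` pins from the setup.cfg hunk above are installed (the sample sentence is illustrative):

```python
from spacy.lang.ja import Japanese

# Select SudachiPy split mode B (A is the default; C is the coarsest)
cfg = {"split_mode": "B"}
nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})

doc = nlp("自然言語処理は楽しい")  # "NLP is fun."
print([(token.text, token.tag_) for token in doc])
```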
@@ -1124,17 +1124,6 @@ a dictionary with keyword arguments specifying the annotations, like `tags` or
 annotations, the model can be updated to learn a sentence of three words with
 their assigned part-of-speech tags.
 
-> #### About the tag map
->
-> The tag map is part of the vocabulary and defines the annotation scheme. If
-> you're training a new pipeline, this will let you map the tags present in the
-> treebank you train on to spaCy's tag scheme:
->
-> ```python
-> tag_map = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}}
-> vocab = Vocab(tag_map=tag_map)
-> ```
-
 ```python
 words = ["I", "like", "stuff"]
 tags = ["NOUN", "VERB", "NOUN"]
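
The surrounding example continues with these `words`/`tags`; a sketch completing it with the v3 training API the prose describes (assuming `spacy.training.Example`; the blank English pipeline is illustrative):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
words = ["I", "like", "stuff"]
tags = ["NOUN", "VERB", "NOUN"]

# Build an Example from a dict of annotations, as the prose describes
doc = nlp.make_doc("I like stuff")
example = Example.from_dict(doc, {"words": words, "tags": tags})
```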