Add pkuseg warnings and auto-format [ci skip]

Ines Montani 2020-06-16 17:13:35 +02:00
parent a9e5b840ee
commit 44af53bdd9
2 changed files with 78 additions and 59 deletions

View File

@@ -117,6 +117,18 @@ The Chinese language class supports three word segmentation options:
better segmentation for Chinese OntoNotes and the new
[Chinese models](/models/zh).
<Infobox variant="warning">
Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship
with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
install it from our fork and compile it locally:
```bash
$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
```
</Infobox>
<Accordion title="Details on spaCy's PKUSeg API">

The `meta` argument of the `Chinese` language class supports the following
@@ -196,8 +208,8 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo
The Japanese language class uses
[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. The default Japanese language class and
the provided Japanese models use SudachiPy split mode `A`.

The `meta` argument of the `Japanese` language class can be used to configure
the split mode to `A`, `B` or `C`.
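
For example, a minimal sketch, assuming the config key used by the SudachiPy
wrapper is `split_mode`:

```python
from spacy.lang.ja import Japanese

# Use SudachiPy split mode B instead of the default A
nlp = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
```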

View File

@@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with
vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
and Romanian** and updated the training data and vectors for most languages.
Model packages with vectors are about **2&times;** smaller on disk and load
**2-4&times;** faster. For the full changelog, see the
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
For more details and a behind-the-scenes look at the new release,
[see our blog post](https://explosion.ai/blog/spacy-v2-3).
### Expanded model families with vectors {#models}
@@ -33,10 +33,10 @@ post](https://explosion.ai/blog/spacy-v2-3).
With new model families for Chinese, Danish, Japanese, Polish and Romanian plus
`md` and `lg` models with word vectors for all languages, this release provides
a total of 46 model packages. For models trained using
[Universal Dependencies](https://universaldependencies.org) corpora, the
training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
<Infobox>
@@ -48,6 +48,7 @@ extended to include both UD Dutch Alpino and LassySmall.
### Chinese {#chinese}
> #### Example
>
> ```python
> from spacy.lang.zh import Chinese
>
@@ -57,41 +58,49 @@ extended to include both UD Dutch Alpino and LassySmall.
>
> # Append words to user dict
> nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
> ```
This release adds support for
[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation,
and the new Chinese models ship with a custom pkuseg model trained on OntoNotes.
The Chinese tokenizer can be initialized with both `pkuseg` and custom models
and the `pkuseg` user dictionary is easy to customize. Note that
[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
pre-compiled wheels for Python 3.8. See the
[usage documentation](/usage/models#chinese) for details on how to install it on
Python 3.8.
<Infobox>
**Models:** [Chinese models](/models/zh) **Usage:**
[Chinese tokenizer usage](/usage/models#chinese)
</Infobox>
### Japanese {#japanese}
The updated Japanese language class switches to
[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
installing spaCy for Japanese, which is now possible with a single command:
`pip install spacy[ja]`.
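
For instance, once installed, the blank Japanese class can be used directly (the
sample sentence is only an illustration):

```python
from spacy.lang.ja import Japanese

nlp = Japanese()  # word segmentation and part-of-speech tags come from SudachiPy
doc = nlp("これはテストです。")
print([token.text for token in doc])
```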
<Infobox>
**Models:** [Japanese models](/models/ja) **Usage:**
[Japanese tokenizer usage](/usage/models#japanese)
</Infobox>
### Small CLI updates
- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors
  in a base model with `spacy debug-data lang train dev -b base_model`
- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
  `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
  without loading a model
- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
  the first iteration
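
For reference, the commands above could be invoked roughly as follows; the file
and model names (`train.json`, `dev.json`, `en_core_web_md`) are placeholders
for illustration:

```bash
# Check vector coverage of a base model while debugging training data
$ python -m spacy debug-data en train.json dev.json -b en_core_web_md
# Evaluate tokenization accuracy without loading a trained model
$ python -m spacy evaluate blank:en dev.json
```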
## Backwards incompatibilities {#incompat}
@@ -100,8 +109,8 @@ installing spaCy for Japanese, which is now possible with a single command:
If you've been training **your own models**, you'll need to **retrain** them
with the new version. Also don't forget to upgrade all models to the latest
versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
with spaCy v2.3. To check if all of your models are up to date, you can run
the [`spacy validate`](/api/cli#validate) command.
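
For example, from the command line:

```bash
$ python -m spacy validate
```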
</Infobox>
@@ -116,21 +125,20 @@ run the [`spacy validate`](/api/cli#validate) command.
> directly.
- If you're training new models, you'll want to install the package
  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) (see
  the install command after this list), which now includes both the
  lemmatization tables (as in v2.2) and the normalization tables (new in v2.3).
  If you're using pretrained models, **nothing changes**, because the relevant
  tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained
  part-of-speech tags will change for many provided language models. The
  coarse-grained part-of-speech tagset remains the same, but the mapping from
  particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
  tagsets contain new merged tags related to contracted forms, such as `ADP_DET`
  for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This
  increases the accuracy of the models by improving the alignment between
  spaCy's tokenization and Universal Dependencies multi-word tokens used for
  contractions.
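
The lookups package mentioned in the first item above can be installed alongside
spaCy:

```bash
$ pip install spacy-lookups-data
```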
### Migrating from spaCy 2.2 {#migrating}
@@ -143,29 +151,28 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
and earlier versions.
A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a
comma at the end of a URL) before applying the match. See the full
[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
[`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
debugging your tokenizer configuration.
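
As a quick sketch of that debugging workflow (the blank English pipeline and the
input string are just illustrative assumptions):

```python
from spacy.lang.en import English

nlp = English()
# Each tuple pairs the rule that produced a token (prefix, suffix, token_match,
# url_match, etc.) with the token text
for rule, token_text in nlp.tokenizer.explain("Check https://example.com, please!"):
    print(rule, repr(token_text))
```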
#### Warnings configuration
spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
setting `SPACY_WARNING_IGNORE`, use the
[`warnings` filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.
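
For example, a minimal sketch that silences one specific warning by matching its
code at the start of the message (`W008` is only an illustrative code):

```python
import warnings

# Ignore a single spaCy warning by matching the warning code in the message;
# W008 is used here purely as an example
warnings.filterwarnings("ignore", message=r"\[W008\]")
```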
#### Normalization tables
The normalization tables have moved from the language data in
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the
package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
If you're adding data for a new language, the normalization table should be
added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).
#### Probability and cluster features
@@ -181,28 +188,28 @@ exceptions](/usage/adding-languages#norm-exceptions).
The `Token.prob` and `Token.cluster` features, which are no longer used by the
core pipeline components as of spaCy v2, are no longer provided in the
pretrained models to reduce the model size. To keep these features available for
users relying on them, the `prob` and `cluster` features for the most frequent
1M tokens have been moved to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
`extra` features for the relevant languages (English, German, Greek and
Spanish).
The extra tables are loaded lazily, so if you have `spacy-lookups-data`
installed and your code accesses `Token.prob`, the full table is loaded into the
model vocab, which will take a few seconds on initial loading. When you save
this model after loading the `prob` table, the full `prob` table will be saved
as part of the model vocab.
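
A short sketch of that behavior, assuming `spacy-lookups-data` is installed and
using placeholder model and path names:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # placeholder model name
doc = nlp("The cat sat on the mat.")
# Accessing Token.prob lazily loads the full lexeme_prob table into the vocab
print(doc[1].prob)
# Saving the model afterwards also saves the loaded prob table with the vocab
nlp.to_disk("/tmp/model_with_prob")
```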
If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
[`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
`lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
currently only used to provide a custom `oov_prob`. See examples in the
[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
in `spacy-lookups-data`.
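
Here is a minimal sketch of the `lookups_extra` approach with toy values (the
words and probabilities are made up for illustration):

```python
from spacy.lookups import Lookups
from spacy.vocab import Vocab

lookups = Lookups()
# Toy tables; real tables would cover the most frequent tokens
lookups.add_table("lexeme_prob", {"the": -3.0, "cat": -8.0})
lookups.add_table("lexeme_settings", {"oov_prob": -20.0})

# Pass the extra tables to the vocab via the lookups_extra argument
vocab = Vocab(lookups_extra=lookups)
print(vocab["cat"].prob)
```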
#### Initializing new models without extra lookups tables