Add pkuseg warnings and auto-format [ci skip]

2020-06-16 17:13:35 +02:00 · 2020-06-16 17:13:35 +02:00 · 44af53bdd9
parent a9e5b840ee
commit 44af53bdd9
2 changed files with 78 additions and 59 deletions
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@@ -117,6 +117,18 @@ The Chinese language class supports three word segmentation options:
   better segmentation for Chinese OntoNotes and the new
   [Chinese models](/models/zh).

+<Infobox variant="warning">
+
+Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship
+with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
+install it from our fork and compile it locally:
+
+```bash
+$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
+```
+
+</Infobox>
+
 <Accordion title="Details on spaCy's PKUSeg API">

 The `meta` argument of the `Chinese` language class supports the following
@@ -196,8 +208,8 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo

 The Japanese language class uses
 [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
-segmentation and part-of-speech tagging. The default Japanese language class
-and the provided Japanese models use SudachiPy split mode `A`.
+segmentation and part-of-speech tagging. The default Japanese language class and
+the provided Japanese models use SudachiPy split mode `A`.

 The `meta` argument of the `Japanese` language class can be used to configure
 the split mode to `A`, `B` or `C`.
--- a/website/docs/usage/v2-3.md
+++ b/website/docs/usage/v2-3.md
@@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with
 vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
 and Romanian** and updated the training data and vectors for most languages.
 Model packages with vectors are about **2&times;** smaller on disk and load
-**2-4&times;** faster. For the full changelog, see the [release notes on
-GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more
-details and a behind-the-scenes look at the new release, [see our blog
-post](https://explosion.ai/blog/spacy-v2-3).
+**2-4&times;** faster. For the full changelog, see the
+[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
+For more details and a behind-the-scenes look at the new release,
+[see our blog post](https://explosion.ai/blog/spacy-v2-3).

 ### Expanded model families with vectors {#models}

@@ -33,10 +33,10 @@ post](https://explosion.ai/blog/spacy-v2-3).

 With new model families for Chinese, Danish, Japanese, Polish and Romanian plus
 `md` and `lg` models with word vectors for all languages, this release provides
-a total of 46 model packages. For models trained using [Universal
-Dependencies](https://universaldependencies.org) corpora, the training data has
-been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been
-extended to include both UD Dutch Alpino and LassySmall.
+a total of 46 model packages. For models trained using
+[Universal Dependencies](https://universaldependencies.org) corpora, the
+training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
+and Dutch has been extended to include both UD Dutch Alpino and LassySmall.

 <Infobox>

@@ -48,6 +48,7 @@ extended to include both UD Dutch Alpino and LassySmall.
 ### Chinese {#chinese}

 > #### Example
+>
 > ```python
 > from spacy.lang.zh import Chinese
 >
@@ -57,41 +58,49 @@ extended to include both UD Dutch Alpino and LassySmall.
 >
 > # Append words to user dict
 > nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
+> ```

 This release adds support for
-[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation and
-the new Chinese models ship with a custom pkuseg model trained on OntoNotes.
-The Chinese tokenizer can be initialized with both `pkuseg` and custom models
-and the `pkuseg` user dictionary is easy to customize.
+[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and
+the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The
+Chinese tokenizer can be initialized with both `pkuseg` and custom models and
+the `pkuseg` user dictionary is easy to customize. Note that
+[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
+pre-compiled wheels for Python 3.8. See the
+[usage documentation](/usage/models#chinese) for details on how to install it on
+Python 3.8.

 <Infobox>

-**Chinese:** [Chinese tokenizer usage](/usage/models#chinese)
+**Models:** [Chinese models](/models/zh) **Usage:**
+[Chinese tokenizer usage](/usage/models#chinese)

 </Infobox>

 ### Japanese {#japanese}

 The updated Japanese language class switches to
-[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
-segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies
+[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
+segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
 installing spaCy for Japanese, which is now possible with a single command:
 `pip install spacy[ja]`.

 <Infobox>

-**Japanese:** [Japanese tokenizer usage](/usage/models#japanese)
+**Models:** [Japanese models](/models/ja) **Usage:**
+[Japanese tokenizer usage](/usage/models#japanese)

 </Infobox>

 ### Small CLI updates

-- `spacy debug-data` provides the coverage of the vectors in a base model with
-  `spacy debug-data lang train dev -b base_model`
-- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en
-  dev.json`) to evaluate the tokenization accuracy without loading a model
-- `spacy train` on GPU restricts the CPU timing evaluation to the first
-  iteration
+- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors
+  in a base model with `spacy debug-data lang train dev -b base_model`
+- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
+  `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
+  without loading a model
+- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
+  the first iteration

 ## Backwards incompatibilities {#incompat}

@@ -100,8 +109,8 @@ installing spaCy for Japanese, which is now possible with a single command:
 If you've been training **your own models**, you'll need to **retrain** them
 with the new version. Also don't forget to upgrade all models to the latest
 versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
-with models for v2.3. To check if all of your models are up to date, you can
-run the [`spacy validate`](/api/cli#validate) command.
+with models for v2.3. To check if all of your models are up to date, you can run
+the [`spacy validate`](/api/cli#validate) command.

 </Infobox>

@@ -116,21 +125,20 @@ run the [`spacy validate`](/api/cli#validate) command.
 > directly.

 - If you're training new models, you'll want to install the package
-  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data),
-  which now includes both the lemmatization tables (as in v2.2) and the
-  normalization tables (new in v2.3). If you're using pretrained models,
-  **nothing changes**, because the relevant tables are included in the model
-  packages.
+  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+  now includes both the lemmatization tables (as in v2.2) and the normalization
+  tables (new in v2.3). If you're using pretrained models, **nothing changes**,
+  because the relevant tables are included in the model packages.
 - Due to the updated Universal Dependencies training data, the fine-grained
  part-of-speech tags will change for many provided language models. The
  coarse-grained part-of-speech tagset remains the same, but the mapping from
  particular fine-grained to coarse-grained tags may show minor differences.
 - For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
-  tagsets contain new merged tags related to contracted forms, such as
-  `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head
-  `"à"`. This increases the accuracy of the models by improving the alignment
-  between spaCy's tokenization and Universal Dependencies multi-word tokens
-  used for contractions.
+  tagsets contain new merged tags related to contracted forms, such as `ADP_DET`
+  for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This
+  increases the accuracy of the models by improving the alignment between
+  spaCy's tokenization and Universal Dependencies multi-word tokens used for
+  contractions.

 ### Migrating from spaCy 2.2 {#migrating}

@@ -143,29 +151,28 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
 and earlier versions.

 A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
-cases like URLs where the tokenizer should remove prefixes and suffixes (e.g.,
-a comma at the end of a URL) before applying the match. See the full [tokenizer
-documentation](/usage/linguistic-features#tokenization) and try out
+cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a
+comma at the end of a URL) before applying the match. See the full
+[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
 [`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
 debugging your tokenizer configuration.

 #### Warnings configuration

-spaCy's custom warnings have been replaced with native python
+spaCy's custom warnings have been replaced with native Python
 [`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
-setting `SPACY_WARNING_IGNORE`, use the [warnings
-filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
+setting `SPACY_WARNING_IGNORE`, use the
+[`warnings` filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
 to manage warnings.
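
As a rough illustration of the warnings filter mentioned above, the sketch below silences one warning by matching its message prefix. It assumes spaCy warning messages start with a bracketed code such as `[W007]`. That code, the message texts and the helper `emit_spacy_style_warning` are hypothetical stand-ins so the snippet runs without spaCy installed.

```python
import warnings


def emit_spacy_style_warning():
    # Hypothetical stand-in for a warning raised inside spaCy; the "[W007]"
    # prefix and the text are assumptions made for illustration only.
    warnings.warn("[W007] Example warning text (hypothetical).", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    # Record every warning by default so the effect of the filter is visible.
    warnings.simplefilter("always")
    # The `message` argument is a regex matched against the start of the
    # warning text, so the brackets need escaping.
    warnings.filterwarnings("ignore", message=r"\[W007\]")

    emit_spacy_style_warning()  # suppressed by the filter above
    warnings.warn("[W008] Another example warning (hypothetical).", UserWarning)

# Only the warning that did not match the ignore filter was recorded.
remaining = [str(w.message) for w in caught]
print(remaining)
```

In application code the `filterwarnings` call would typically run once at startup, outside of `catch_warnings`, which is only used here to keep the example self-contained.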

 #### Normalization tables

 The normalization tables have moved from the language data in
-[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to
-the package
-[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If
-you're adding data for a new language, the normalization table should be added
-to `spacy-lookups-data`. See [adding norm
-exceptions](/usage/adding-languages#norm-exceptions).
+[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the
+package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
+If you're adding data for a new language, the normalization table should be
+added to `spacy-lookups-data`. See
+[adding norm exceptions](/usage/adding-languages#norm-exceptions).

 #### Probability and cluster features

@@ -181,28 +188,28 @@ exceptions](/usage/adding-languages#norm-exceptions).

 The `Token.prob` and `Token.cluster` features, which are no longer used by the
 core pipeline components as of spaCy v2, are no longer provided in the
-pretrained models to reduce the model size. To keep these features available
-for users relying on them, the `prob` and `cluster` features for the most
-frequent 1M tokens have been moved to
+pretrained models to reduce the model size. To keep these features available for
+users relying on them, the `prob` and `cluster` features for the most frequent
+1M tokens have been moved to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
 `extra` features for the relevant languages (English, German, Greek and
 Spanish).

 The extra tables are loaded lazily, so if you have `spacy-lookups-data`
-installed and your code accesses `Token.prob`, the full table is loaded into
-the model vocab, which will take a few seconds on initial loading. When you
-save this model after loading the `prob` table, the full `prob` table will be
-saved as part of the model vocab.
+installed and your code accesses `Token.prob`, the full table is loaded into the
+model vocab, which will take a few seconds on initial loading. When you save
+this model after loading the `prob` table, the full `prob` table will be saved
+as part of the model vocab.

-If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as
-part of a new model, add the data to
+If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
+of a new model, add the data to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
 the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
 initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
 [`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
 `lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
-currently only used to provide a custom `oov_prob`. See examples in the [`data`
-directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
+currently only used to provide a custom `oov_prob`. See examples in the
+[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
 in `spacy-lookups-data`.

 #### Initializing new models without extra lookups tables