From 44af53bdd93713b24ac28459c5d2543f03c47a18 Mon Sep 17 00:00:00 2001 From: Ines Montani Date: Tue, 16 Jun 2020 17:13:35 +0200 Subject: [PATCH] Add pkuseg warnings and auto-format [ci skip] --- website/docs/usage/models.md | 16 ++++- website/docs/usage/v2-3.md | 121 ++++++++++++++++++----------------- 2 files changed, 78 insertions(+), 59 deletions(-) diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md index 382193157..4549e8433 100644 --- a/website/docs/usage/models.md +++ b/website/docs/usage/models.md @@ -117,6 +117,18 @@ The Chinese language class supports three word segmentation options: better segmentation for Chinese OntoNotes and the new [Chinese models](/models/zh). + + +Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship +with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can +install it from our fork and compile it locally: + +```bash +$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip +``` + + + The `meta` argument of the `Chinese` language class supports the following @@ -196,8 +208,8 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo The Japanese language class uses [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word -segmentation and part-of-speech tagging. The default Japanese language class -and the provided Japanese models use SudachiPy split mode `A`. +segmentation and part-of-speech tagging. The default Japanese language class and +the provided Japanese models use SudachiPy split mode `A`. The `meta` argument of the `Japanese` language class can be used to configure the split mode to `A`, `B` or `C`. diff --git a/website/docs/usage/v2-3.md b/website/docs/usage/v2-3.md index ba75b01ab..d59b50a6e 100644 --- a/website/docs/usage/v2-3.md +++ b/website/docs/usage/v2-3.md @@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with vectors. 
We've added pretrained models for **Chinese, Danish, Japanese, Polish and Romanian** and updated the training data and vectors for most languages. Model packages with vectors are about **2×** smaller on disk and load -**2-4×** faster. For the full changelog, see the [release notes on -GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more -details and a behind-the-scenes look at the new release, [see our blog -post](https://explosion.ai/blog/spacy-v2-3). +**2-4×** faster. For the full changelog, see the +[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). +For more details and a behind-the-scenes look at the new release, +[see our blog post](https://explosion.ai/blog/spacy-v2-3). ### Expanded model families with vectors {#models} @@ -33,10 +33,10 @@ post](https://explosion.ai/blog/spacy-v2-3). With new model families for Chinese, Danish, Japanese, Polish and Romanian plus `md` and `lg` models with word vectors for all languages, this release provides -a total of 46 model packages. For models trained using [Universal -Dependencies](https://universaldependencies.org) corpora, the training data has -been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been -extended to include both UD Dutch Alpino and LassySmall. +a total of 46 model packages. For models trained using +[Universal Dependencies](https://universaldependencies.org) corpora, the +training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) +and Dutch has been extended to include both UD Dutch Alpino and LassySmall. @@ -48,6 +48,7 @@ extended to include both UD Dutch Alpino and LassySmall. ### Chinese {#chinese} > #### Example +> > ```python > from spacy.lang.zh import Chinese > @@ -57,41 +58,49 @@ extended to include both UD Dutch Alpino and LassySmall. 
> > # Append words to user dict > nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"]) +> ``` This release adds support for -[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation and -the new Chinese models ship with a custom pkuseg model trained on OntoNotes. -The Chinese tokenizer can be initialized with both `pkuseg` and custom models -and the `pkuseg` user dictionary is easy to customize. +[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and +the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The +Chinese tokenizer can be initialized with both `pkuseg` and custom models and +the `pkuseg` user dictionary is easy to customize. Note that +[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with +pre-compiled wheels for Python 3.8. See the +[usage documentation](/usage/models#chinese) for details on how to install it on +Python 3.8. -**Chinese:** [Chinese tokenizer usage](/usage/models#chinese) +**Models:** [Chinese models](/models/zh) **Usage:** +[Chinese tokenizer usage](/usage/models#chinese) ### Japanese {#japanese} The updated Japanese language class switches to -[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word -segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies +[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word +segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies installing spaCy for Japanese, which is now possible with a single command: `pip install spacy[ja]`. -**Japanese:** [Japanese tokenizer usage](/usage/models#japanese) +**Models:** [Japanese models](/models/ja) **Usage:** +[Japanese tokenizer usage](/usage/models#japanese) ### Small CLI updates -- `spacy debug-data` provides the coverage of the vectors in a base model with - `spacy debug-data lang train dev -b base_model` -- `spacy evaluate` supports `blank:lg` (e.g. 
`spacy evaluate blank:en - dev.json`) to evaluate the tokenization accuracy without loading a model -- `spacy train` on GPU restricts the CPU timing evaluation to the first - iteration +- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors + in a base model with `spacy debug-data lang train dev -b base_model` +- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g. + `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy + without loading a model +- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to + the first iteration ## Backwards incompatibilities {#incompat} @@ -100,8 +109,8 @@ installing spaCy for Japanese, which is now possible with a single command: If you've been training **your own models**, you'll need to **retrain** them with the new version. Also don't forget to upgrade all models to the latest versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible -with models for v2.3. To check if all of your models are up to date, you can -run the [`spacy validate`](/api/cli#validate) command. +with models for v2.3. To check if all of your models are up to date, you can run +the [`spacy validate`](/api/cli#validate) command. @@ -116,21 +125,20 @@ run the [`spacy validate`](/api/cli#validate) command. > directly. - If you're training new models, you'll want to install the package - [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), - which now includes both the lemmatization tables (as in v2.2) and the - normalization tables (new in v2.3). If you're using pretrained models, - **nothing changes**, because the relevant tables are included in the model - packages. + [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which + now includes both the lemmatization tables (as in v2.2) and the normalization + tables (new in v2.3). 
If you're using pretrained models, **nothing changes**, + because the relevant tables are included in the model packages. - Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences. - For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech - tagsets contain new merged tags related to contracted forms, such as - `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head - `"à"`. This increases the accuracy of the models by improving the alignment - between spaCy's tokenization and Universal Dependencies multi-word tokens - used for contractions. + tagsets contain new merged tags related to contracted forms, such as `ADP_DET` + for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This + increases the accuracy of the models by improving the alignment between + spaCy's tokenization and Universal Dependencies multi-word tokens used for + contractions. ### Migrating from spaCy 2.2 {#migrating} @@ -143,29 +151,28 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1 and earlier versions. A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle -cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., -a comma at the end of a URL) before applying the match. See the full [tokenizer -documentation](/usage/linguistic-features#tokenization) and try out +cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a +comma at the end of a URL) before applying the match. See the full +[tokenizer documentation](/usage/linguistic-features#tokenization) and try out [`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when debugging your tokenizer configuration. 
#### Warnings configuration -spaCy's custom warnings have been replaced with native python +spaCy's custom warnings have been replaced with native Python [`warnings`](https://docs.python.org/3/library/warnings.html). Instead of -setting `SPACY_WARNING_IGNORE`, use the [warnings -filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter) +setting `SPACY_WARNING_IGNORE`, use the +[`warnings` filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter) to manage warnings. #### Normalization tables The normalization tables have moved from the language data in -[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to -the package -[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If -you're adding data for a new language, the normalization table should be added -to `spacy-lookups-data`. See [adding norm -exceptions](/usage/adding-languages#norm-exceptions). +[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the +package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). +If you're adding data for a new language, the normalization table should be +added to `spacy-lookups-data`. See +[adding norm exceptions](/usage/adding-languages#norm-exceptions). #### Probability and cluster features @@ -181,28 +188,28 @@ exceptions](/usage/adding-languages#norm-exceptions). The `Token.prob` and `Token.cluster` features, which are no longer used by the core pipeline components as of spaCy v2, are no longer provided in the -pretrained models to reduce the model size. To keep these features available -for users relying on them, the `prob` and `cluster` features for the most -frequent 1M tokens have been moved to +pretrained models to reduce the model size. 
To keep these features available for +users relying on them, the `prob` and `cluster` features for the most frequent +1M tokens have been moved to [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as `extra` features for the relevant languages (English, German, Greek and Spanish). The extra tables are loaded lazily, so if you have `spacy-lookups-data` -installed and your code accesses `Token.prob`, the full table is loaded into -the model vocab, which will take a few seconds on initial loading. When you -save this model after loading the `prob` table, the full `prob` table will be -saved as part of the model vocab. +installed and your code accesses `Token.prob`, the full table is loaded into the +model vocab, which will take a few seconds on initial loading. When you save +this model after loading the `prob` table, the full `prob` table will be saved +as part of the model vocab. -If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as -part of a new model, add the data to +If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part +of a new model, add the data to [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a [`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`, `lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is -currently only used to provide a custom `oov_prob`. See examples in the [`data` -directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data) +currently only used to provide a custom `oov_prob`. See examples in the +[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data) in `spacy-lookups-data`. #### Initializing new models without extra lookups tables