mirror of https://github.com/explosion/spaCy.git
Add pkuseg warnings and auto-format [ci skip]
This commit is contained in:
parent a9e5b840ee
commit 44af53bdd9
@@ -117,6 +117,18 @@ The Chinese language class supports three word segmentation options:
better segmentation for Chinese OntoNotes and the new
[Chinese models](/models/zh).

<Infobox variant="warning">

Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship
with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
install it from our fork and compile it locally:

```bash
$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
```

</Infobox>

<Accordion title="Details on spaCy's PKUSeg API">

The `meta` argument of the `Chinese` language class supports the following
@@ -196,8 +208,8 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo

The Japanese language class uses
[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. The default Japanese language class and
the provided Japanese models use SudachiPy split mode `A`.

The `meta` argument of the `Japanese` language class can be used to configure
the split mode to `A`, `B` or `C`.
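
For instance, a minimal sketch of overriding the default split mode through
`meta` (`split_mode` as the config key is an assumption, following the tokenizer
config pattern shown above for Chinese):

```python
from spacy.lang.ja import Japanese

# Select SudachiPy split mode B instead of the default A
# ("split_mode" as the config key is an assumption)
nlp = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
doc = nlp("日本語の形態素解析を行います。")
print([token.text for token in doc])
```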
@@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with
vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
and Romanian** and updated the training data and vectors for most languages.
Model packages with vectors are about **2×** smaller on disk and load
**2-4×** faster. For the full changelog, see the
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
For more details and a behind-the-scenes look at the new release,
[see our blog post](https://explosion.ai/blog/spacy-v2-3).

### Expanded model families with vectors {#models}

@@ -33,10 +33,10 @@ post](https://explosion.ai/blog/spacy-v2-3).

With new model families for Chinese, Danish, Japanese, Polish and Romanian plus
`md` and `lg` models with word vectors for all languages, this release provides
a total of 46 model packages. For models trained using
[Universal Dependencies](https://universaldependencies.org) corpora, the
training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
and Dutch has been extended to include both UD Dutch Alpino and LassySmall.

<Infobox>

@@ -48,6 +48,7 @@ extended to include both UD Dutch Alpino and LassySmall.
### Chinese {#chinese}

> #### Example
>
> ```python
> from spacy.lang.zh import Chinese
>
@@ -57,41 +58,49 @@ extended to include both UD Dutch Alpino and LassySmall.
>
> # Append words to user dict
> nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
> ```

This release adds support for
[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and
the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The
Chinese tokenizer can be initialized with both `pkuseg` and custom models and
the `pkuseg` user dictionary is easy to customize. Note that
[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
pre-compiled wheels for Python 3.8. See the
[usage documentation](/usage/models#chinese) for details on how to install it on
Python 3.8.
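
As a rough sketch of the initialization described above (`pkuseg_model` appears
in the API details earlier on this page; `require_pkuseg` and the `"default"`
value are assumptions):

```python
from spacy.lang.zh import Chinese

# Initialize the tokenizer with a pkuseg model instead of the default
# segmenter ("default" and "require_pkuseg" are assumed config values)
cfg = {"pkuseg_model": "default", "require_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})

# Customize the pkuseg user dictionary at runtime
nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
```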

<Infobox>

**Models:** [Chinese models](/models/zh) **Usage:**
[Chinese tokenizer usage](/usage/models#chinese)

</Infobox>

### Japanese {#japanese}

The updated Japanese language class switches to
[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
installing spaCy for Japanese, which is now possible with a single command:
`pip install spacy[ja]`.

<Infobox>

**Models:** [Japanese models](/models/ja) **Usage:**
[Japanese tokenizer usage](/usage/models#japanese)

</Infobox>

### Small CLI updates

- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors
  in a base model with `spacy debug-data lang train dev -b base_model`
- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
  `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
  without loading a model
- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
  the first iteration

## Backwards incompatibilities {#incompat}

@@ -100,8 +109,8 @@ installing spaCy for Japanese, which is now possible with a single command:
If you've been training **your own models**, you'll need to **retrain** them
with the new version. Also don't forget to upgrade all models to the latest
versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
with models for v2.3. To check if all of your models are up to date, you can run
the [`spacy validate`](/api/cli#validate) command.

</Infobox>

@@ -116,21 +125,20 @@ run the [`spacy validate`](/api/cli#validate) command.
> directly.

- If you're training new models, you'll want to install the package
  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
  now includes both the lemmatization tables (as in v2.2) and the normalization
  tables (new in v2.3). If you're using pretrained models, **nothing changes**,
  because the relevant tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained
  part-of-speech tags will change for many provided language models. The
  coarse-grained part-of-speech tagset remains the same, but the mapping from
  particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
  tagsets contain new merged tags related to contracted forms, such as `ADP_DET`
  for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This
  increases the accuracy of the models by improving the alignment between
  spaCy's tokenization and Universal Dependencies multi-word tokens used for
  contractions.

### Migrating from spaCy 2.2 {#migrating}

@@ -143,29 +151,28 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
and earlier versions.

A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a
comma at the end of a URL) before applying the match. See the full
[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
[`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
debugging your tokenizer configuration.
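
For example, a quick debugging sketch (the sample text is illustrative):

```python
from spacy.lang.en import English

nlp = English()
# Each tuple pairs the rule that produced a token with the token text,
# which shows whether a URL match or a suffix rule fired
for rule, token_text in nlp.tokenizer.explain("See https://example.com, okay?"):
    print(rule, repr(token_text))
```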
#### Warnings configuration

spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
setting `SPACY_WARNING_IGNORE`, use the
[`warnings` filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.
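
A minimal sketch, assuming you want to suppress these warnings by their spaCy
code (the regex is illustrative):

```python
import warnings

# Ignore any warning whose message starts with a spaCy code like [W008]
warnings.filterwarnings("ignore", message=r"\[W\d+\]")
```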
#### Normalization tables

The normalization tables have moved from the language data in
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the
package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
If you're adding data for a new language, the normalization table should be
added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).

#### Probability and cluster features

@@ -181,28 +188,28 @@ exceptions](/usage/adding-languages#norm-exceptions).

The `Token.prob` and `Token.cluster` features, which are no longer used by the
core pipeline components as of spaCy v2, are no longer provided in the
pretrained models to reduce the model size. To keep these features available for
users relying on them, the `prob` and `cluster` features for the most frequent
1M tokens have been moved to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
`extra` features for the relevant languages (English, German, Greek and
Spanish).

The extra tables are loaded lazily, so if you have `spacy-lookups-data`
installed and your code accesses `Token.prob`, the full table is loaded into the
model vocab, which will take a few seconds on initial loading. When you save
this model after loading the `prob` table, the full `prob` table will be saved
as part of the model vocab.
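
For example (a sketch; assumes an English model and `spacy-lookups-data` are
installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The extra tables load lazily")
# First access lazily loads the full prob table into the vocab, which can
# take a few seconds; saving the model afterwards includes the table
print(doc[0].prob)
```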
If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
[`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
`lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
currently only used to provide a custom `oov_prob`. See examples in the
[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
in `spacy-lookups-data`.
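
A minimal sketch of the `lookups_extra` route (the table values are made-up
placeholders):

```python
from spacy.lookups import Lookups
from spacy.vocab import Vocab

lookups = Lookups()
# Placeholder log-probability values for illustration only
lookups.add_table("lexeme_prob", {"and": -3.9, "the": -3.4})
lookups.add_table("lexeme_settings", {"oov_prob": -19.6})

vocab = Vocab(lookups_extra=lookups)
```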
#### Initializing new models without extra lookups tables