From 685e4b255407f7a489d2bff7186813c2f9cbdc4c Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Fri, 27 Sep 2019 16:35:01 +0200
Subject: [PATCH] Update v2-2.md [ci skip]

---
 website/docs/usage/v2-2.md | 48 ++++++++++++++++++++++----------------
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/website/docs/usage/v2-2.md b/website/docs/usage/v2-2.md
index ded0404a3..8243f26c3 100644
--- a/website/docs/usage/v2-2.md
+++ b/website/docs/usage/v2-2.md
@@ -336,31 +336,39 @@ check if all of your models are up to date, you can run the
-- The Dutch models have been trained on a new NER corpus (custom labelled UD
-  instead of WikiNER), so their predictions may be very different compared to
-  the previous version. The results should be significantly better and more
-  generalizable, though.
-- The `spacy download` command does **not** set the `--no-deps` pip argument
-  anymore by default, meaning that model package dependencies (if available)
-  will now be also downloaded and installed. If spaCy (which is also a model
-  dependency) is not installed in the current environment, e.g. if a user has
-  built from source, `--no-deps` is added back automatically to prevent spaCy
-  from being downloaded and installed again from pip.
-- The built-in `biluo_tags_from_offsets` converter is now stricter and will
-  raise an error if entities are overlapping (instead of silently skipping
-  them). If your data contains invalid entity annotations, make sure to clean it
-  and resolve conflicts. You can now also use the new `debug-data` command to
-  find problems in your data.
+- The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
+  labelled UD instead of WikiNER), so their predictions may be very different
+  compared to the previous version. The results should be significantly better
+  and more generalizable, though.
+- The [`spacy download`](/api/cli#download) command does **not** set the
+  `--no-deps` pip argument anymore by default, meaning that model package
+  dependencies (if available) will now be also downloaded and installed. If
+  spaCy (which is also a model dependency) is not installed in the current
+  environment, e.g. if a user has built from source, `--no-deps` is added back
+  automatically to prevent spaCy from being downloaded and installed again from
+  pip.
+- The built-in
+  [`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) converter
+  is now stricter and will raise an error if entities are overlapping (instead
+  of silently skipping them). If your data contains invalid entity annotations,
+  make sure to clean it and resolve conflicts. You can now also use the new
+  `debug-data` command to find problems in your data.
 - Pipeline components can now overwrite IOB tags of tokens that are not yet part
   of an entity. Once a token has an `ent_iob` value set, it won't be reset to an
   "unset" state and will always have at least `O` assigned. `list(doc.ents)` now
   actually keeps the annotations on the token level consistent, instead of
   resetting `O` to an empty string.
-- The default punctuation in the `sentencizer` has been extended and now
-  includes more characters common in various languages. This also means that the
-  results it produces may change, depending on your text. If you want the
-  previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]`
-  on initialization.
+- The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
+  extended and now includes more characters common in various languages. This
+  also means that the results it produces may change, depending on your text. If
+  you want the previous behaviour with limited characters, set
+  `punct_chars=[".", "!", "?"]` on initialization.
+- The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
+  and it's now 10× faster. The rewrite also resolved a few subtle bugs
+  with very large terminology lists. So if you were matching large lists, you
+  may see slightly different results – however, the results should now be fully
+  correct. See [this PR](https://github.com/explosion/spaCy/pulls/4309) for more
+  details.
 - Lemmatization tables (rules, exceptions, index and lookups) are now part of
   the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
   pipeline components, vocab) will now include additional data, and models