mirror of https://github.com/explosion/spaCy.git
Update v2-2.md [ci skip]
This commit is contained in:
parent aad66d9bb9
commit 685e4b2554

@@ -336,31 +336,39 @@ check if all of your models are up to date, you can run the

</Infobox>
- The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
  labelled UD instead of WikiNER), so its predictions may be very different
  compared to the previous version. The results should be significantly better
  and more generalizable, though.
- The [`spacy download`](/api/cli#download) command does **not** set the
  `--no-deps` pip argument anymore by default, meaning that model package
  dependencies (if available) will now also be downloaded and installed. If
  spaCy (which is also a model dependency) is not installed in the current
  environment, e.g. if a user has built from source, `--no-deps` is added back
  automatically to prevent spaCy from being downloaded and installed again from
  pip.
|
||||
- The built-in
|
||||
[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) converter
|
||||
is now stricter and will raise an error if entities are overlapping (instead
|
||||
of silently skipping them). If your data contains invalid entity annotations,
|
||||
make sure to clean it and resolve conflicts. You can now also use the new
|
||||
`debug-data` command to find problems in your data.
|
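To illustrate what the BILUO scheme looks like and why overlapping offsets are rejected, here is a simplified pure-Python sketch (not spaCy's implementation) that assigns BILUO tags from character offsets and raises on overlaps:

```python
def biluo_tags(token_spans, entities):
    """Assign BILUO tags to tokens given (start, end, label) character
    offsets. Raises ValueError on overlapping entities, mirroring the
    stricter behaviour described above. A simplified sketch only."""
    ordered = sorted(entities)
    for (s1, e1, l1), (s2, e2, l2) in zip(ordered, ordered[1:]):
        if s2 < e1:
            raise ValueError(
                f"Overlapping entities: ({s1}, {e1}, {l1!r}) and "
                f"({s2}, {e2}, {l2!r})"
            )
    tags = ["O"] * len(token_spans)
    for start, end, label in entities:
        # Tokens fully covered by this entity span.
        covered = [i for i, (ts, te) in enumerate(token_spans)
                   if ts >= start and te <= end]
        if not covered:
            continue
        if len(covered) == 1:
            tags[covered[0]] = f"U-{label}"   # Unit: single-token entity
        else:
            tags[covered[0]] = f"B-{label}"   # Begin
            for i in covered[1:-1]:
                tags[i] = f"I-{label}"        # In
            tags[covered[-1]] = f"L-{label}"  # Last
    return tags
```

For example, with the tokens of `"Apple is great"` at spans `[(0, 5), (6, 8), (9, 14)]` and one entity `(0, 5, "ORG")`, this yields `["U-ORG", "O", "O"]`.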
- Pipeline components can now overwrite IOB tags of tokens that are not yet
  part of an entity. Once a token has an `ent_iob` value set, it won't be reset
  to an "unset" state and will always have at least `O` assigned.
  `list(doc.ents)` now actually keeps the annotations on the token level
  consistent, instead of resetting `O` to an empty string.
- The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
  extended and now includes more characters common in various languages. This
  also means that the results it produces may change, depending on your text.
  If you want the previous behaviour with limited characters, set
  `punct_chars=[".", "!", "?"]` on initialization.
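The effect of a `punct_chars` setting can be sketched with a minimal rule-based splitter (a toy illustration of the idea, not the `Sentencizer` implementation): every character in `punct_chars` ends the current sentence.

```python
def split_sentences(text, punct_chars=(".", "!", "?")):
    """Split text after any sentence-final character in punct_chars.
    A toy stand-in for rule-based sentence segmentation."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in punct_chars:
            chunk = text[start:i + 1].strip()
            if chunk:
                sentences.append(chunk)
            start = i + 1
    tail = text[start:].strip()  # Trailing text without final punctuation
    if tail:
        sentences.append(tail)
    return sentences
```

Extending `punct_chars` with characters such as the Devanagari danda `।` changes where sentences are split, which is why the extended defaults can produce different results on the same text.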
||||
- The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
|
||||
and it's now 10× faster. The rewrite also resolved a few subtle bugs
|
||||
with very large terminology lists. So if you were matching large lists, you
|
||||
may see slightly different results – however, the results should now be fully
|
||||
correct. See [this PR](https://github.com/explosion/spaCy/pulls/4309) for more
|
||||
details.
|
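For context, phrase matching means finding every occurrence of whole token sequences in a document. A naive pure-Python version (shown below purely to define the task; it is not the rewritten algorithm) scans every pattern at every position, which is exactly what becomes too slow for very large terminology lists:

```python
def phrase_match(tokens, patterns):
    """Return (start, end, name) for every occurrence of a pattern
    (a token sequence) in tokens. Naive scan over all patterns and
    positions; efficient matchers avoid this per-pattern loop."""
    matches = []
    for name, pat in patterns.items():
        n = len(pat)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == pat:
                matches.append((i, i + n, name))
    return matches
```

For example, `phrase_match(["I", "like", "New", "York"], {"CITY": ["New", "York"]})` returns `[(2, 4, "CITY")]`.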
- Lemmatization tables (rules, exceptions, index and lookups) are now part of
  the `Vocab` and serialized with it. This means that serialized objects
  (`nlp`, pipeline components, vocab) will now include additional data, and
  models