mirror of https://github.com/explosion/spaCy.git
Update v2-2.md [ci skip]
parent aad66d9bb9
commit 685e4b2554
@@ -336,31 +336,39 @@ check if all of your models are up to date, you can run the
 
 </Infobox>
 
-- The Dutch models have been trained on a new NER corpus (custom labelled UD
-  instead of WikiNER), so their predictions may be very different compared to
-  the previous version. The results should be significantly better and more
-  generalizable, though.
-- The `spacy download` command does **not** set the `--no-deps` pip argument
-  anymore by default, meaning that model package dependencies (if available)
-  will now be also downloaded and installed. If spaCy (which is also a model
-  dependency) is not installed in the current environment, e.g. if a user has
-  built from source, `--no-deps` is added back automatically to prevent spaCy
-  from being downloaded and installed again from pip.
-- The built-in `biluo_tags_from_offsets` converter is now stricter and will
-  raise an error if entities are overlapping (instead of silently skipping
-  them). If your data contains invalid entity annotations, make sure to clean it
-  and resolve conflicts. You can now also use the new `debug-data` command to
-  find problems in your data.
+- The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
+  labelled UD instead of WikiNER), so their predictions may be very different
+  compared to the previous version. The results should be significantly better
+  and more generalizable, though.
+- The [`spacy download`](/api/cli#download) command does **not** set the
+  `--no-deps` pip argument anymore by default, meaning that model package
+  dependencies (if available) will now be also downloaded and installed. If
+  spaCy (which is also a model dependency) is not installed in the current
+  environment, e.g. if a user has built from source, `--no-deps` is added back
+  automatically to prevent spaCy from being downloaded and installed again from
+  pip.
+- The built-in
+  [`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) converter
+  is now stricter and will raise an error if entities are overlapping (instead
+  of silently skipping them). If your data contains invalid entity annotations,
+  make sure to clean it and resolve conflicts. You can now also use the new
+  `debug-data` command to find problems in your data.
 - Pipeline components can now overwrite IOB tags of tokens that are not yet part
   of an entity. Once a token has an `ent_iob` value set, it won't be reset to an
   "unset" state and will always have at least `O` assigned. `list(doc.ents)` now
   actually keeps the annotations on the token level consistent, instead of
   resetting `O` to an empty string.
-- The default punctuation in the `sentencizer` has been extended and now
-  includes more characters common in various languages. This also means that the
-  results it produces may change, depending on your text. If you want the
-  previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]`
-  on initialization.
+- The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
+  extended and now includes more characters common in various languages. This
+  also means that the results it produces may change, depending on your text. If
+  you want the previous behaviour with limited characters, set
+  `punct_chars=[".", "!", "?"]` on initialization.
+- The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
+  and it's now 10× faster. The rewrite also resolved a few subtle bugs
+  with very large terminology lists. So if you were matching large lists, you
+  may see slightly different results – however, the results should now be fully
+  correct. See [this PR](https://github.com/explosion/spaCy/pulls/4309) for more
+  details.
 - Lemmatization tables (rules, exceptions, index and lookups) are now part of
   the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
   pipeline components, vocab) will now include additional data, and models
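
The `biluo_tags_from_offsets` note in this change asks users to clean overlapping entity annotations before conversion. A minimal sketch of such a pre-check follows. It does not call spaCy at all: `find_overlapping_entities` is a hypothetical helper written for illustration, and it assumes entities are `(start_char, end_char, label)` tuples with an exclusive end offset.

```python
from typing import List, Tuple

# Assumed annotation shape: (start_char, end_char, label), end offset exclusive.
Entity = Tuple[int, int, str]


def find_overlapping_entities(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Return every pair of entity spans whose character ranges overlap.

    Hypothetical helper, not part of spaCy.
    """
    conflicts = []
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] >= current[1]:
                # Spans are sorted by start offset, so no later span
                # can overlap `current` once one starts at or after its end.
                break
            conflicts.append((current, other))
    return conflicts


entities = [(0, 5, "ORG"), (3, 9, "PERSON"), (12, 18, "GPE")]
conflicts = find_overlapping_entities(entities)
for first, second in conflicts:
    print("Overlap:", first, second)
```

Running a check like this over training data before conversion would surface the conflicting spans, so they can be resolved by hand rather than triggering the error the stricter converter now raises.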
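
The `Sentencizer` note in this change says that a wider default punctuation set can change sentence boundaries. The toy sketch below illustrates that effect on a plain string. It is a simplified stand-in, not spaCy's implementation, which works on tokens: `split_sentences` is a hypothetical function, and the "wider" set here adds only `。` as an assumed example of an extra character.

```python
from typing import List, Sequence

# The limited set quoted in the change note.
LEGACY_PUNCT_CHARS = [".", "!", "?"]
# Assumption for illustration only: one extra character standing in
# for the extended default set.
WIDER_PUNCT_CHARS = LEGACY_PUNCT_CHARS + ["。"]


def split_sentences(text: str, punct_chars: Sequence[str]) -> List[str]:
    """Split text after each character found in punct_chars (toy version)."""
    sentences = []
    start = 0
    for i, char in enumerate(text):
        if char in punct_chars:
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    # Keep any trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Hello world. 你好。再见。"
legacy = split_sentences(text, LEGACY_PUNCT_CHARS)
wider = split_sentences(text, WIDER_PUNCT_CHARS)
print(legacy)
print(wider)
```

With the limited set the two Chinese sentences stay merged, while the wider set separates them. That is the kind of difference to expect after upgrading unless `punct_chars` is pinned to the previous values.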