11063 Commits

Author SHA1 Message Date
Matthew Honnibal 29f9fec267 Improve spacy pretrain (#4393)
* Support bilstm_depth arg in spacy pretrain

* Add option to ignore zero vectors in get_cossim_loss

* Use cosine loss in Cloze multitask
2019-10-07 23:34:58 +02:00
Ines Montani 9cd6ca3e4d Improve usage of pkg_resources and handling of entry points (#4387)
* Only import pkg_resources where it's needed

Apparently it's really slow

* Use importlib_metadata for entry points

* Revert "Only import pkg_resources where it's needed"

This reverts commit 5ed8c03afa.

* Revert "Revert "Only import pkg_resources where it's needed""

This reverts commit 8b30b57957.

* Revert "Use importlib_metadata for entry points"

This reverts commit 9f071f5c40.

* Revert "Revert "Use importlib_metadata for entry points""

This reverts commit 02e12a17ec.

* Skip test that weirdly hangs

* Fix hanging test by using global
2019-10-07 17:22:09 +02:00
adrianeboyd d53a8d9313 Consider batch_size when sorting similar vectors (#4388) 2019-10-07 13:38:35 +02:00
adrianeboyd a3509f67d4 Extend unicode character block for Sinhala (#4378)
* Extend unicode character block for Sinhala

* Add sentencizer tests for more languages
2019-10-07 13:17:03 +02:00
Ines Montani 573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
adrianeboyd cbc2cee2c8 Improve URL_PATTERN and handling in tokenizer (#4374)
* Move prefix and suffix detection for URL_PATTERN

Move prefix and suffix detection for `URL_PATTERN` into the tokenizer.
Remove associated lookahead and lookbehind from `URL_PATTERN`.

Fix tokenization for Hungarian given new modified handling of prefixes
and suffixes.

* Match a wider range of URI schemes
2019-10-05 13:00:09 +02:00
Ines Montani e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Ines Montani fec9433044 Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373) 2019-10-04 12:18:41 +02:00
Ines Montani e7ddc6f662 Add conda install for lookups [ci skip] 2019-10-03 17:52:53 +02:00
Matthew Honnibal 37ef874d8b Set version to v2.2.1 2019-10-03 14:50:39 +02:00
Sofie Van Landeghem 4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani ce1d441de5 Add docs for Vectors.most_similar [ci skip] 2019-10-03 14:29:47 +02:00
Ben Taylor 1db79a33cb most_similar() return the k most similar vectors (#4364)
* most_similar return n-most similar vectors

* updated most_similar comment

* add bintay contributor agreement

* sign bintay contributor agreement

* fix most_similar documentation typo

* fixed error in prune_vectors

* updated prune_vectors test
2019-10-03 14:09:44 +02:00
Ines Montani 4159936720 Update README.md [ci skip] 2019-10-02 19:15:22 +02:00
Ines Montani e4782feae9 Update README.md [ci skip] 2019-10-02 18:49:55 +02:00
Ines Montani 80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00
Ines Montani f8e606c303 Update README.md [ci skip] 2019-10-02 16:47:10 +02:00
Ines Montani 12a941d841 Update binder version [ci skip] 2019-10-02 16:47:01 +02:00
Matthew Honnibal 2eb31012e7 Set version to v2.2.0 2019-10-02 14:40:06 +02:00
Matthew Honnibal 796072e560 Set version to v2.2.0.dev19 2019-10-02 12:51:29 +02:00
Sofie Van Landeghem 9d3ce7cba2 Ensure training doesn't crash with empty batches (#4360)
* unit test for previously resolved unflatten issue

* prevent batch of empty docs to cause problems
2019-10-02 12:50:47 +02:00
Ines Montani 52b5912dbf Tidy up [ci skip] 2019-10-02 12:05:59 +02:00
adrianeboyd d82241218a Make the default NER labels less model-specific [ci skip] (#4361) 2019-10-02 12:05:17 +02:00
adrianeboyd dda86118bd Update Ukrainian lemmatizer with new lookups (#4359)
* Update Ukrainian lemmatizer with new lookups

* Add missing import

Co-authored-by: Ines Montani <ines@ines.io>
2019-10-02 12:04:06 +02:00
Ines Montani b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani 208629615d Auto-format 2019-10-02 10:37:04 +02:00
Ines Montani 867e93aae2 Add Streamlit example [ci skip] 2019-10-02 01:21:20 +02:00
Matthew Honnibal 38b6e69389 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-01 22:28:25 +02:00
Matthew Honnibal d4b63bb6dd Set version to v2.2.0 2019-10-01 22:28:13 +02:00
Ines Montani 9885b5ae68 Update spacy_lookups_data version [ci skip] 2019-10-01 22:21:21 +02:00
Ines Montani 475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Matthew Honnibal 667f294627 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-01 21:37:25 +02:00
Ines Montani 0dd127bb00 Update v2-2.md [ci skip] 2019-10-01 21:37:06 +02:00
Matthew Honnibal 64a9577d43 Set version to v2.2.0.dev17 2019-10-01 21:36:59 +02:00
Ines Montani cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani 3297a19545 Warn in Tagger.begin_training if no lemma tables are available (#4351) 2019-10-01 15:13:55 +02:00
Ines Montani bc7e7db208 Fix wording [ci skip] 2019-10-01 14:20:44 +02:00
Ines Montani 2a3a4565cd Update infobox [ci skip] 2019-10-01 14:19:34 +02:00
Ines Montani 66aa0d479f Update v2.2 page [ci skip] 2019-10-01 14:11:05 +02:00
Ines Montani a8a1800f2a Update lemma data documentation [ci skip] 2019-10-01 13:22:13 +02:00
Ines Montani 932ad9cb91 Fix typos and formatting [ci skip] 2019-10-01 12:30:04 +02:00
Ines Montani ca0b20ae8b Make prereleases less verbose [ci skip] 2019-10-01 12:29:14 +02:00
Matthew Honnibal 2fb05482dd Set version to v2.2.0 2019-10-01 03:50:13 +02:00
Matthew Honnibal dc22ec0aad Set version to v2.2.0.dev17 2019-10-01 03:26:53 +02:00
Matthew Honnibal 377008bae2 Fix sdist for fabfile 2019-10-01 02:44:10 +02:00
Matthew Honnibal 91978a4de0 Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-01 00:31:08 +02:00
Matthew Honnibal aedfba867a Set version to v2.2.0.dev16 2019-10-01 00:31:00 +02:00
Ines Montani 30d872011d Merge branch 'master' of https://github.com/explosion/spaCy 2019-10-01 00:25:48 +02:00
Ines Montani 75b8021a86 Move setup requirements to setup.cfg 2019-10-01 00:25:46 +02:00
Ines Montani e0cf4796a5 Move lookup tables out of the core library (#4346)
* Add default to util.get_entry_point

* Tidy up entry points

* Read lookups from entry points

* Remove lookup tables and related tests

* Add lookups install option

* Remove lemmatizer tests

* Remove logic to process language data files

* Update setup.cfg
2019-10-01 00:01:27 +02:00