Commit Graph

52 Commits

Author SHA1 Message Date
Ines Montani 296446a1c8
Tidy up and improve docs and docstrings (#3370)
<!--- Provide a general summary of your changes in the title. -->

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00
Matthew Honnibal 449b889454 Fix KeyError in Vectors.most_similar. Fixes #2648 2018-12-10 16:19:18 +01:00
Matthew Honnibal 90aec6d2f6 Fix vectors for reserved words. Closes #2871 2018-12-10 16:09:49 +01:00
Ines Montani f37863093a 💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003)
Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉

See here: https://github.com/explosion/srsly

    Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place.

    At the same time, we noticed that having a lot of small dependencies was making maintainence harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel.

    srsly currently includes forks of the following packages:

        ujson
        msgpack
        msgpack-numpy
        cloudpickle



* WIP: replace json/ujson with srsly

* Replace ujson in examples

Use regular json instead of srsly to make code easier to read and follow

* Update requirements

* Fix imports

* Fix typos

* Replace msgpack with srsly

* Fix warning
2018-12-03 01:28:22 +01:00
Ines Montani 3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Suraj Rajan 1cdbb7c97c [2032] - Changed python set to cpp stl set (#2170)
Changed python set to cpp stl set #2032 

## Description

Changed python set to cpp stl set. CPP stl set works better due to the logarithmic run time of its methods. Finding minimum in the cpp set is done in constant time as opposed to the worst case linear runtime of python set. Operations such as find,count,insert,delete are also done in either constant and logarithmic time thus making cpp set a better option to manage vectors.
Reference : http://www.cplusplus.com/reference/set/set/

### Types of change
Enhancement for `Vectors` for faster initialising of word vectors(fasttext)
2018-03-31 13:28:25 +02:00
Ines Montani a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Matthew Honnibal 8308bbc617 Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts 2018-03-29 00:14:55 +02:00
Matthew Honnibal 95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']_nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
Matthew Honnibal 8cefc58abc Fix Vectors pickling 2018-03-14 16:59:37 +01:00
Claudiu-Vlad Ursache e28de12cbd
Ensure files opened in `from_disk` are closed
Fixes [issue 1706](https://github.com/explosion/spaCy/issues/1706).
2018-02-13 20:49:43 +01:00
Matthew Honnibal 29897ed1b3 Allow vector loading to work on 1d data files. Fixes #1831 2018-01-22 19:18:26 +01:00
Matthew Honnibal 1a1cca6052 Fix vectors.resize() on Py3. Closes #1539 2018-01-14 14:48:51 +01:00
Matthew Honnibal 36b47e3fa6 Fix (and test) vector pickling 2017-12-07 09:53:30 +01:00
Matthew Honnibal b712de774e Fix vectors pickling 2017-12-05 12:45:24 +01:00
Matthew Honnibal a5ea0fdf5a Fix #1518: vocab.vectors.resize() didn't work 2017-11-08 22:18:37 +01:00
Matthew Honnibal 225cc249c9 Pass string path to numpy, to fix #1479 2017-11-05 14:42:46 +01:00
Matthew Honnibal fdb4b8e456 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 02:07:17 +01:00
Matthew Honnibal c48dd0e1d3 Fix vector pruning 2017-11-01 02:06:58 +01:00
ines 5683fd65ed Update docstrings 2017-11-01 00:42:39 +01:00
Matthew Honnibal c16310d156 Update vectors with find method 2017-11-01 00:34:55 +01:00
ines 2ad2f09d12 Update docstrings and simplify most_similar 2017-11-01 00:18:08 +01:00
ines ba2e6c8c6f Update docstrings and formatting 2017-10-31 23:23:34 +01:00
Matthew Honnibal d90a22afe6 Fix loading previous vectors models 2017-10-31 19:58:35 +01:00
Matthew Honnibal 997a61557a Add vectors.n_keys property 2017-10-31 19:30:52 +01:00
Matthew Honnibal 77d8f5de9a Revise and simplify Vectors class 2017-10-31 18:25:08 +01:00
Matthew Honnibal 9c11ee4a1c WIP on vectors fixes 2017-10-31 11:22:56 +01:00
Matthew Honnibal 368fdb389a WIP on refactoring and fixing vectors 2017-10-31 02:00:26 +01:00
Matthew Honnibal 4112a991ec Fix vector pruning 2017-10-30 19:44:40 +01:00
Explosion Bot d0cf12c8c7 Fix off-by-one error in vectors 2017-10-30 16:22:03 +01:00
Explosion Bot ab5d5ed880 Fix vectors.add() 2017-10-30 16:08:09 +01:00
Explosion Bot 72aea8f105 Update vectors.add() to allow setting keys to rows 2017-10-30 10:03:08 +01:00
ines 5167a0cce2 Tidy up Vectors and docs 2017-10-27 19:45:19 +02:00
Matthew Honnibal cfae54c507 Make change to Vectors.__init__ 2017-10-20 14:19:04 +02:00
Matthew Honnibal 92ac9316b5 Fix initialization of vectors, to address serialization problem 2017-10-20 13:59:24 +02:00
Matthew Honnibal df488274b1 Fix deserialization of vectors 2017-10-16 20:55:00 +02:00
Matthew Honnibal d90cc917fa Merge vectors.pyx doc strings 2017-10-01 17:05:54 -05:00
Matthew Honnibal b2a8b9be77 Fix inconsistency of Vectors class API 2017-10-01 17:00:34 -05:00
Matthew Honnibal 97c409b602 Add docstrings for spacy.vectors 2017-10-01 22:10:33 +02:00
Matthew Honnibal 4f38a67a89 Make width default to 0 in vectors.pyx 2017-09-17 12:29:14 -05:00
Matthew Honnibal e0a2aa9289 Support having word vectors data on GPU 2017-09-16 12:45:09 -05:00
Matthew Honnibal 7742a6d559 Add GloVe vectors reader 2017-09-01 16:39:22 +02:00
Matthew Honnibal b8e1603cc4 Fix load fail for missing vectors 2017-08-19 22:07:00 +02:00
Matthew Honnibal 6a94648373 Fix serialization 2017-08-19 21:27:35 +02:00
Matthew Honnibal 1157294434 Improve vector handling 2017-08-19 20:35:33 +02:00
Matthew Honnibal 93fb8b64e9 Fix vector loading 2017-08-19 19:52:25 +02:00
Matthew Honnibal 3d049af563 Improve vectors to/from disk 2017-08-19 18:42:11 +02:00
Matthew Honnibal 19c495f451 Fix vectors deserialization 2017-08-19 04:33:03 +02:00
Matthew Honnibal ed4fb991dc Work on vectors loading 2017-08-18 20:45:48 +02:00
Matthew Honnibal 5489b49203 Remove print statement 2017-06-05 13:20:41 +02:00