Commit Graph

8679 Commits

Author SHA1 Message Date
ines 62b4b527d7 Don't raise error if set_extension has getter and setter (closes #2177)
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
2018-04-03 18:30:17 +02:00
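The `_unset` comparison mentioned in 62b4b527d7 can be illustrated with a sentinel object. This is a minimal sketch, not spaCy's code: `_UNSET`, `validate_extension` and the exact validation rules are assumptions made for illustration. A dedicated sentinel lets `default=None` count as a deliberate value instead of "nothing passed".

```python
# Hypothetical sentinel: a unique object no caller passes by accident.
_UNSET = object()


def validate_extension(default=_UNSET, getter=None, setter=None, method=None):
    """Illustrative checks only; names and rules are assumptions."""
    # A setter only makes sense alongside a getter.
    if setter is not None and getter is None:
        raise ValueError("A setter was specified without a getter.")
    # Count the ways the extension value can be defined.
    # `default is not _UNSET` is True even when default=None.
    provided = sum([default is not _UNSET, getter is not None, method is not None])
    if provided == 0:
        raise ValueError("Specify one of: default, getter or method.")
    if provided > 1:
        raise ValueError("Specify only one of: default, getter or method.")
    return (default, getter, setter, method)


# default=None is accepted, and getter plus setter is accepted.
none_default = validate_extension(default=None)
getter_and_setter = validate_extension(getter=len, setter=print)
```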
ines ee3082ad29 Fix whitespace 2018-04-03 18:29:53 +02:00
ines de137fba84 Add TensorBoard examples to examples overview [ci skip] 2018-04-03 16:01:52 +02:00
ines 6d87b28f15 Add Vietnamese to language overview [ci skip] 2018-04-03 16:01:36 +02:00
Ines Montani 3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
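The decorator idea noted in 3141e04822 (modify the string only when it is retrieved from the containing class) could look roughly like the sketch below. The names `add_codes` and `Errors` and the message text are assumptions for illustration; the actual spacy.errors module may differ.

```python
def add_codes(err_cls):
    """Wrap a class of message strings so lookups return '[CODE] message'."""

    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            # The stored string is untouched; the code is prepended on access.
            msg = getattr(err_cls, code)
            return "[{code}] {msg}".format(code=code, msg=msg)

    return ErrorsWithCodes()


@add_codes
class Errors(object):
    # Hypothetical message, for illustration only.
    E001 = "No component '{name}' found in pipeline."


# The attribute name doubles as the error code.
message = Errors.E001.format(name="tagger")
```

Because the code is attached at lookup time, the exception that carries the message is an ordinary exception with an ordinary string, so tracebacks need no special handling.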
Matthew Honnibal abf8b16d71
Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start raising deprecation warnings on direct calls to doc.merge(),
to migrate people to the retokenize context manager.
2018-04-03 14:10:35 +02:00
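The accumulate-then-apply behaviour described in abf8b16d71 can be sketched without spaCy. The `Retokenizer` class below is a hypothetical stand-in that works on a plain list of strings and takes offsets instead of `Span` objects; it assumes the queued merges do not overlap.

```python
class Retokenizer(object):
    """Hypothetical stand-in: queues merges over a list of string tokens."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []  # (start, end) offsets, end exclusive

    def merge(self, start, end):
        # Only record the request; nothing changes until the block exits.
        self.merges.append((start, end))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None:
            return False  # Leave the tokens untouched if the block raised.
        # Apply from right to left so earlier offsets stay valid.
        for start, end in sorted(self.merges, reverse=True):
            self.tokens[start:end] = [" ".join(self.tokens[start:end])]
        return False


tokens = ["New", "York", "is", "in", "New", "York", "State"]
with Retokenizer(tokens) as retokenizer:
    retokenizer.merge(0, 2)
    retokenizer.merge(4, 7)
```

Applying all merges in one pass at the end means offsets computed before the block stay meaningful for every request, which is one reason batching is less error prone than merging one span at a time.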
Matthew Honnibal 8a120fb455 Disable batch size compounding in ud-train 2018-04-01 08:45:00 +00:00
Matthew Honnibal 98165e43a7 Sometimes update beam with greedy oracle 2018-04-01 08:44:35 +00:00
ines 638068ec6c Restore contributor agreement 2018-03-31 14:06:37 +02:00
Suraj Rajan 1cdbb7c97c [2032] - Changed python set to cpp stl set (#2170)
Changed python set to cpp stl set #2032 

## Description

Changed the Python set to a C++ STL set. The C++ STL set works better because its methods run in logarithmic time. Finding the minimum in the C++ set takes constant time, as opposed to the worst-case linear runtime for a Python set. Operations such as find, count, insert and delete also run in either constant or logarithmic time, making the C++ set a better option for managing vectors.
Reference : http://www.cplusplus.com/reference/set/set/

### Types of change
Enhancement for `Vectors`, for faster initialisation of word vectors (fasttext)
2018-03-31 13:28:25 +02:00
Katrin Leinweber 6f84e32253 Formalise citation info (#2167)
* Create CITATION file

* Add Katrinleinweber contributor agreement
2018-03-30 10:34:14 +02:00
Matthew Honnibal f3b7c5e537 Fix syntax error 2018-03-29 21:50:32 +02:00
Matthew Honnibal 23afa6429f Add input length error, to address #1826 2018-03-29 21:45:26 +02:00
Matthew Honnibal cca7e7ad11 Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-29 20:27:06 +02:00
Matthew Honnibal 68ad366935 Improve train_new_entity_type example 2018-03-29 20:26:41 +02:00
Ines Montani a609a1ca29
Merge pull request #2152 from explosion/feature/tidy-up-dependencies
💫 Tidy up dependencies
2018-03-29 14:35:09 +02:00
Viet Trung Tran ea2af94cd9 Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155)
* support for Vietnamese

* Contributor Agreement for adding Vietnamese support on spaCy
2018-03-29 12:19:51 +02:00
Matthew Honnibal 6efb76bb3f Require next thinc 2018-03-28 23:30:32 +00:00
ines e6979bdbbd Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies 2018-03-29 00:19:37 +02:00
ines 83146458a2 Fix urllib for Python 3 2018-03-29 00:19:33 +02:00
Matthew Honnibal 8308bbc617 Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts 2018-03-29 00:14:55 +02:00
Matthew Honnibal b5098079d8 Fix error on urllib 2018-03-29 00:08:16 +02:00
Ines Montani 0de599b16b
Merge pull request #2159 from explosion/feature/fix-merged-entity-iob (resolves #1554, resolves #1752)
💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents
2018-03-28 23:10:00 +02:00
Ines Montani 98e9cda677
Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660)
💫 Fix loading of multiple vector models
2018-03-28 23:08:24 +02:00
Matthew Honnibal a7c5ae2beb Avoid forcing a name on empty vectors, and remove print statement 2018-03-28 21:08:58 +02:00
ines 3eb67bbe4b Allow entity types with dashes (resolves #1967) 2018-03-28 20:51:26 +02:00
Matthew Honnibal cf5fcf0546 Update serialization test 2018-03-28 20:12:53 +02:00
Matthew Honnibal 4555e3e251 Don't assume pretrained_vectors cfg set in build_tagger 2018-03-28 20:12:45 +02:00
ines 9615ed5ed7 Update emoji/hashtag matcher example (resolves #2156) [ci skip] 2018-03-28 18:41:28 +02:00
Matthew Honnibal 0b375d50c8 Fix ent_iob tags in doc.merge to avoid inconsistent sequences 2018-03-28 18:39:03 +02:00
Matthew Honnibal 95fa89c4b8 Update doc.ents test 2018-03-28 18:39:03 +02:00
Matthew Honnibal e807f88410 Resolve merge when cherry-picking ent iob patches from develop 2018-03-28 18:38:13 +02:00
Matthew Honnibal 99fbc7db33 Improve error message when entity sequence is inconsistent 2018-03-28 18:36:53 +02:00
Matthew Honnibal cbd2794be0 Add test for ent_iob during span merge 2018-03-28 18:36:53 +02:00
Matthew Honnibal f8dd905a24 Warn and fallback if vectors have no name 2018-03-28 18:24:53 +02:00
Matthew Honnibal fd9e259414 Add test for #1660 2018-03-28 18:22:51 +02:00
Matthew Honnibal bc4afa9881 Remove print statement 2018-03-28 17:48:37 +02:00
Matthew Honnibal 79dc241caa Set pretrained_vectors in parser cfg 2018-03-28 17:35:07 +02:00
Matthew Honnibal 17c3e7efa2 Add message noting vectors 2018-03-28 16:33:43 +02:00
Matthew Honnibal 9bf6e93b3e Set pretrained_vectors in begin_training 2018-03-28 16:32:41 +02:00
Matthew Honnibal 95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']}_{nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
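The compatibility check described in 95a9615221 amounts to a small cfg migration at load time. The sketch below is an illustration under stated assumptions: `upgrade_cfg` and `default_vectors_name` are hypothetical helpers, and the default name is assumed to be the language and model name joined by an underscore, followed by `.vectors`.

```python
def default_vectors_name(meta):
    # Assumed format: "<lang>_<name>.vectors".
    return "{lang}_{name}.vectors".format(lang=meta["lang"], name=meta["name"])


def upgrade_cfg(cfg, vectors_name):
    """Return a copy of cfg that also carries the newer key."""
    cfg = dict(cfg)  # Do not mutate the caller's dict.
    # Older models only have 'pretrained_dims'; add the new key for them.
    if "pretrained_dims" in cfg and "pretrained_vectors" not in cfg:
        cfg["pretrained_vectors"] = vectors_name
    return cfg


meta = {"lang": "en", "name": "core_web_sm"}
old_cfg = {"pretrained_dims": 300}
new_cfg = upgrade_cfg(old_cfg, default_vectors_name(meta))
```

Keying each model's vectors by a name derived from its own metadata is what keeps two loaded models from sharing one ID.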
ines 07b8c255a5 Update example with note to install requests 2018-03-28 12:46:27 +02:00
ines 366c98a94b Remove requests dependency 2018-03-28 12:46:18 +02:00
ines 7fbc9e5874 Replace requests with urllib 2018-03-28 12:46:07 +02:00
ines da1f200362 Add compat helpers for urllib 2018-03-28 12:45:53 +02:00
ines ac88c72c9a Fix ftfy workaround and remove old import 2018-03-28 12:14:28 +02:00
ines ce6071ca89 Remove ftfy dependency and update docs 2018-03-28 12:09:42 +02:00
Matthew Honnibal 070b6c6495 Remove dependency on ftfy 2018-03-28 12:07:02 +02:00
ines 6d2c85f428 Drop six and related hacks as a dependency 2018-03-28 10:45:25 +02:00
ines 9e83513004 Add position of invalid token to error message 2018-03-27 23:56:59 +02:00