352 Commits

Author SHA1 Message Date
Shooter23 6ae8e49bff Fix docstring for is_right_punct(). (#3044) 2018-12-14 10:11:11 +01:00
Grivaz 57f274b693 raise error when setting overlapping entities as doc.ents (#2880) 2018-10-26 23:29:16 +02:00
darindf 8227566805 Fix error (#2802)
* Fix error
ValueError: cannot resize an array that references or is referenced
by another array in this way.  Use the resize function

* added spaCy Contributor Agreement
2018-09-26 21:31:03 +02:00
Ines Montani 3c4e3ade30 Fix typo (closes #2784) 2018-09-21 10:45:11 +02:00
Grivaz aeba99ab0d Introduces a bulk merge function, in order to solve issue #653 (#2696)
* Fix comment

* Introduce bulk merge to increase performance on many span merges

* Sign contributor agreement

* Implement pull request suggestions
2018-09-10 16:41:42 +02:00
Piotr Żelasko bdb2165bd1 Less norm computations in token similarity (#2730)
* Less norm computations in token similarity

* Contributor agreement
2018-09-05 21:50:23 +02:00
Ole Henrik Skogstrøm 0473add369 Feature/span ents (#2599)
* Created Span.ents property

* Add tests for span.ents

* Add tests for start and end of sentence
2018-08-07 13:52:32 +02:00
Matthew Honnibal e0caf3ae8c Fix msgpack for new version 2018-07-20 17:32:00 +02:00
Matthew Honnibal 9db77fd914 Fix deserialization for msgpack 2018-07-20 14:11:09 +02:00
Ole Henrik Skogstrøm c21efea9bb Add sent property to token (#2521)
* Add sent property to token

* Refactored and cleaned up copy paste errors.
2018-07-06 15:54:15 +02:00
ines b59e3b157f Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304) 2018-05-20 15:15:37 +02:00
Douglas Knox 9b49a40f4e Test and fix for Issue #2219 (#2272)
Test and fix for Issue #2219: Token.similarity() failed if single letter
2018-05-03 18:40:46 +02:00
Mr Roboto 6f5ccda19c Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230)
* Fixes issue #2228

* Adds a new contributor
2018-05-01 13:40:22 +02:00
ines 1c6d77610c Add remove_extension method on Doc, Token and Span (closes #2242) 2018-04-28 23:33:09 +02:00
ines 9632595fb4 Use correct, non-deprecated merge syntax (resolves #2226) 2018-04-18 18:28:28 -04:00
Suraj Rajan 5957f15227 Fixed typos for #2222,#2223 (#2233) (closes #2222, closes #2223) 2018-04-18 14:55:26 -07:00
Xiaoquan Kong e2f13ec722 bugfix: `Doc.noun_chunks` call `Doc.noun_chunks_iterator` without checking (closes #2194) 2018-04-08 23:44:05 +02:00
ines e5f47cd82d Update errors 2018-04-03 21:40:29 +02:00
ines 62b4b527d7 Don't raise error if set_extension has getter and setter (closes #2177)
Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.
2018-04-03 18:30:17 +02:00
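The "compare against _unset to allow default=None" idea in the entry above can be illustrated with a sentinel object. This is a minimal sketch, not spaCy's actual code: `check_extension` and its error messages are hypothetical, and only the name `_unset` comes from the commit message.

```python
# Hypothetical sentinel: a unique object that no caller can pass by accident,
# so an explicit default=None can be told apart from "no default given".
_unset = object()


def check_extension(default=_unset, getter=None, setter=None):
    """Toy validation of extension arguments (assumed rules, for illustration)."""
    if setter is not None and getter is None:
        raise ValueError("setter specified without a getter")
    if default is _unset and getter is None:
        raise ValueError("need either a default or a getter")
    return default, getter, setter


# default=None is accepted because the check is against _unset, not None.
none_default = check_extension(default=None)

# A getter together with a setter is accepted as well.
both = check_extension(getter=lambda obj: 1, setter=lambda obj, value: None)
```

Testing `default is None` instead would make it impossible to register an extension whose default value is `None`, which is the case the sentinel exists to cover.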
ines ee3082ad29 Fix whitespace 2018-04-03 18:29:53 +02:00
Ines Montani 3141e04822 💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Matthew Honnibal abf8b16d71 Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start making deprecation warnings on direct calls to doc.merge(),
to migrate people to use of the retokenize context manager.
2018-04-03 14:10:35 +02:00
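The accumulate-then-apply behaviour described in the entry above can be sketched without spaCy. This is a toy model under stated assumptions: `ToyRetokenizer` and `retokenize` are hypothetical names operating on a plain list of strings, merges take offsets rather than `Span` objects, and the attrs are recorded but ignored. It is not the library's implementation.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in for a retokenizer; works on a list of strings."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []  # queued (start, end, attrs) requests

    def merge(self, start, end, attrs=None):
        # Only record the request; the token list is not modified yet.
        self.merges.append((start, end, attrs or {}))

    def apply(self):
        # Apply from right to left so earlier offsets remain valid.
        for start, end, _attrs in sorted(
            self.merges, key=lambda m: m[0], reverse=True
        ):
            self.tokens[start:end] = [" ".join(self.tokens[start:end])]
        self.merges.clear()


@contextmanager
def retokenize(tokens):
    retokenizer = ToyRetokenizer(tokens)
    yield retokenizer
    retokenizer.apply()  # runs when the block exits without an error


tokens = ["I", "like", "New", "York", "and", "San", "Francisco"]
matches = [(2, 4, "GPE"), (5, 7, "GPE")]

with retokenize(tokens) as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(start, end, attrs={"ent_type": label})
    # Still the original seven tokens here: merges are only queued.
    snapshot_inside_block = list(tokens)
```

Deferring the work to the end of the block means every offset in `matches` refers to the original tokenization, so the caller does not have to adjust indices after each merge.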
Matthew Honnibal 0b375d50c8 Fix ent_iob tags in doc.merge to avoid inconsistent sequences 2018-03-28 18:39:03 +02:00
Matthew Honnibal e807f88410 Resolve merge when cherry-picking ent iob patches from develop 2018-03-28 18:38:13 +02:00
Matthew Honnibal 99fbc7db33 Improve error message when entity sequence is inconsistent 2018-03-28 18:36:53 +02:00
ines 9e83513004 Add position of invalid token to error message 2018-03-27 23:56:59 +02:00
ines 693971dd8f Improve error message if token text is empty string (see #2101) 2018-03-27 22:25:40 +02:00
ines 0c829e6605 Fix whitespace 2018-03-27 22:20:59 +02:00
Matthew Honnibal 63a267b34d Fix #2073: Token.set_extension not working 2018-03-27 13:36:20 +02:00
Thomas Opsomer fbf48b3f9f lemma property to return hash instead of unicode 2018-03-14 17:03:00 +01:00
4altinok ca8728035d added new lex feat to token 2018-02-11 18:55:48 +01:00
Thomas Opsomer 515e25910e fix sent_start in serialization 2018-01-28 19:50:42 +01:00
Matthew Honnibal 56164ab688 Set l_edge and r_edge correctly for non-projective parses. Fixes #1799 2018-01-22 20:18:04 +01:00
Matthew Honnibal ccb51a9f36 Make .similarity() return 1.0 if all orth attrs match 2018-01-15 16:29:48 +01:00
Matthew Honnibal b904d81e9a Fix rich comparison against None objects. Closes #1757 2018-01-15 15:51:25 +01:00
Matthew Honnibal ab7c45b12d Fix error message and handling of doc.sents 2018-01-15 15:21:11 +01:00
Matthew Honnibal 465a6f6452 Add missing Span.vocab property. Closes #1633 2018-01-14 15:06:30 +01:00
Matthew Honnibal 0cb090e526 Fix infinite recursion in token.sent_start. Closes #1640 2018-01-14 15:02:15 +01:00
Matthew Honnibal 5cbe913b6f Don't raise deprecation warning in property. Closes #1813, #1712 2018-01-14 14:55:58 +01:00
Matthew Honnibal e10e9ad2c5 Improve efficiency of Doc.to_array 2017-11-23 12:33:27 +00:00
Matthew Honnibal fa62427300 Remove lookup-based lemmatization 2017-11-23 12:32:22 +00:00
Matthew Honnibal fb26b2cb12 Use lookup lemmatizer if lemma unset 2017-11-23 12:31:58 +00:00
Burton DeWilde a5c6869b2d Fix bug where span.orth_ != span.text (see #1612) 2017-11-20 12:05:43 -06:00
Motoki Wu a52e195a0a Fixes Issue #1207 where `noun_chunks` of `Span` gives an error.
Make sure to reference `self.doc` when getting the noun chunks.

Same fix as 9750a0128c
2017-11-17 17:16:20 -08:00
ines 1c218397f6 Ensure path in Doc.to_disk/from_disk (resolves #1521)
Also add Doc serialization tests with both Path and string path options
2017-11-09 02:29:03 +01:00
Matthew Honnibal 144a93c2a5 Back-off to tensor for similarity if no vectors 2017-11-03 20:56:33 +01:00
Matthew Honnibal 62ed58935a Add Doc.extend_tensor() method 2017-11-03 11:20:31 +01:00
ines 9659391944 Update deprecated methods and add warnings 2017-11-01 16:49:42 +01:00
ines 705a4e3e4a Fix formatting 2017-11-01 16:44:08 +01:00
Matthew Honnibal 9e0ebee81c Add Token.is_sent_start property, so can deprecate Token.sent_start 2017-11-01 13:27:14 +01:00