spaCy

Commit Graph

Author	SHA1	Message	Date
Shooter23	6ae8e49bff	Fix docstring for is_right_punct(). (#3044)	2018-12-14 10:11:11 +01:00
Grivaz	57f274b693	raise error when setting overlapping entities as doc.ents (#2880)	2018-10-26 23:29:16 +02:00
darindf	8227566805	Fix error (#2802) * Fix error ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function * added spaCy Contributor Agreement	2018-09-26 21:31:03 +02:00
Ines Montani	3c4e3ade30	Fix typo (closes #2784)	2018-09-21 10:45:11 +02:00
Grivaz	aeba99ab0d	Introduces a bulk merge function, in order to solve issue #653 (#2696) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions	2018-09-10 16:41:42 +02:00
Piotr Żelasko	bdb2165bd1	Less norm computations in token similarity (#2730) * Less norm computations in token similarity * Contributor agreement	2018-09-05 21:50:23 +02:00
Ole Henrik Skogstrøm	0473add369	Feature/span ents (#2599) * Created Span.ents property * Add tests for span.ents * Add tests for start and end of sentence	2018-08-07 13:52:32 +02:00
Matthew Honnibal	e0caf3ae8c	Fix msgpack for new version	2018-07-20 17:32:00 +02:00
Matthew Honnibal	9db77fd914	Fix deserialization for msgpack	2018-07-20 14:11:09 +02:00
Ole Henrik Skogstrøm	c21efea9bb	Add sent property to token (#2521) * Add sent property to token * Refactored and cleaned up copy paste errors.	2018-07-06 15:54:15 +02:00
ines	b59e3b157f	Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304)	2018-05-20 15:15:37 +02:00
Douglas Knox	9b49a40f4e	Test and fix for Issue #2219 (#2272) Test and fix for Issue #2219: Token.similarity() failed if single letter	2018-05-03 18:40:46 +02:00
Mr Roboto	6f5ccda19c	Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230) * Fixes issue #2228 * Adds a new contributor	2018-05-01 13:40:22 +02:00
ines	1c6d77610c	Add remove_extension method on Doc, Token and Span (closes #2242)	2018-04-28 23:33:09 +02:00
ines	9632595fb4	Use correct, non-deprecated merge syntax (resolves #2226)	2018-04-18 18:28:28 -04:00
Suraj Rajan	5957f15227	Fixed typos for #2222,#2223 (#2233) (closes #2222, closes #2223)	2018-04-18 14:55:26 -07:00
Xiaoquan Kong	e2f13ec722	bugfix: `Doc.noun_chunks` call `Doc.noun_chunks_iterator` without checking (closes #2194)	2018-04-08 23:44:05 +02:00
ines	e5f47cd82d	Update errors	2018-04-03 21:40:29 +02:00
ines	62b4b527d7	Don't raise error if set_extension has getter and setter (closes #2177) Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.	2018-04-03 18:30:17 +02:00
ines	ee3082ad29	Fix whitespace	2018-04-03 18:29:53 +02:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	abf8b16d71	Add doc.retokenize() context manager (#2172) This patch takes a step towards #1487 by introducing the doc.retokenize() context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this: with doc.retokenize() as retokenizer: for start, end, label in matches: retokenizer.merge(doc[start : end], attrs={'ent_type': label}) The retokenizer accumulates the merge requests, and applies them together at the end of the block. This will allow retokenization to be more efficient, and much less error prone. A retokenizer.split() function will then be added, to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the .merges and .splits lists on the retokenizer. The doc.merge() method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, doc.merge() fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start making deprecation warnings on direct calls to doc.merge(), to migrate people to use of the retokenize context manager.	2018-04-03 14:10:35 +02:00
Matthew Honnibal	0b375d50c8	Fix ent_iob tags in doc.merge to avoid inconsistent sequences	2018-03-28 18:39:03 +02:00
Matthew Honnibal	e807f88410	Resolve merge when cherry-picking ent iob patches from develop	2018-03-28 18:38:13 +02:00
Matthew Honnibal	99fbc7db33	Improve error message when entity sequence is inconsistent	2018-03-28 18:36:53 +02:00
ines	9e83513004	Add position of invalid token to error message	2018-03-27 23:56:59 +02:00
ines	693971dd8f	Improve error message if token text is empty string (see #2101)	2018-03-27 22:25:40 +02:00
ines	0c829e6605	Fix whitespace	2018-03-27 22:20:59 +02:00
Matthew Honnibal	63a267b34d	Fix #2073: Token.set_extension not working	2018-03-27 13:36:20 +02:00
Thomas Opsomer	fbf48b3f9f	lemma property to return hash instead of unicode	2018-03-14 17:03:00 +01:00
4altinok	ca8728035d	added new lex feat to token	2018-02-11 18:55:48 +01:00
Thomas Opsomer	515e25910e	fix sent_start in serialization	2018-01-28 19:50:42 +01:00
Matthew Honnibal	56164ab688	Set l_edge and r_edge correctly for non-projective parses. Fixes #1799	2018-01-22 20:18:04 +01:00
Matthew Honnibal	ccb51a9f36	Make .similarity() return 1.0 if all orth attrs match	2018-01-15 16:29:48 +01:00
Matthew Honnibal	b904d81e9a	Fix rich comparison against None objects. Closes #1757	2018-01-15 15:51:25 +01:00
Matthew Honnibal	ab7c45b12d	Fix error message and handling of doc.sents	2018-01-15 15:21:11 +01:00
Matthew Honnibal	465a6f6452	Add missing Span.vocab property. Closes #1633	2018-01-14 15:06:30 +01:00
Matthew Honnibal	0cb090e526	Fix infinite recursion in token.sent_start. Closes #1640	2018-01-14 15:02:15 +01:00
Matthew Honnibal	5cbe913b6f	Don't raise deprecation warning in property. Closes #1813, #1712	2018-01-14 14:55:58 +01:00
Matthew Honnibal	e10e9ad2c5	Improve efficiency of Doc.to_array	2017-11-23 12:33:27 +00:00
Matthew Honnibal	fa62427300	Remove lookup-based lemmatization	2017-11-23 12:32:22 +00:00
Matthew Honnibal	fb26b2cb12	Use lookup lemmatizer if lemma unset	2017-11-23 12:31:58 +00:00
Burton DeWilde	a5c6869b2d	Fix bug where span.orth_ != span.text (see #1612)	2017-11-20 12:05:43 -06:00
Motoki Wu	a52e195a0a	Fixes Issue #1207 where `noun_chunks` of `Span` gives an error. Make sure to reference `self.doc` when getting the noun chunks. Same fix as `9750a0128c`	2017-11-17 17:16:20 -08:00
ines	1c218397f6	Ensure path in Doc.to_disk/from_disk (resolves #1521) Also add Doc serialization tests with both Path and string path options	2017-11-09 02:29:03 +01:00
Matthew Honnibal	144a93c2a5	Back-off to tensor for similarity if no vectors	2017-11-03 20:56:33 +01:00
Matthew Honnibal	62ed58935a	Add Doc.extend_tensor() method	2017-11-03 11:20:31 +01:00
ines	9659391944	Update deprecated methods and add warnings	2017-11-01 16:49:42 +01:00
ines	705a4e3e4a	Fix formatting	2017-11-01 16:44:08 +01:00
Matthew Honnibal	9e0ebee81c	Add Token.is_sent_start property, so can deprecate Token.sent_start	2017-11-01 13:27:14 +01:00


352 Commits