spaCy

Commit Graph

Author	SHA1	Message	Date
Matthew Honnibal	abf8b16d71	Add doc.retokenize() context manager (#2172) This patch takes a step towards #1487 by introducing the doc.retokenize() context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this: with doc.retokenize() as retokenizer: for start, end, label in matches: retokenizer.merge(doc[start : end], attrs={'ent_type': label}) The retokenizer accumulates the merge requests, and applies them together at the end of the block. This will allow retokenization to be more efficient, and much less error prone. A retokenizer.split() function will then be added, to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the .merges and .splits lists on the retokenizer. The doc.merge() method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, doc.merge() fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start making deprecation warnings on direct calls to doc.merge(), to migrate people to use of the retokenize context manager.	2018-04-03 14:10:35 +02:00
Matthew Honnibal	0b375d50c8	Fix ent_iob tags in doc.merge to avoid inconsistent sequences	2018-03-28 18:39:03 +02:00
Matthew Honnibal	e807f88410	Resolve merge when cherry-picking ent iob patches from develop	2018-03-28 18:38:13 +02:00
Matthew Honnibal	99fbc7db33	Improve error message when entity sequence is inconsistent	2018-03-28 18:36:53 +02:00
ines	9e83513004	Add position of invalid token to error message	2018-03-27 23:56:59 +02:00
ines	693971dd8f	Improve error message if token text is empty string (see #2101)	2018-03-27 22:25:40 +02:00
ines	0c829e6605	Fix whitespace	2018-03-27 22:20:59 +02:00
Thomas Opsomer	515e25910e	fix sent_start in serialization	2018-01-28 19:50:42 +01:00
Matthew Honnibal	56164ab688	Set l_edge and r_edge correctly for non-projective parses. Fixes #1799	2018-01-22 20:18:04 +01:00
Matthew Honnibal	ccb51a9f36	Make .similarity() return 1.0 if all orth attrs match	2018-01-15 16:29:48 +01:00
Matthew Honnibal	ab7c45b12d	Fix error message and handling of doc.sents	2018-01-15 15:21:11 +01:00
Matthew Honnibal	e10e9ad2c5	Improve efficiency of Doc.to_array	2017-11-23 12:33:27 +00:00
Matthew Honnibal	fa62427300	Remove lookup-based lemmatization	2017-11-23 12:32:22 +00:00
ines	1c218397f6	Ensure path in Doc.to_disk/from_disk (resolves #1521). Also add Doc serialization tests with both Path and string path options	2017-11-09 02:29:03 +01:00
Matthew Honnibal	144a93c2a5	Back-off to tensor for similarity if no vectors	2017-11-03 20:56:33 +01:00
Matthew Honnibal	62ed58935a	Add Doc.extend_tensor() method	2017-11-03 11:20:31 +01:00
ines	9659391944	Update deprecated methods and add warnings	2017-11-01 16:49:42 +01:00
ines	705a4e3e4a	Fix formatting	2017-11-01 16:44:08 +01:00
Matthew Honnibal	7e7116cdf7	Fix Doc.to_array when only one string attr provided	2017-11-01 13:26:43 +01:00
ines	544a407b93	Tidy up Doc, Token and Span and add missing docs	2017-10-27 17:07:26 +02:00
ines	6a0483b7aa	Tidy up and document Doc, Token and Span	2017-10-27 15:41:45 +02:00
Matthew Honnibal	ccd2ab1a62	Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix Add LCA matrix for spans and docs	2017-10-24 11:22:46 +02:00
Ramanan Balakrishnan	d2fe56a577	Add LCA matrix for spans and docs	2017-10-20 23:58:00 +05:30
Ramanan Balakrishnan	0726946563	cleanup to_array implementation using fixes on master	2017-10-20 17:09:37 +05:30
Ramanan Balakrishnan	b3ab124fc5	Support strings for attribute list in doc.to_array	2017-10-20 11:46:57 +05:30
Ramanan Balakrishnan	7b9b1be44c	Support single value for attribute list in doc.to_array	2017-10-19 17:00:41 +05:30
Matthew Honnibal	394633efce	Make doc pickling support hooks	2017-10-17 19:44:09 +02:00
Matthew Honnibal	cdb0c426d8	Improve deserialization of user_data, esp. for Underscore	2017-10-17 19:29:20 +02:00
Matthew Honnibal	32a8564c79	Fix doc pickling	2017-10-17 18:20:24 +02:00
Matthew Honnibal	92c1eb2d6f	Fix Doc pickling. This also removes need for Binder class	2017-10-17 16:11:13 +02:00
Matthew Honnibal	a002264fec	Remove caching of Token in Doc, as caused cycle.	2017-10-16 19:34:21 +02:00
ines	e0ff145a8b	Merge branch 'develop' into feature/dot-underscore	2017-10-11 11:57:05 +02:00
Matthew Honnibal	3b527fa52b	Call morphology.assign_untagged when pushing token to Doc	2017-10-11 03:23:57 +02:00
Matthew Honnibal	e0a9b02b67	Merge Span._ and Span.as_doc methods	2017-10-09 22:00:15 -05:00
Matthew Honnibal	e938bce320	Adjust parsing transition system to allow preset sentence segments.	2017-10-08 23:53:34 +02:00
Matthew Honnibal	668a0ea640	Pass extensions into Underscore class	2017-10-07 18:56:01 +02:00
ines	2480f8f521	Add missing return in Doc.from_disk() (closes #1330)	2017-09-18 15:32:00 +02:00
Matthew Honnibal	03b5b9727a	Fix Doc.vector for empty doc objects	2017-08-22 19:52:19 +02:00
Matthew Honnibal	0551b7b03a	Fix doc.vector	2017-08-22 19:46:52 +02:00
Matthew Honnibal	8b7ac77c23	Allow span label to be string in Doc.char_span	2017-08-19 16:18:09 +02:00
Matthew Honnibal	80236116a6	Add Doc.char_span method, to get a span by character offset	2017-08-19 12:21:09 +02:00
Matthew Honnibal	a6a2159969	Add slot for text categories to Doc	2017-07-22 00:34:15 +02:00
Matthew Honnibal	2a3bd5ee90	Fix fetching of noun chunk iterator	2017-06-04 15:53:05 -05:00
Matthew Honnibal	92ae36f84e	Improve way noun chunks iterator is looked up	2017-06-04 21:53:39 +02:00
Matthew Honnibal	675f448313	Fix vector linkage on Doc	2017-06-04 14:25:30 -05:00
ines	459a1e8470	Fix whitespace	2017-06-03 11:31:18 +02:00
ines	5109bba910	Port over fix from #1070	2017-06-03 11:31:11 +02:00
Matthew Honnibal	498ad85309	Try using tensor for vector/similarity methods	2017-05-30 23:35:17 +02:00
Matthew Honnibal	4ddff020c3	Fix compile error	2017-05-28 23:30:40 +02:00
Matthew Honnibal	6d3caeadd2	Fix type check for long	2017-05-28 23:22:45 +02:00
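
The message for commit abf8b16d71 describes a retokenizer that queues merge requests inside a `with` block and applies them together when the block ends. The following is a minimal, self-contained sketch of that accumulate-then-apply pattern in plain Python. It is not spaCy's implementation: `ToyDoc`, `ToyRetokenizer`, and the `ent_types` list are hypothetical names invented for illustration, merges are given as token offsets rather than `Span` objects, and overlapping merges are assumed not to occur.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical helper: collects merge requests without applying them."""

    def __init__(self):
        self.merges = []  # queued (start, end, attrs) tuples

    def merge(self, start, end, attrs=None):
        # Only record the request; the document is not modified yet.
        self.merges.append((start, end, attrs or {}))


class ToyDoc:
    """Hypothetical stand-in for a document: parallel lists of words and labels."""

    def __init__(self, words):
        self.words = list(words)
        self.ent_types = [""] * len(self.words)

    @contextmanager
    def retokenize(self):
        retokenizer = ToyRetokenizer()
        yield retokenizer
        # Reached only if the block exits without an exception.
        self._apply(retokenizer.merges)

    def _apply(self, merges):
        # Apply right to left so the offsets of earlier merges stay valid.
        # Assumption: the requested ranges do not overlap.
        for start, end, attrs in sorted(merges, key=lambda m: m[0], reverse=True):
            self.words[start:end] = [" ".join(self.words[start:end])]
            self.ent_types[start:end] = [attrs.get("ent_type", "")]


doc = ToyDoc("I live in New York City near Central Park".split())
matches = [(3, 6, "GPE"), (7, 9, "LOC")]

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(start, end, attrs={"ent_type": label})
    # Nothing has been merged yet: all offsets still refer to the original tokens.
    tokens_inside_block = len(doc.words)

print(tokens_inside_block)
print(doc.words)
print(doc.ent_types)
```

Because every request is recorded against the original token offsets and applied in one pass, the loop body never has to account for indices shifting after an earlier merge, which is the source of errors the commit message alludes to.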

188 Commits