Commit Graph

188 Commits

Author SHA1 Message Date
Matthew Honnibal abf8b16d71
Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards incompatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start making deprecation warnings on direct calls to doc.merge(),
to migrate people to use of the retokenize context manager.
2018-04-03 14:10:35 +02:00
Matthew Honnibal 0b375d50c8 Fix ent_iob tags in doc.merge to avoid inconsistent sequences 2018-03-28 18:39:03 +02:00
Matthew Honnibal e807f88410 Resolve merge when cherry-picking ent iob patches from develop 2018-03-28 18:38:13 +02:00
Matthew Honnibal 99fbc7db33 Improve error message when entity sequence is inconsistent 2018-03-28 18:36:53 +02:00
ines 9e83513004 Add position of invalid token to error message 2018-03-27 23:56:59 +02:00
ines 693971dd8f Improve error message if token text is empty string (see #2101) 2018-03-27 22:25:40 +02:00
ines 0c829e6605 Fix whitespace 2018-03-27 22:20:59 +02:00
Thomas Opsomer 515e25910e fix sent_start in serialization 2018-01-28 19:50:42 +01:00
Matthew Honnibal 56164ab688 Set l_edge and r_edge correctly for non-projective parses. Fixes #1799 2018-01-22 20:18:04 +01:00
Matthew Honnibal ccb51a9f36 Make .similarity() return 1.0 if all orth attrs match 2018-01-15 16:29:48 +01:00
Matthew Honnibal ab7c45b12d Fix error message and handling of doc.sents 2018-01-15 15:21:11 +01:00
Matthew Honnibal e10e9ad2c5 Improve efficiency of Doc.to_array 2017-11-23 12:33:27 +00:00
Matthew Honnibal fa62427300 Remove lookup-based lemmatization 2017-11-23 12:32:22 +00:00
ines 1c218397f6 Ensure path in Doc.to_disk/from_disk (resolves ##1521)
Also add Doc serialization tests with both Path and string path options
2017-11-09 02:29:03 +01:00
Matthew Honnibal 144a93c2a5 Back-off to tensor for similarity if no vectors 2017-11-03 20:56:33 +01:00
Matthew Honnibal 62ed58935a Add Doc.extend_tensor() method 2017-11-03 11:20:31 +01:00
ines 9659391944 Update deprecated methods and add warnings 2017-11-01 16:49:42 +01:00
ines 705a4e3e4a Fix formatting 2017-11-01 16:44:08 +01:00
Matthew Honnibal 7e7116cdf7 Fix Doc.to_array when only one string attr provided 2017-11-01 13:26:43 +01:00
ines 544a407b93 Tidy up Doc, Token and Span and add missing docs 2017-10-27 17:07:26 +02:00
ines 6a0483b7aa Tidy up and document Doc, Token and Span 2017-10-27 15:41:45 +02:00
Matthew Honnibal ccd2ab1a62 Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix
Add LCA matrix for spans and docs
2017-10-24 11:22:46 +02:00
Ramanan Balakrishnan d2fe56a577
Add LCA matrix for spans and docs 2017-10-20 23:58:00 +05:30
Ramanan Balakrishnan 0726946563
cleanup to_array implementation using fixes on master 2017-10-20 17:09:37 +05:30
Ramanan Balakrishnan b3ab124fc5
Support strings for attribute list in doc.to_array 2017-10-20 11:46:57 +05:30
Ramanan Balakrishnan 7b9b1be44c
Support single value for attribute list in doc.to_array 2017-10-19 17:00:41 +05:30
Matthew Honnibal 394633efce Make doc pickling support hooks 2017-10-17 19:44:09 +02:00
Matthew Honnibal cdb0c426d8 Improve deserialization of user_data, esp. for Underscore 2017-10-17 19:29:20 +02:00
Matthew Honnibal 32a8564c79 Fix doc pickling 2017-10-17 18:20:24 +02:00
Matthew Honnibal 92c1eb2d6f Fix Doc pickling. This also removes need for Binder class 2017-10-17 16:11:13 +02:00
Matthew Honnibal a002264fec Remove caching of Token in Doc, as caused cycle. 2017-10-16 19:34:21 +02:00
ines e0ff145a8b Merge branch 'develop' into feature/dot-underscore 2017-10-11 11:57:05 +02:00
Matthew Honnibal 3b527fa52b Call morphology.assign_untagged when pushing token to Doc 2017-10-11 03:23:57 +02:00
Matthew Honnibal e0a9b02b67 Merge Span._ and Span.as_doc methods 2017-10-09 22:00:15 -05:00
Matthew Honnibal e938bce320 Adjust parsing transition system to allow preset sentence segments. 2017-10-08 23:53:34 +02:00
Matthew Honnibal 668a0ea640 Pass extensions into Underscore class 2017-10-07 18:56:01 +02:00
ines 2480f8f521 Add missing return in Doc.from_disk() (closes #1330) 2017-09-18 15:32:00 +02:00
Matthew Honnibal 03b5b9727a Fix Doc.vector for empty doc objects 2017-08-22 19:52:19 +02:00
Matthew Honnibal 0551b7b03a Fix doc.vector 2017-08-22 19:46:52 +02:00
Matthew Honnibal 8b7ac77c23 Allow span label to be string in Doc.char_span 2017-08-19 16:18:09 +02:00
Matthew Honnibal 80236116a6 Add Doc.char_span method, to get a span by character offset 2017-08-19 12:21:09 +02:00
Matthew Honnibal a6a2159969 Add slot for text categories to Doc 2017-07-22 00:34:15 +02:00
Matthew Honnibal 2a3bd5ee90 Fix fetching of noun chunk iterator 2017-06-04 15:53:05 -05:00
Matthew Honnibal 92ae36f84e Improve way noun chunks iterator is looked up 2017-06-04 21:53:39 +02:00
Matthew Honnibal 675f448313 Fix vector linkage on Doc 2017-06-04 14:25:30 -05:00
ines 459a1e8470 Fix whitespace 2017-06-03 11:31:18 +02:00
ines 5109bba910 Port over fix from #1070 2017-06-03 11:31:11 +02:00
Matthew Honnibal 498ad85309 Try using tensor for vector/similarity methdos 2017-05-30 23:35:17 +02:00
Matthew Honnibal 4ddff020c3 Fix compile error 2017-05-28 23:30:40 +02:00
Matthew Honnibal 6d3caeadd2 Fix type check for long 2017-05-28 23:22:45 +02:00