360 Commits

Author SHA1 Message Date
Matthew Honnibal abf8b16d71 Add doc.retokenize() context manager (#2172)
This patch takes a step towards #1487 by introducing the
doc.retokenize() context manager, to handle merging spans, and soon
splitting tokens.

The idea is to do merging and splitting like this:

with doc.retokenize() as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(doc[start : end], attrs={'ent_type': label})

The retokenizer accumulates the merge requests, and applies them
together at the end of the block. This will allow retokenization to be
more efficient, and much less error prone.
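
As a rough illustration of the accumulate-then-apply idea described above, here is a minimal pure-Python sketch. It is not spaCy's implementation: `ToyRetokenizer`, the module-level `retokenize()` function, and the list-of-dicts token representation are all hypothetical stand-ins, and offsets are passed directly instead of `Span` objects. It assumes the requested spans do not overlap.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in: queues merge requests instead of applying them."""

    def __init__(self):
        self.merges = []  # list of (start, end, attrs) tuples

    def merge(self, start, end, attrs=None):
        # Only record the request; nothing is modified yet.
        self.merges.append((start, end, attrs or {}))


@contextmanager
def retokenize(tokens):
    """Yield a retokenizer, then apply all queued merges when the block exits."""
    retokenizer = ToyRetokenizer()
    yield retokenizer
    # Apply from right to left so earlier offsets stay valid after each merge.
    for start, end, attrs in sorted(retokenizer.merges, key=lambda m: m[0], reverse=True):
        merged = {"text": " ".join(t["text"] for t in tokens[start:end]), "ent_type": ""}
        merged.update(attrs)
        tokens[start:end] = [merged]


words = "I live in New York City near Central Park".split()
tokens = [{"text": w, "ent_type": ""} for w in words]
matches = [(3, 6, "GPE"), (7, 9, "LOC")]  # made-up (start, end, label) matches

with retokenize(tokens) as retokenizer:
    for start, end, label in matches:
        retokenizer.merge(start, end, attrs={"ent_type": label})

print([t["text"] for t in tokens])
```

Because nothing changes until the block exits, every offset in `matches` refers to the original tokenization, which is the property that makes batching less error prone than merging one span at a time.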

A retokenizer.split() function will then be added, to handle splitting a
single token into multiple tokens. These methods take `Span` and `Token`
objects; if the user wants to go directly from offsets, they can append
to the .merges and .splits lists on the retokenizer.

The doc.merge() method's behaviour remains unchanged, so this patch
should be 100% backwards compatible (modulo bugs). Internally,
doc.merge() fixes up the arguments (to handle the various deprecated styles),
opens the retokenizer, and makes the single merge.

We can later start issuing deprecation warnings on direct calls to doc.merge(),
to migrate people to the retokenize context manager.
2018-04-03 14:10:35 +02:00
Matthew Honnibal 8308bbc617 Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts 2018-03-29 00:14:55 +02:00
ines 366c98a94b Remove requests dependency 2018-03-28 12:46:18 +02:00
ines ce6071ca89 Remove ftfy dependency and update docs 2018-03-28 12:09:42 +02:00
ines 6d2c85f428 Drop six and related hacks as a dependency 2018-03-28 10:45:25 +02:00
ines f5f4de98d1 Version-lock msgpack-python (see #2015) 2018-02-22 16:02:32 +01:00
ines 002ee80ddf Add html5lib to setup.py to fix six error (see #1924) 2018-02-02 20:32:08 +01:00
Matthew Honnibal 2e449c1fbf Fix compiler flags, addressing #1591 2018-01-14 14:34:36 +01:00
Matthew Honnibal 04a92bd75e Pin msgpack-numpy requirement 2017-12-06 03:24:24 +01:00
Hugo aa898ab4e4 Drop support for EOL Python 2.6 and 3.3 2017-11-26 19:46:24 +02:00
Matthew Honnibal 716ccbb71e Require thinc 6.10.1 2017-11-15 14:59:34 +01:00
Matthew Honnibal 314f5b9cdb Require thinc 6.10.0 2017-10-28 18:20:10 +00:00
Matthew Honnibal 64e4ff7c4b Merge 'tidy-up' changes into branch. Resolve conflicts 2017-10-28 13:16:06 +02:00
ines 7946464742 Remove spacy.tagger (now in pipeline) 2017-10-27 19:45:04 +02:00
Matthew Honnibal 531142a933 Merge remote-tracking branch 'origin/develop' into feature/better-parser 2017-10-27 12:34:48 +00:00
Matthew Honnibal 642eb28c16 Don't compile with OpenMP by default 2017-10-27 10:16:58 +00:00
Matthew Honnibal 90d1d9b230 Remove obsolete parser code 2017-10-26 13:22:45 +02:00
Matthew Honnibal 79fcf8576a Compile with march=native 2017-10-18 21:46:34 +02:00
Matthew Honnibal 2eb0fe4957 Fix setup.py 2017-10-03 21:40:04 +02:00
Matthew Honnibal b49cc8153a Require correct thinc 2017-09-26 10:00:18 -05:00
ines 68f66aebf8 Use pkg_resources instead of pip for is_package (resolves #1293) 2017-09-16 20:27:59 +02:00
Matthew Honnibal 07cdbd1219 Require thinc 6.8.1, for Windows 2017-09-15 22:47:53 +02:00
Matthew Honnibal 96a4a9070b Compile _beam_utils 2017-08-18 21:56:19 +02:00
Matthew Honnibal f9ae86b01c Fix requirement 2017-08-18 20:56:53 +02:00
Matthew Honnibal 69bcacdc09 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-08-18 20:47:13 +02:00
Matthew Honnibal de7f3509d2 Compile CFile, for vector loading 2017-08-18 20:46:41 +02:00
Matthew Honnibal 426f84937f Resolve conflicts when merging new beam parsing stuff 2017-08-18 13:38:32 -05:00
Matthew Honnibal 60d8111245 Require thinc 6.8.1 2017-08-15 03:12:26 -05:00
Matthew Honnibal 52c180ecf5 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal b353e4d843 Work on parser beam training 2017-08-12 14:47:45 -05:00
ines 495e042429 Add entry point-style auto alias for "spacy"
Simplest way to run commands as spacy xxx instead of python -m spacy
xxx, while avoiding environment conflicts
2017-08-09 12:17:30 +02:00
Matthew Honnibal ff7418b0d9 Update requirements 2017-07-25 18:58:15 +02:00
Matthew Honnibal b4cdd05466 Add vectors.pyx in setup 2017-06-05 12:45:29 +02:00
Matthew Honnibal c811790095 Register vectors.pyx in setup 2017-06-05 12:32:22 +02:00
ines 152dc018a6 Remove syntax iterators from setup.py 2017-06-05 12:30:22 +02:00
Matthew Honnibal a4dcc96c54 Require thinc bugfix 2017-06-05 04:02:52 -05:00
ines 71954d5fe7 Update Thinc version 2017-06-03 10:32:53 +02:00
ines f45cd174bf Update Thinc version 2017-06-02 18:48:16 +02:00
Matthew Honnibal ae8010b526 Move weight serialization to Thinc 2017-06-01 02:56:12 -05:00
Matthew Honnibal 2e364f7ecd Require msgpack 2017-05-29 13:47:29 +02:00
ines 3cc6fe1484 Add pip to requirements.txt and setup.py 2017-05-17 12:04:03 +02:00
Matthew Honnibal 48de4ed49f Require thinc 6.6, and compile the nn_parser module 2017-05-14 01:20:28 +02:00
Matthew Honnibal 825c6403d8 Remove serializer 2017-05-09 17:28:30 +02:00
ines 564939391a Remove spacy.orth 2017-05-09 01:21:47 +02:00
ines 229b8c3974 Tidy up 2017-05-07 18:36:35 +02:00
ines a793174ae9 Use setuptools.find_packages() 2017-05-03 20:11:02 +02:00
Yasuaki Uechi c8f83aeb87 Add basic Japanese support 2017-05-03 13:56:21 +09:00
Ines Montani 7da9cefd25 Merge pull request #1022 from luvogels/master
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani 417f430d23 Relax version constraint 2017-04-20 15:39:24 +02:00
Gyorgy Orosz 4a06a2572c Using ftfy for handling broken encoded strings. 2017-04-20 13:34:51 +02:00