Commit Graph

90 Commits

Author SHA1 Message Date
Matthew Honnibal 6bc0f4d29f
Merge pull request #1611 from fsonntag/master
Solving #1494
2017-11-29 23:11:23 +01:00
Felix Sonntag 724ae7dc55 Fixed issue of infix capturing prefixes 2017-11-28 17:17:12 +01:00
Matthew Honnibal 542e6fd4ea Don't remove entries from specials 2017-11-23 12:17:42 +00:00
Felix Sonntag 33b0f86de3 Changed tokenizer to add infix when infix_start is offset 2017-11-19 16:32:10 +01:00
Roman Domrachev 61d28d03e4 Try again to do selective remove cache 2017-11-15 19:11:12 +03:00
Roman Domrachev b3311100c7 Merge branch 'master' of github.com:explosion/spaCy 2017-11-15 18:30:04 +03:00
Roman Domrachev 505c6a2f2f Completely cleanup tokenizer cache
Tokenizer cache can have be different keys than string

That modification can slow down tokenizer and need to be measured
2017-11-15 17:55:48 +03:00
Matthew Honnibal fe3c42a06b Fix caching in tokenizer 2017-11-15 13:55:46 +01:00
Roman Domrachev 91e2fa6561 Clean all caches 2017-11-14 21:15:04 +03:00
Daniel Hershcovich d7ae54ff44
Fix typo in message 2017-11-08 16:06:28 +02:00
ines 9659391944 Update deprecated methods and add warnings 2017-11-01 16:49:42 +01:00
ines d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
ines 72497c8cb2 Remove comments and add TODO 2017-10-25 12:15:43 +02:00
Matthew Honnibal b0f6fd3f1d Disable tokenizer cache for special-cases. Fixes #1250 2017-10-24 16:08:05 +02:00
Matthew Honnibal f45973848c Rename 'tokens' variable 'doc' in tokenizer 2017-10-17 18:21:41 +02:00
ines cd6a29dce7 Port over changes from #1294 2017-10-14 13:28:46 +02:00
ines 7c919aeb09 Make sure serializers and deserializers are ordered 2017-06-03 17:05:09 +02:00
ines 0153b66a86 Return self in Tokenizer.from_bytes 2017-06-03 13:26:13 +02:00
Matthew Honnibal 0561df2a9d Fix tokenizer serialization 2017-05-31 14:12:38 +02:00
Matthew Honnibal e9419072e7 Fix tokenizer serialisation 2017-05-31 13:43:31 +02:00
Matthew Honnibal 66af019d5d Fix serialization of tokenizer 2017-05-31 11:43:40 +02:00
Matthew Honnibal a318f0cae1 Add to/from disk/bytes methods for tokenizer 2017-05-29 12:24:41 +02:00
ines c5a653fa48 Update docstrings and API docs for Tokenizer 2017-05-21 13:18:14 +02:00
ines f216422ac5 Remove deprecated load classmethod 2017-05-21 13:18:01 +02:00
Matthew Honnibal 793430aa7a Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
ines e1efd589c3 Fix json imports and use ujson 2017-04-15 12:13:34 +02:00
ines c05ec4b89a Add compat functions and remove old workarounds
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines d24589aa72 Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
ines 561f2a3eb4 Use consistent formatting for docstrings 2017-04-15 11:59:21 +02:00
Raphaël Bournhonesque f332bf05be Remove unused import statements 2017-03-21 21:08:54 +01:00
Matthew Honnibal 0ac3d27689 Fix handling of trailing whitespace
Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
2017-03-08 15:01:40 +01:00
Matthew Honnibal 0a6d7ca200 Fix spacing after token_match
The boolean flag indicating a space after the token was
being set incorrectly after the token_match regex was applied.
Fixes #859.
2017-03-08 14:33:32 +01:00
Raphaël Bournhonesque dce8f5515e Allow zero-width 'infix' token 2017-01-23 18:28:01 +01:00
Ines Montani aa876884f0 Revert "Revert "Merge remote-tracking branch 'origin/master'""
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Matthew Honnibal a36353df47 Temporarily put back the tokenize_from_strings method, while tests aren't updated yet. 2016-11-04 19:18:07 +01:00
Matthew Honnibal e0c9695615 Fix doc strings for tokenizer 2016-11-02 23:15:39 +01:00
Matthew Honnibal e9e6fce576 Handle null prefix/suffix/infix search in tokenizer 2016-11-02 20:35:48 +01:00
Matthew Honnibal 8ce8803824 Fix JSON in tokenizer 2016-10-21 01:44:20 +02:00
Matthew Honnibal 95aaea0d3f Refactor so that the tokenizer data is read from Python data, rather than from disk 2016-09-25 14:49:53 +02:00
Matthew Honnibal fd65cf6cbb Finish refactoring data loading 2016-09-24 20:26:17 +02:00
Matthew Honnibal 83e364188c Mostly finished loading refactoring. Design is in place, but doesn't work yet. 2016-09-24 15:42:01 +02:00
Matthew Honnibal cc8bf62208 * Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens. 2016-05-09 13:23:47 +02:00
Matthew Honnibal 519366f677 * Fix Issue #351: Indices off when leading whitespace 2016-05-04 15:53:36 +02:00
Matthew Honnibal 04d0209be9 * Recognise multiple infixes in a token. 2016-04-13 18:38:26 +10:00
Henning Peters b8f63071eb add lang registration facility 2016-03-25 18:54:45 +01:00
Matthew Honnibal 141639ea3a * Fix bug in tokenizer that caused new tokens to be added for affixes 2016-02-21 23:17:47 +00:00
Matthew Honnibal f9e765cae7 * Add pipe() method to tokenizer 2016-02-03 02:32:37 +01:00
Matthew Honnibal 3e9961d2c4 * If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154 2016-01-16 17:08:59 +01:00
Henning Peters 235f094534 untangle data_path/via 2016-01-16 12:23:45 +01:00
Henning Peters 846fa49b2a distinct load() and from_package() methods 2016-01-16 10:00:57 +01:00