101 Commits

Author SHA1 Message Date
Sofie 46dfe773e1 Replacing regex library with re to increase tokenization speed (#3218)
* replace unicode categories with raw list of code points

* simplifying ranges

* fixing variable length quotes

* removing redundant regular expression

* small cleanup of regexp notations

* quotes and alpha as ranges instead of alternations

* removed most regexp dependencies and features

* exponential backtracking - unit tests

* rewrote expression with pathological backtracking

* disabling double hyphen tests for now

* test additional variants of repeating punctuation

* remove regex and redundant backslashes from load_reddit script

* small typo fixes

* disable double punctuation test for russian

* clean up old comments

* format block code

* final cleanup

* naming consistency

* french strings as unicode for python 2 support

* french regular expression case insensitive
2019-02-01 18:05:22 +11:00
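
The "pathological backtracking" mentioned in the commit above can be illustrated with a small sketch. The patterns here are hypothetical and are not the expressions that were rewritten in the tokenizer; they only show how a nested quantifier can make the standard-library `re` engine try a very large number of splits on a near-miss input, and how a single quantifier accepts the same strings without that cost.

```python
import re
import time

# Hypothetical patterns for illustration only (not taken from spaCy).
# Nested quantifiers: on a near-miss input the engine may try a huge
# number of ways to split the run of "a" characters before giving up.
SLOW = re.compile(r"^(?:a+)+$")
# Accepts exactly the same strings, but with a single quantifier.
FAST = re.compile(r"^a+$")


def time_match(pattern, text):
    """Return the match result and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


# A run of "a" followed by a character that forces the match to fail.
text = "a" * 20 + "!"
slow_result, slow_secs = time_match(SLOW, text)
fast_result, fast_secs = time_match(FAST, text)

print("nested quantifier:", slow_result, "%.4fs" % slow_secs)
print("single quantifier:", fast_result, "%.4fs" % fast_secs)
```

Both patterns reject the input; only the time taken differs, and it grows quickly for the nested form as the run of `a` characters gets longer.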
Matthew Honnibal 82277f63a3 💫 Small efficiency fixes to tokenizer (#2587)
This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. `vocab._by_orth` and `vocab._by_hash` were indexed on different data in v1, but in v2 the orth and the hash are identical.

The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This flag records whether a chunk we're tokenizing triggers a special-case rule. If it does, we avoid caching within the chunk. Because the flag was uninitialized, the check incorrectly rejected some chunks from the cache.

With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second. Prior to this patch, we had 465,764 words per second.

Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to:

* Fix the variable-length lookarounds in the suffix, infix and `token_match` rules
* Improve the performance of the `token_match` regex
* Switch back from the `regex` library to the `re` library.
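
The first and last points above are linked: the standard-library `re` module only accepts fixed-width look-behind assertions, while the third-party `regex` package also accepts variable-width ones. The sketch below uses a hypothetical pattern (not one of spaCy's actual suffix or infix rules) to show the restriction and one fixed-width rewrite.

```python
import re

# Hypothetical pattern for illustration: a period preceded by one or more
# digits. The look-behind has variable width, which `re` does not support.
VARIABLE_WIDTH = r"(?<=[0-9]+)\."

try:
    re.compile(VARIABLE_WIDTH)
    compiled_ok = True
except re.error as err:
    compiled_ok = False
    print("re rejected the pattern:", err)

# Fixed-width rewrite: checking a single preceding digit is enough here,
# because "preceded by one or more digits" implies "preceded by a digit".
FIXED_WIDTH = re.compile(r"(?<=[0-9])\.")
match = FIXED_WIDTH.search("version 12.")
print(match.start())
```

Not every variable-width look-behind can be reduced this way. Some need to be split into several fixed-width alternatives or restructured entirely.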

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:35:54 +02:00
Matthew Honnibal 43dcaa473e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-06 12:36:42 +02:00
Matthew Honnibal 6c8d627733 Fix tokenizer deserialization 2018-07-06 12:36:33 +02:00
ines c001d46153 Tidy up 2018-07-06 12:33:42 +02:00
Matthew Honnibal 63f5651f8d Fix tokenizer serialization 2018-07-06 12:32:11 +02:00
Matthew Honnibal 1a2f61725c Fix tokenizer serialization 2018-07-06 12:23:04 +02:00
ines 63666af328 Merge branch 'master' into develop 2018-07-04 14:52:25 +02:00
Bùi Trung Chí 9af46b4f1b Fix loading tokenizer with custom prefix search (#2495)
* Add contributor agreement

* Fix loading tokenizer with custom prefix search
2018-07-04 12:56:07 +02:00
Matthew Honnibal 46d8a66fef Fix tokenizer serialization if token_match is None 2018-06-29 14:24:46 +02:00
Ines Montani 3141e04822 💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
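
The decorator idea described in the commit above ("it only modifies the string when it's retrieved from the containing class") can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code from the commit: the names `add_codes` and `Errors`, the codes, and the message texts are all chosen for the example.

```python
def add_codes(err_cls):
    """Class decorator that prefixes each message with its code on access."""

    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            # Look the message up on the original class, then prepend the
            # attribute name, so the stored strings stay unmodified.
            msg = getattr(err_cls, code)
            return "[{code}] {msg}".format(code=code, msg=msg)

    return ErrorsWithCodes()


@add_codes
class Errors(object):
    # Hypothetical codes and messages, for illustration only.
    E001 = "No component '{name}' found in pipeline."
    E002 = "Can't find factory for '{name}'."


message = Errors.E001.format(name="parser")
print(message)
```

Because the prefix is added at lookup time, the message can be raised in an ordinary exception and no traceback handling is involved.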
Matthew Honnibal 6bc0f4d29f Merge pull request #1611 from fsonntag/master
Solving #1494
2017-11-29 23:11:23 +01:00
Felix Sonntag 724ae7dc55 Fixed issue of infix capturing prefixes 2017-11-28 17:17:12 +01:00
Matthew Honnibal 542e6fd4ea Don't remove entries from specials 2017-11-23 12:17:42 +00:00
Felix Sonntag 33b0f86de3 Changed tokenizer to add infix when infix_start is offset 2017-11-19 16:32:10 +01:00
Roman Domrachev 61d28d03e4 Try again to do selective remove cache 2017-11-15 19:11:12 +03:00
Roman Domrachev b3311100c7 Merge branch 'master' of github.com:explosion/spaCy 2017-11-15 18:30:04 +03:00
Roman Domrachev 505c6a2f2f Completely cleanup tokenizer cache
The tokenizer cache can have keys other than strings.

This modification may slow down the tokenizer and needs to be measured.
2017-11-15 17:55:48 +03:00
Matthew Honnibal fe3c42a06b Fix caching in tokenizer 2017-11-15 13:55:46 +01:00
Roman Domrachev 91e2fa6561 Clean all caches 2017-11-14 21:15:04 +03:00
Daniel Hershcovich d7ae54ff44 Fix typo in message 2017-11-08 16:06:28 +02:00
ines 9659391944 Update deprecated methods and add warnings 2017-11-01 16:49:42 +01:00
ines d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
ines 72497c8cb2 Remove comments and add TODO 2017-10-25 12:15:43 +02:00
Matthew Honnibal b0f6fd3f1d Disable tokenizer cache for special-cases. Fixes #1250 2017-10-24 16:08:05 +02:00
Matthew Honnibal f45973848c Rename 'tokens' variable 'doc' in tokenizer 2017-10-17 18:21:41 +02:00
ines cd6a29dce7 Port over changes from #1294 2017-10-14 13:28:46 +02:00
ines 7c919aeb09 Make sure serializers and deserializers are ordered 2017-06-03 17:05:09 +02:00
ines 0153b66a86 Return self in Tokenizer.from_bytes 2017-06-03 13:26:13 +02:00
Matthew Honnibal 0561df2a9d Fix tokenizer serialization 2017-05-31 14:12:38 +02:00
Matthew Honnibal e9419072e7 Fix tokenizer serialisation 2017-05-31 13:43:31 +02:00
Matthew Honnibal 66af019d5d Fix serialization of tokenizer 2017-05-31 11:43:40 +02:00
Matthew Honnibal a318f0cae1 Add to/from disk/bytes methods for tokenizer 2017-05-29 12:24:41 +02:00
ines c5a653fa48 Update docstrings and API docs for Tokenizer 2017-05-21 13:18:14 +02:00
ines f216422ac5 Remove deprecated load classmethod 2017-05-21 13:18:01 +02:00
Matthew Honnibal 793430aa7a Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
ines e1efd589c3 Fix json imports and use ujson 2017-04-15 12:13:34 +02:00
ines c05ec4b89a Add compat functions and remove old workarounds
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines d24589aa72 Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
ines 561f2a3eb4 Use consistent formatting for docstrings 2017-04-15 11:59:21 +02:00
Raphaël Bournhonesque f332bf05be Remove unused import statements 2017-03-21 21:08:54 +01:00
Matthew Honnibal 0ac3d27689 Fix handling of trailing whitespace
Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
2017-03-08 15:01:40 +01:00
Matthew Honnibal 0a6d7ca200 Fix spacing after token_match
The boolean flag indicating a space after the token was
being set incorrectly after the token_match regex was applied.
Fixes #859.
2017-03-08 14:33:32 +01:00
Raphaël Bournhonesque dce8f5515e Allow zero-width 'infix' token 2017-01-23 18:28:01 +01:00
Ines Montani aa876884f0 Revert "Revert "Merge remote-tracking branch 'origin/master'""
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Matthew Honnibal a36353df47 Temporarily put back the tokenize_from_strings method, while tests aren't updated yet. 2016-11-04 19:18:07 +01:00
Matthew Honnibal e0c9695615 Fix doc strings for tokenizer 2016-11-02 23:15:39 +01:00
Matthew Honnibal e9e6fce576 Handle null prefix/suffix/infix search in tokenizer 2016-11-02 20:35:48 +01:00
Matthew Honnibal 8ce8803824 Fix JSON in tokenizer 2016-10-21 01:44:20 +02:00
Matthew Honnibal 95aaea0d3f Refactor so that the tokenizer data is read from Python data, rather than from disk 2016-09-25 14:49:53 +02:00