spaCy

Commit Graph

Author	SHA1	Message	Date
Matthew Honnibal	82277f63a3	💫 Small efficiency fixes to tokenizer (#2587) This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` indexed on different data in v1, but in v2 the orth and the hash are identical. The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This checks whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. This check led to incorrectly rejecting some chunks from the cache. With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second. Prior to this patch, we had 465,764 words per second. Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to: * Fix the variable-length lookarounds in the suffix, infix and `token_match` rules * Improve the performance of the `token_match` regex * Switch back from the `regex` library to the `re` library. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:35:54 +02:00
Explosion Bot	7b56b2f04b	Add Vocab.cfg attr, to hold stuff like oov probs	2017-10-30 16:08:50 +01:00
Matthew Honnibal	a131981f3b	Work on vectors	2017-05-30 23:34:50 +02:00
Matthew Honnibal	9e167b7bb6	Strip serializer from code	2017-05-09 17:28:50 +02:00
Matthew Honnibal	ca32a1ab01	Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." This reverts commit `8423e8627f`.	2016-09-30 20:20:22 +02:00
Matthew Honnibal	8423e8627f	Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.	2016-09-30 10:14:47 +02:00
Matthew Honnibal	95aaea0d3f	Refactor so that the tokenizer data is read from Python data, rather than from disk	2016-09-25 14:49:53 +02:00
Matthew Honnibal	85e7944572	* Start trying to pickle Vocab	2015-10-13 13:44:41 +11:00
Matthew Honnibal	362526b592	* Rename vectors_length attribute	2015-09-15 14:43:31 +10:00
Matthew Honnibal	e285ca7d6c	* Load serializer freqs in vocab	2015-09-10 15:22:48 +02:00
Matthew Honnibal	86c888667f	* Merge in changes from de branch	2015-09-06 19:49:28 +02:00
Matthew Honnibal	d2fc104a26	* Begin merge of Gazetteer and DE branches	2015-09-06 19:45:15 +02:00
Matthew Honnibal	c2307fa9ee	* More work on language-generic parsing	2015-08-28 02:02:33 +02:00
Matthew Honnibal	2d521768a3	* Store Morphology class in Vocab	2015-08-26 19:21:03 +02:00
Matthew Honnibal	6f1743692a	* Work on language-independent refactoring	2015-08-23 20:49:18 +02:00
Matthew Honnibal	fd525f0675	* Pass OOV probability around	2015-07-25 23:29:51 +02:00
Matthew Honnibal	a7c4d72e83	* Add serializer property to Vocab, and lazy-load it. Add get_by_orth method.	2015-07-23 01:18:19 +02:00
Matthew Honnibal	109106a949	* Replace UniStr, using unicode objects instead	2015-07-22 04:52:05 +02:00
Matthew Honnibal	317cbbc015	* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.	2015-07-19 15:18:17 +02:00
Matthew Honnibal	82d84b0f2b	* Index lexemes by orth, instead of a lexemes vector. Breaks the mechanism for deciding not to own LexemeC structs during parsing. Need to reinstate this.	2015-07-18 22:42:15 +02:00
Matthew Honnibal	4dddc8a69b	* Fix type declarations for attr_t. Remove unused id_t.	2015-07-18 22:39:57 +02:00
Matthew Honnibal	db9dfd2e23	* Major refactor of serialization. Nearly complete now.	2015-07-17 01:27:54 +02:00
Matthew Honnibal	af5cc926a4	* Add codec property to Vocab, to use the Huffman encoding	2015-07-13 13:55:14 +02:00
Matthew Honnibal	abc43b852d	* Add pos_tags attr to Vocab.	2015-07-08 12:36:38 +02:00
Matthew Honnibal	c04e6ebca6	* Allow user to load different sized vectors.	2015-06-05 16:26:39 +02:00
Jordan Suchow	3a8d9b37a6	Remove trailing whitespace	2015-04-19 13:01:38 -07:00
Matthew Honnibal	0930892fc1	* Tmp. Working on refactor. Compiles, must hook up lexical feats.	2015-01-14 00:03:48 +11:00
Matthew Honnibal	ce2edd6312	* Tmp commit. Refactoring to create a Python Lexeme class.	2015-01-12 10:26:22 +11:00
Matthew Honnibal	b8b65903fc	* Tmp	2014-12-24 17:42:00 +11:00
Matthew Honnibal	d11c1edf8c	* Import slice_unicode from strings.pyx	2014-12-20 07:56:26 +11:00
Matthew Honnibal	116f7f3bc1	* Rename Lexicon to Vocab, and move it to its own file	2014-12-20 06:54:03 +11:00

31 Commits