Commit Graph

1477 Commits

Author SHA1 Message Date
Matthew Honnibal 2a5d050134 * Give codec loading back to Vocab. 2015-07-16 17:45:42 +02:00
Matthew Honnibal 8bf0f65f1c * Remove dead code in strings.pyx 2015-07-16 17:35:53 +02:00
Matthew Honnibal a9c3863665 * Fix inefficiency in StringStore.dump function 2015-07-16 17:34:32 +02:00
Matthew Honnibal b59d271510 * Move serialization functionality into Serializer class 2015-07-16 11:23:48 +02:00
Matthew Honnibal 30be4f15da * Import attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:23:25 +02:00
Matthew Honnibal 6c99e5f4aa * Move serialization into Serializer class, with __call__ and train() api 2015-07-16 11:22:35 +02:00
Matthew Honnibal e2133d990e * Move serialization functionality out into a Serializer object 2015-07-16 11:21:44 +02:00
Matthew Honnibal a6d040bd11 * Import Lexeme attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:20:08 +02:00
Matthew Honnibal d8bc279e0c * Fix 'you' contraction capitals in specials.json 2015-07-16 01:28:32 +02:00
Matthew Honnibal 45ae1ce428 * Remove unused declaration in parser 2015-07-16 01:27:11 +02:00
Matthew Honnibal efa80096f1 * Upd attrs id list 2015-07-16 01:26:54 +02:00
Matthew Honnibal 01fab6bb90 * Improve de/serialize functions 2015-07-16 01:26:35 +02:00
Matthew Honnibal 0e07c1ed2a * draft de/serialization functions in doc.pyx 2015-07-16 01:16:33 +02:00
Matthew Honnibal 9d956b07e9 * Fix import of attrs in doc.pyx, and update the get_token_attr function. 2015-07-16 01:15:34 +02:00
Matthew Honnibal 65251e7625 * Remove redundant attr_id_t from typedefs.pxd 2015-07-16 00:58:51 +02:00
Matthew Honnibal 9a8db9743c * Remove gil from parser.call 2015-07-14 23:47:33 +02:00
Matthew Honnibal 3c1e3e9ee8 * Fix capitalization problems in specials.json 2015-07-14 23:46:31 +02:00
Matthew Honnibal 38ca0c33f5 Merge branch 'neuralnet' into refactor
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.

Conflicts:
	setup.py
	spacy/syntax/parser.pxd
	spacy/syntax/parser.pyx
	spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal 6405d2384c * Add first draft of annotation standards doc 2015-07-14 12:50:13 +02:00
Matthew Honnibal 935ac53ee3 * Extend count_by method 2015-07-14 03:20:09 +02:00
Matthew Honnibal 39c93116eb * Add get_freqs script 2015-07-14 02:31:32 +02:00
Matthew Honnibal 3b5baa660f * Fix tokenizer 2015-07-14 00:10:51 +02:00
Matthew Honnibal 2ae0b439b2 * Fix space check in gold.pyx 2015-07-14 00:10:27 +02:00
Matthew Honnibal 81aa4e6dcc * Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API 2015-07-14 00:10:11 +02:00
Matthew Honnibal e1c702e498 * Upd tests after refactor 2015-07-14 00:08:50 +02:00
Matthew Honnibal ba9a22ae0b * Ignore cpp files in spacy/tokens 2015-07-13 22:30:15 +02:00
Matthew Honnibal 98382bd7a0 * Update tests after refactor 2015-07-13 22:30:01 +02:00
Matthew Honnibal d87d71caf4 * Compile the new modules after refactor 2015-07-13 22:29:33 +02:00
Matthew Honnibal 24d6ce99ec * Add comment to tokenizer, explaining the spacy attr 2015-07-13 22:29:13 +02:00
Matthew Honnibal 8214b74eec * Restore _py_tokens cache, to handle orphan tokens. 2015-07-13 22:28:10 +02:00
Matthew Honnibal 67641f3b58 * Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string 2015-07-13 21:46:02 +02:00
Matthew Honnibal 6eef0bf9ab * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx 2015-07-13 20:20:58 +02:00
Matthew Honnibal 3ea8756c24 * Add spacy/tokens/doc.pyx, for Doc class in its own file 2015-07-13 19:58:26 +02:00
Matthew Honnibal c99387155f * Refactor tokens, moving classes into a module instead of a single file 2015-07-13 19:49:55 +02:00
Matthew Honnibal d27899658e * Import classes in spacy.tokens.__init__ 2015-07-13 19:48:55 +02:00
Matthew Honnibal aa82caf8f5 * Add TokenC.spacy attr 2015-07-13 19:48:07 +02:00
Matthew Honnibal dba6b47d4e * Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference 2015-07-13 19:20:48 +02:00
Matthew Honnibal 5b0a7190c9 * Round-trip for serialization finally working. Needs a lot of optimization. 2015-07-13 18:39:38 +02:00
Matthew Honnibal edd371246c * Make huffman coder take BitArray in encode/decode. Add __iter__ method to BitArray. 2015-07-13 17:33:33 +02:00
Matthew Honnibal af5cc926a4 * Add codec property to Vocab, to use the Huffman encoding 2015-07-13 13:55:14 +02:00
Matthew Honnibal 77385d5580 * Make .pxd file for huffman codec 2015-07-13 13:54:51 +02:00
Matthew Honnibal 0628e0e2a8 * Add tests for huffman encoding 2015-07-13 12:58:07 +02:00
Matthew Honnibal 083b6ea7ae * Clean up encoder a bit. now read for integration into Vocab. 2015-07-13 12:57:22 +02:00
Matthew Honnibal 8d0f1d98da * Draft dockstring for HuffmanCache 2015-07-13 12:01:18 +02:00
Matthew Honnibal 281f1faefb * Nearly finished huffman coder 2015-07-12 23:48:46 +02:00
Matthew Honnibal e1a25fba32 * Work on huffman coder 2015-07-12 19:58:05 +02:00
Matthew Honnibal 3fb9de2d13 * Remove vector[bint], in favor of simple Code struct. 2015-07-12 17:58:27 +02:00
Matthew Honnibal aa7bfd932b * Work on compressor 2015-07-12 16:03:43 +02:00
Matthew Honnibal 14eafcab15 * Refactor to use vector[bint] 2015-07-12 05:27:47 +02:00
Matthew Honnibal 6a6e852a39 * Refactor huffman coding stuff into class 2015-07-12 05:06:36 +02:00