1720 Commits

Author SHA1 Message Date
Matthew Honnibal ae78c9e3ce * Implement character-based codec, so that we can do word/char backoff 2015-07-19 22:03:39 +02:00
Matthew Honnibal cd1d047cb8 * Delete out-dated HuffmanCodec comment 2015-07-19 18:28:14 +02:00
Matthew Honnibal 879ef9fa3e * Update tests for huffman codec 2015-07-19 17:59:51 +02:00
Matthew Honnibal b8086067d5 * Build Huffman codec from unsorted inputs 2015-07-19 17:58:44 +02:00
Matthew Honnibal 317cbbc015 * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. 2015-07-19 15:18:17 +02:00
Matthew Honnibal 0973e2f107 * Update serializer tests 2015-07-18 22:46:40 +02:00
Matthew Honnibal 6b13e7227c * Remove duplicate get_lex_attr method from doc.pyx 2015-07-18 22:46:07 +02:00
Matthew Honnibal e49c7f1478 * Update oov check in tokenizer 2015-07-18 22:45:28 +02:00
Matthew Honnibal cfd842769e * Allow infix tokens to be variable length 2015-07-18 22:45:00 +02:00
Matthew Honnibal 5b4c78bbb2 * Use an AttributeCodec based on orth for words. Still no oov handling mechanism. 2015-07-18 22:43:18 +02:00
Matthew Honnibal 82d84b0f2b * Index lexemes by orth, instead of a lexemes vector. Breaks the mechanism for deciding not to own LexemeC structs during parsing. Need to reinstate this. 2015-07-18 22:42:15 +02:00
Matthew Honnibal 4dddc8a69b * Fix type declarations for attr_t. Remove unused id_t. 2015-07-18 22:39:57 +02:00
Matthew Honnibal ced59ab9ea * Make minor efficiency improvement in Doc.__iter__ 2015-07-18 04:10:53 +02:00
Matthew Honnibal cd91914dd8 * Fix hard-coded length 2015-07-18 04:09:56 +02:00
Matthew Honnibal b1d74ce60d * Remove unused joint.pyx and joint.pxd files 2015-07-17 23:31:44 +02:00
Matthew Honnibal c27514512b * Remove cruft ner/ directory 2015-07-17 23:24:32 +02:00
Matthew Honnibal f8d6d319f4 * Remove cruft module 2015-07-17 23:23:05 +02:00
Matthew Honnibal fb0a641a2d * Don't release the gil around Parser.parse. Does this indicate thread problems? 2015-07-17 23:07:37 +02:00
Matthew Honnibal a6ff7e6ca4 * Fix redundant options in train.py 2015-07-17 22:38:05 +02:00
Matthew Honnibal e29daea85f * Fix bint/int typing problem in TransitionSystem. In C++ bint* means bool*, but in C it means int*. So, type-casting to bint* is unsafe. 2015-07-17 22:37:24 +02:00
Matthew Honnibal 6cfa83157e Merge branch 'refactor' of ssh://github.com/honnibal/spaCy into refactor 2015-07-17 21:38:04 +02:00
Matthew Honnibal f7f0ad1a78 * Fix tests 2015-07-17 21:31:44 +02:00
Matthew Honnibal 68374149ae * Move huffman encoding test to tests/serialize directory 2015-07-17 21:22:18 +02:00
Matthew Honnibal e950f5a408 * Tests for serializer 2015-07-17 21:21:10 +02:00
Matthew Honnibal cf0c788892 * Tests passing on round-trip pack/unpack on basic example 2015-07-17 21:20:48 +02:00
Matthew Honnibal 44f39a876f * Add a blank attrs.pyx 2015-07-17 16:40:42 +02:00
Matthew Honnibal c2c83120d4 * Remove codec property from Vocab 2015-07-17 16:40:11 +02:00
Matthew Honnibal dfdf19f6a9 * Draft a from_orth method for Doc 2015-07-17 16:39:54 +02:00
Matthew Honnibal a9149fdcbd * Compile attrs.pyx 2015-07-17 16:39:25 +02:00
Matthew Honnibal 9e3f17051b * Move to ORTH instead of ID for encoding lexemes. Basic tests of the codec wrappers now passing 2015-07-17 16:38:29 +02:00
Matthew Honnibal 15ff739996 * Fix passing of ID attribute in string store 2015-07-17 14:49:42 +02:00
Matthew Honnibal 95e57c2780 * Remove unnecessary key and id properties from Utf8String. 2015-07-17 01:40:18 +02:00
Matthew Honnibal 234c7e440a * Add spacy/serialize/__init__ files 2015-07-17 01:37:33 +02:00
Matthew Honnibal 221f7e51c7 * Ignore spacy/serialize/*.cpp 2015-07-17 01:36:49 +02:00
Matthew Honnibal db9dfd2e23 * Major refactor of serialization. Nearly complete now. 2015-07-17 01:27:54 +02:00
Matthew Honnibal c8282f9934 * Work on serialization. Needs more reorganisation 2015-07-16 19:56:02 +02:00
Matthew Honnibal d8458d6a25 * Fix attr_id_t import in Spans 2015-07-16 19:55:21 +02:00
Matthew Honnibal d1cb30dbc4 * Remove unnecessary key and id properties from Utf8String. 2015-07-16 19:29:02 +02:00
Matthew Honnibal 897de2d438 * Add 'bitter' property for serializer in English class 2015-07-16 17:47:53 +02:00
Matthew Honnibal fb54052ae0 * Work on serializer design 2015-07-16 17:46:46 +02:00
Matthew Honnibal a6f401580d * Add from_array function to Doc. 2015-07-16 17:46:11 +02:00
Matthew Honnibal 2a5d050134 * Give codec loading back to Vocab. 2015-07-16 17:45:42 +02:00
Matthew Honnibal 8bf0f65f1c * Remove dead code in strings.pyx 2015-07-16 17:35:53 +02:00
Matthew Honnibal a9c3863665 * Fix inefficiency in StringStore.dump function 2015-07-16 17:34:32 +02:00
Matthew Honnibal b59d271510 * Move serialization functionality into Serializer class 2015-07-16 11:23:48 +02:00
Matthew Honnibal 30be4f15da * Import attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:23:25 +02:00
Matthew Honnibal 6c99e5f4aa * Move serialization into Serializer class, with __call__ and train() api 2015-07-16 11:22:35 +02:00
Matthew Honnibal e2133d990e * Move serialization functionality out into a Serializer object 2015-07-16 11:21:44 +02:00
Matthew Honnibal a6d040bd11 * Import Lexeme attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:20:08 +02:00
Matthew Honnibal d8bc279e0c * Fix 'you' contraction capitals in specials.json 2015-07-16 01:28:32 +02:00