Commit Graph

85 Commits

Author SHA1 Message Date
Matthew Honnibal 2348a08481 * Load/dump strings with a json file, instead of the hacky strings file we were using. 2015-10-22 21:13:03 +11:00
Matthew Honnibal 7a15d1b60c * Add Python 2/3 compatibility fix for copy_reg 2015-10-13 20:04:40 +11:00
Matthew Honnibal 20fd36a0f7 * Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve. 2015-10-13 13:44:41 +11:00
Matthew Honnibal f8de403483 * Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125 2015-10-13 13:44:41 +11:00
Matthew Honnibal 85e7944572 * Start trying to pickle Vocab 2015-10-13 13:44:41 +11:00
Matthew Honnibal 41012907a8 * Fix variable name 2015-10-13 13:44:40 +11:00
Matthew Honnibal 37b909b6b6 * Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd 2015-10-13 13:44:40 +11:00
Matthew Honnibal d70e8cac2c * Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore 2015-10-13 13:44:40 +11:00
Matthew Honnibal a29c8ee23d * Add symbols to the vocab before reading the strings, so that they line up correctly 2015-10-13 13:44:39 +11:00
Matthew Honnibal 85ce36ab11 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-13 13:44:39 +11:00
Matthew Honnibal 83dccf0fd7 * Use io module insteads of deprecated codecs module 2015-10-10 14:13:01 +11:00
Matthew Honnibal 3d9f41c2c9 * Add LookupError for better error reporting in Vocab 2015-10-06 10:34:59 +11:00
alvations 8caedba42a caught more codecs.open -> io.open 2015-09-30 20:20:09 +02:00
Matthew Honnibal abf0d930af * Fix API for loading word vectors from a file. 2015-09-23 23:51:08 +10:00
Matthew Honnibal f7283a5067 * Fix vectors bugs for OOV words 2015-09-22 02:10:25 +02:00
Matthew Honnibal ac459278d1 * Fix vector length error reporting, and ensure vec_len is returned 2015-09-21 18:08:32 +10:00
Matthew Honnibal ba4e563701 * Ensure vectors are same length, and return vector length in load_vectors_bz2 2015-09-21 18:03:08 +10:00
Matthew Honnibal d6945bf880 * Add way to load vectors from bz2 file to vocab 2015-09-17 12:58:23 +10:00
Matthew Honnibal 3d87519f64 * Remove vectors argument from Vocab object 2015-09-15 14:47:14 +10:00
Matthew Honnibal 27f988b167 * Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects. 2015-09-15 14:41:48 +10:00
Matthew Honnibal e9c59693ea * Remove assertion from vocab.pyx 2015-09-13 10:30:08 +10:00
Matthew Honnibal e1dfaeed8a * Check serializer freqs exist before loading 2015-09-12 23:49:38 +02:00
Matthew Honnibal a412c66c8c * Check serializer freqs exist before loading 2015-09-12 23:40:01 +02:00
Matthew Honnibal e285ca7d6c * Load serializer freqs in vocab 2015-09-10 15:22:48 +02:00
Matthew Honnibal 094440f9f5 Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop 2015-09-10 14:51:17 +02:00
Matthew Honnibal 90da3a695d * Load lemmatizer from disk in Vocab.from_dir 2015-09-10 14:49:10 +02:00
Matthew Honnibal f634191e27 * Fix vocab read/write 2015-09-10 14:44:38 +02:00
Matthew Honnibal a7f4b26c8c * Tmp 2015-09-09 14:33:26 +02:00
Matthew Honnibal d6561988cf * Fix lexemes.bin 2015-09-09 11:49:51 +02:00
Matthew Honnibal c301bebd33 Merge branch 'master' of https://github.com/honnibal/spaCy into develop 2015-09-09 10:55:39 +02:00
Matthew Honnibal 623329b19a Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop 2015-09-08 14:27:01 +02:00
Matthew Honnibal 62a01dd41d * Fix issue #92: lexemes.bin read error on 32-bit platforms. 2015-09-08 14:23:58 +02:00
Matthew Honnibal f6ec5bf1b0 * Use empty tag map in vocab if none supplied 2015-09-06 20:19:27 +02:00
Matthew Honnibal 534e3dda3c * More work on language independent parsing 2015-08-28 03:44:54 +02:00
Matthew Honnibal c2307fa9ee * More work on language-generic parsing 2015-08-28 02:02:33 +02:00
Matthew Honnibal 1302d35dff * Rework interfaces in vocab 2015-08-26 19:21:46 +02:00
Matthew Honnibal 6f1743692a * Work on language-independent refactoring 2015-08-23 20:49:18 +02:00
Matthew Honnibal cad0cca4e3 * Tmp 2015-08-22 22:04:34 +02:00
Matthew Honnibal 3d43f49f69 * Revert prev change 2015-07-27 10:58:15 +02:00
Matthew Honnibal 6b586cdad4 * Change lexemes.bin format. Add a header specifying size of LexemeC and number of lexemes, and don't have the redundant orth information. 2015-07-27 08:31:51 +02:00
Matthew Honnibal 8e4c69ee8c * Add is_oov property, and fix up handling of attributes 2015-07-27 01:50:06 +02:00
Matthew Honnibal fc268f03eb * Assert against null pointer exceptions in vocab 2015-07-27 01:00:10 +02:00
Matthew Honnibal 0f093fdb30 * Fix get_by_orth for py3 2015-07-26 19:26:41 +02:00
Matthew Honnibal ceeda5a739 * Fix get_by_orth for py3 2015-07-26 18:39:27 +02:00
Matthew Honnibal 6bb96c122d * Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects 2015-07-26 16:37:16 +02:00
Matthew Honnibal 7eb2446082 * Return empty lexeme on empty string 2015-07-26 00:18:30 +02:00
Matthew Honnibal fd525f0675 * Pass OOV probability around 2015-07-25 23:29:51 +02:00
Matthew Honnibal 22028602a9 * Add unicode_literals declaration in vocab.pyx 2015-07-23 13:24:20 +02:00
Matthew Honnibal a7c4d72e83 * Add serializer property to Vocab, and lazy-load it. Add get_by_orth method. 2015-07-23 01:18:19 +02:00
Matthew Honnibal 109106a949 * Replace UniStr, using unicode objects instead 2015-07-22 04:52:05 +02:00