Commit Graph

966 Commits

Author SHA1 Message Date
Matthew Honnibal c27514512b * Remove cruft ner/ directory 2015-07-17 23:24:32 +02:00
Matthew Honnibal f8d6d319f4 * Remove cruft module 2015-07-17 23:23:05 +02:00
Matthew Honnibal fb0a641a2d * Don't release the gil around Parser.parse. Does this indicate thread problems? 2015-07-17 23:07:37 +02:00
Matthew Honnibal e29daea85f * Fix bint/int typing problem in TransitionSystem. In C++ bint* means bool*, but in C it means int*. So, type-casting to bint* is unsafe. 2015-07-17 22:37:24 +02:00
Matthew Honnibal cf0c788892 * Tests passing on round-trip pack/unpack on basic example 2015-07-17 21:20:48 +02:00
Matthew Honnibal 44f39a876f * Add a blank attrs.pyx 2015-07-17 16:40:42 +02:00
Matthew Honnibal c2c83120d4 * Remove codec property from Vocab 2015-07-17 16:40:11 +02:00
Matthew Honnibal dfdf19f6a9 * Draft a from_orth method for Doc 2015-07-17 16:39:54 +02:00
Matthew Honnibal 9e3f17051b * Move to ORTH instead of ID for encoding lexemes. Basic tests of the codec wrappers now passing 2015-07-17 16:38:29 +02:00
Matthew Honnibal 15ff739996 * Fix passing of ID attribute in string store 2015-07-17 14:49:42 +02:00
Matthew Honnibal 95e57c2780 * Remove unnecessary key and id properties from Utf8String. 2015-07-17 01:40:18 +02:00
Matthew Honnibal 234c7e440a * Add spacy/serialize/__init__ files 2015-07-17 01:37:33 +02:00
Matthew Honnibal db9dfd2e23 * Major refactor of serialization. Nearly complete now. 2015-07-17 01:27:54 +02:00
Matthew Honnibal c8282f9934 * Work on serialization. Needs more reorganisation 2015-07-16 19:56:02 +02:00
Matthew Honnibal d8458d6a25 * Fix attr_id_t import in Spans 2015-07-16 19:55:21 +02:00
Matthew Honnibal d1cb30dbc4 * Remove unnecessary key and id properties from Utf8String. 2015-07-16 19:29:02 +02:00
Matthew Honnibal 897de2d438 * Add 'bitter' property for serializer in English class 2015-07-16 17:47:53 +02:00
Matthew Honnibal fb54052ae0 * Work on serializer design 2015-07-16 17:46:46 +02:00
Matthew Honnibal a6f401580d * Add from_array function to Doc. 2015-07-16 17:46:11 +02:00
Matthew Honnibal 2a5d050134 * Give codec loading back to Vocab. 2015-07-16 17:45:42 +02:00
Matthew Honnibal 8bf0f65f1c * Remove dead code in strings.pyx 2015-07-16 17:35:53 +02:00
Matthew Honnibal a9c3863665 * Fix inefficiency in StringStore.dump function 2015-07-16 17:34:32 +02:00
Matthew Honnibal b59d271510 * Move serialization functionality into Serializer class 2015-07-16 11:23:48 +02:00
Matthew Honnibal 30be4f15da * Import attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:23:25 +02:00
Matthew Honnibal 6c99e5f4aa * Move serialization into Serializer class, with __call__ and train() api 2015-07-16 11:22:35 +02:00
Matthew Honnibal e2133d990e * Move serialization functionality out into a Serializer object 2015-07-16 11:21:44 +02:00
Matthew Honnibal a6d040bd11 * Import Lexeme attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:20:08 +02:00
Matthew Honnibal 45ae1ce428 * Remove unused declaration in parser 2015-07-16 01:27:11 +02:00
Matthew Honnibal efa80096f1 * Upd attrs id list 2015-07-16 01:26:54 +02:00
Matthew Honnibal 01fab6bb90 * Improve de/serialize functions 2015-07-16 01:26:35 +02:00
Matthew Honnibal 0e07c1ed2a * draft de/serialization functions in doc.pyx 2015-07-16 01:16:33 +02:00
Matthew Honnibal 9d956b07e9 * Fix import of attrs in doc.pyx, and update the get_token_attr function. 2015-07-16 01:15:34 +02:00
Matthew Honnibal 65251e7625 * Remove redundant attr_id_t from typedefs.pxd 2015-07-16 00:58:51 +02:00
Matthew Honnibal 9a8db9743c * Remove gil from parser.call 2015-07-14 23:47:33 +02:00
Matthew Honnibal 38ca0c33f5 Merge branch 'neuralnet' into refactor
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.

Conflicts:
	setup.py
	spacy/syntax/parser.pxd
	spacy/syntax/parser.pyx
	spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal 935ac53ee3 * Extend count_by method 2015-07-14 03:20:09 +02:00
Matthew Honnibal 3b5baa660f * Fix tokenizer 2015-07-14 00:10:51 +02:00
Matthew Honnibal 2ae0b439b2 * Fix space check in gold.pyx 2015-07-14 00:10:27 +02:00
Matthew Honnibal 81aa4e6dcc * Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API 2015-07-14 00:10:11 +02:00
Matthew Honnibal 24d6ce99ec * Add comment to tokenizer, explaining the spacy attr 2015-07-13 22:29:13 +02:00
Matthew Honnibal 8214b74eec * Restore _py_tokens cache, to handle orphan tokens. 2015-07-13 22:28:10 +02:00
Matthew Honnibal 67641f3b58 * Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string 2015-07-13 21:46:02 +02:00
Matthew Honnibal 6eef0bf9ab * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx 2015-07-13 20:20:58 +02:00
Matthew Honnibal 3ea8756c24 * Add spacy/tokens/doc.pyx, for Doc class in its own file 2015-07-13 19:58:26 +02:00
Matthew Honnibal c99387155f * Refactor tokens, moving classes into a module instead of a single file 2015-07-13 19:49:55 +02:00
Matthew Honnibal d27899658e * Import classes in spacy.tokens.__init__ 2015-07-13 19:48:55 +02:00
Matthew Honnibal aa82caf8f5 * Add TokenC.spacy attr 2015-07-13 19:48:07 +02:00
Matthew Honnibal dba6b47d4e * Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference 2015-07-13 19:20:48 +02:00
Matthew Honnibal 5b0a7190c9 * Round-trip for serialization finally working. Needs a lot of optimization. 2015-07-13 18:39:38 +02:00
Matthew Honnibal edd371246c * Make huffman coder take BitArray in encode/decode. Add __iter__ method to BitArray. 2015-07-13 17:33:33 +02:00
Matthew Honnibal af5cc926a4 * Add codec property to Vocab, to use the Huffman encoding 2015-07-13 13:55:14 +02:00
Matthew Honnibal 77385d5580 * Make .pxd file for huffman codec 2015-07-13 13:54:51 +02:00
Matthew Honnibal 083b6ea7ae * Clean up encoder a bit. now read for integration into Vocab. 2015-07-13 12:57:22 +02:00
Matthew Honnibal 8d0f1d98da * Draft dockstring for HuffmanCache 2015-07-13 12:01:18 +02:00
Matthew Honnibal 281f1faefb * Nearly finished huffman coder 2015-07-12 23:48:46 +02:00
Matthew Honnibal e1a25fba32 * Work on huffman coder 2015-07-12 19:58:05 +02:00
Matthew Honnibal 3fb9de2d13 * Remove vector[bint], in favor of simple Code struct. 2015-07-12 17:58:27 +02:00
Matthew Honnibal aa7bfd932b * Work on compressor 2015-07-12 16:03:43 +02:00
Matthew Honnibal 14eafcab15 * Refactor to use vector[bint] 2015-07-12 05:27:47 +02:00
Matthew Honnibal 6a6e852a39 * Refactor huffman coding stuff into class 2015-07-12 05:06:36 +02:00
Matthew Honnibal aad96fdb5c * Improve efficiency of huffman coding 2015-07-12 01:31:37 +02:00
Matthew Honnibal ff9ff6f3fa * Ensure unseen words are given low log probability 2015-07-12 01:31:09 +02:00
Matthew Honnibal 9d3b0d83de * Refactor huffman coding 2015-07-11 22:27:43 +02:00
Matthew Honnibal 8d29406cd6 * Rename span.right to span.rights 2015-07-11 22:15:04 +02:00
Matthew Honnibal da9f358166 * Fix span getting 2015-07-11 21:41:41 +02:00
Matthew Honnibal 11e8f2ffb4 * Huffman codes working 2015-07-11 20:01:10 +02:00
Matthew Honnibal cb6fc81909 * Work on huffman coding. 2015-07-11 15:23:35 +02:00
Matthew Honnibal 4c9b77fe95 * Begin working on serialization code 2015-07-11 10:57:30 +02:00
Matthew Honnibal 53d1f5b2eb * Rename Span.head to Span.root. 2015-07-09 17:30:58 +02:00
Matthew Honnibal c0255ed7d8 * Allow slice indexing in Doc.__getitem__, returning a Span object 2015-07-09 15:15:32 +02:00
Matthew Honnibal 89a91ad726 * Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity 2015-07-09 13:30:41 +02:00
Matthew Honnibal 55f1042443 * Improve efficiency of L and R features, correcting the non-linear-in-length problem. 2015-07-09 12:17:26 +02:00
Matthew Honnibal 70d2acb579 * Fix edge features 2015-07-09 12:15:01 +02:00
Matthew Honnibal adb868bdad * Add warning for models not found in parser 2015-07-08 20:04:55 +02:00
Matthew Honnibal 05b28ec9eb * Add warning for models not found in parser 2015-07-08 20:02:13 +02:00
Matthew Honnibal ef700401a6 * Add warning for models not found in parser 2015-07-08 20:00:46 +02:00
Matthew Honnibal 6218d8b389 * Add warning for models not found in parser 2015-07-08 19:59:16 +02:00
Matthew Honnibal f6a6c39ce8 * Add warning for models not found in parser 2015-07-08 19:52:30 +02:00
Matthew Honnibal 78db7e32f7 * Remove has_sense method from Lexeme declaration 2015-07-08 19:41:20 +02:00
Matthew Honnibal 6ddb2f5e45 * Restore merge_mwe in English class 2015-07-08 19:35:30 +02:00
Matthew Honnibal 6859f6adac * Restore merge_mwe in English class 2015-07-08 19:34:55 +02:00
Matthew Honnibal 3c270fc8ff * Remove has_sense method from Lexeme 2015-07-08 19:28:29 +02:00
Matthew Honnibal b64c843861 * Remove senses attr 2015-07-08 19:26:24 +02:00
Matthew Honnibal 1d3a592edf * Remove the senses attr from LexemeC, to keep data compatibility 2015-07-08 19:24:44 +02:00
Matthew Honnibal 0ceb1f71c2 * Update parse features 2015-07-08 19:11:36 +02:00
Matthew Honnibal 2e51b5027a * Alias Doc to Tokens, for backwards compatibility 2015-07-08 18:59:35 +02:00
Matthew Honnibal e3c53f5ecd * Fix mention of Tokens in docstring 2015-07-08 18:56:27 +02:00
Matthew Honnibal bb522496dd * Rename Tokens to Doc 2015-07-08 18:53:00 +02:00
Matthew Honnibal b24e8be2b9 * Whitespace in docstring 2015-07-08 12:37:03 +02:00
Matthew Honnibal abc43b852d * Add pos_tags attr to Vocab. 2015-07-08 12:36:38 +02:00
Matthew Honnibal 935bcdf3e5 * Remove redundant tag_names argument to Tokenizer 2015-07-08 12:36:04 +02:00
Matthew Honnibal ff885e8511 * Add ParserFactory convenience function 2015-07-08 12:35:46 +02:00
Matthew Honnibal 4e4fac452b * Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser. 2015-07-08 12:35:29 +02:00
Matthew Honnibal 1d2deb4616 * Work on refactoring default arguments to English.__init__ 2015-07-07 15:53:25 +02:00
Matthew Honnibal 2d0e99a096 * Pass pos_tags into Tokenizer.from_dir 2015-07-07 14:23:08 +02:00
Matthew Honnibal 6788c86b2f * Begin refactor 2015-07-07 14:00:07 +02:00
Matthew Honnibal 52fd80c6c6 * Add experimental supersense features for parsing, based on lookup into wordnet. 2015-07-01 20:12:44 +02:00
Matthew Honnibal e6d828a9af * Set up an array POS_SENSES that denotes the set of valid senses for each POS tag. This way, we can do bitwise & between a lexeme's senses and the ones available for its POS tag, to get the allowable senses for the token. 2015-07-01 20:12:13 +02:00
Matthew Honnibal 2b8459d9a8 * Add senses flag to Lexeme 2015-07-01 20:10:41 +02:00
Matthew Honnibal e23d1582a2 * Add supersense data to Lexeme objects. Add simple has_sense method to check the flag. 2015-07-01 18:50:37 +02:00
Matthew Honnibal 64fafa98be * Add senses.pyx and senses.pxd 2015-07-01 18:49:44 +02:00
Matthew Honnibal 94dab94e5f uerge branch 'master' of https://github.com/honnibal/spaCy 2015-06-30 18:16:26 +02:00
Matthew Honnibal 9af86b0b0b * Fix attrs.pxd 2015-06-30 18:16:30 +02:00
Matthew Honnibal af9c82f7a6 Merge branch 'master' of https://github.com/honnibal/spaCy 2015-06-30 18:11:37 +02:00
Matthew Honnibal 5d595b5a8c * Inc versions 2015-06-30 18:11:06 +02:00
Matthew Honnibal d2eeba6667 * Start wiring up color and emotion lexicons. Hopefully we get to use them. 2015-06-30 16:22:23 +02:00
Matthew Honnibal e20106fdff * Begin reorganizing neuralnet work 2015-06-30 14:26:32 +02:00
Matthew Honnibal 5cd3ed42d4 * Reenable averaging 2015-06-29 16:44:42 +02:00
Matthew Honnibal 894cbef8ba * Wire eta and mu parameters up for neural net 2015-06-29 07:10:33 +02:00
Matthew Honnibal 3bb5876c5a * Inline methods in StateClass 2015-06-29 01:10:14 +02:00
Matthew Honnibal 313a7f87b3 * Inline methods in StateClass 2015-06-29 01:06:28 +02:00
Matthew Honnibal a02fd3af5d * Check valency in L and R feature methods, to make feaure calculation faster 2015-06-29 00:27:56 +02:00
Matthew Honnibal 5d870720bc * Check valency in L and R feature methods, to make feaure calculation faster 2015-06-29 00:17:29 +02:00
Matthew Honnibal f4986d5d3c * Use new Example class 2015-06-28 22:36:03 +02:00
Matthew Honnibal 735f1af91f * Fix neural net stuff 2015-06-28 11:44:58 +02:00
Matthew Honnibal e7003f1cf3 * Remove hard-coding of vector lengths 2015-06-28 11:37:17 +02:00
Matthew Honnibal 897dd0dd0b * Merge changes, and adjust Example to use memoryview 2015-06-28 11:36:11 +02:00
Matthew Honnibal 9282a8e72c * Prepare for new models to be plugged in by using Example class 2015-06-28 11:02:35 +02:00
Matthew Honnibal 75aeccc064 * Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search 2015-06-28 11:02:34 +02:00
Matthew Honnibal bf33598b34 * Work on a theano-driven model for the parser 2015-06-28 11:02:34 +02:00
Matthew Honnibal bbef71f213 * Fix min function in fill_context 2015-06-28 10:46:39 +02:00
Matthew Honnibal 142b6f9510 * Revert last changes 2015-06-28 10:44:28 +02:00
Matthew Honnibal b06962f18b * Pad buffers in state 2015-06-28 10:36:14 +02:00
Matthew Honnibal 53be72387c * Hack at fill_context to investigate performance loss 2015-06-28 10:34:28 +02:00
Matthew Honnibal 71a4e876a9 * Fix parse features 2015-06-28 09:27:33 +02:00
Matthew Honnibal 0c4b5a2bb0 * Start scoring tokens 2015-06-28 06:21:38 +02:00
Matthew Honnibal 5af500909c * Remove unused directve from parser.pyx 2015-06-28 06:20:21 +02:00
Matthew Honnibal d5b4090705 * Add profile directive 2015-06-28 06:19:33 +02:00
Matthew Honnibal 2b5421e60c * Add profile directive 2015-06-28 06:07:04 +02:00
Matthew Honnibal 8b5de4a411 * Add word / tag / label sets, for use in neural net 2015-06-28 05:46:53 +02:00
Matthew Honnibal cfcbd8d256 * Fix punctuation eval in scorer.py 2015-06-28 01:31:39 +02:00
Matthew Honnibal ed40a8380e * Remove hard-coding of vector lengths 2015-06-27 04:18:47 +02:00
Matthew Honnibal ebe630cc8d * Enable more features for NN 2015-06-27 04:17:29 +02:00
Matthew Honnibal f8bb43475e * Bridge to Theano working. Very disorganised. Using thinc adb60aba966ed2 2015-06-27 02:39:18 +02:00
Matthew Honnibal 2fe98b8a9a * Prepare for new models to be plugged in by using Example class 2015-06-26 13:51:39 +02:00
Matthew Honnibal 6896455884 * Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search 2015-06-26 06:25:36 +02:00
Matthew Honnibal b266a63f2c * Inc version of downloadble data 2015-06-24 04:53:08 +02:00
Matthew Honnibal 02b171ee67 * Bug fixes to edge calculation 2015-06-24 04:28:02 +02:00
Matthew Honnibal a4e9bdf4c1 * Work on a theano-driven model for the parser 2015-06-24 01:02:40 +02:00
Matthew Honnibal 7f9384f53c * Remove deprecated _state module 2015-06-23 17:28:24 +02:00
Matthew Honnibal 6dbe182491 * Fix merge conflicts 2015-06-23 17:28:00 +02:00
Matthew Honnibal 579735a095 * Remove import of _state module 2015-06-23 17:25:08 +02:00
Matthew Honnibal 88f55d136b * Remove deprecated _state module 2015-06-23 17:19:51 +02:00
Matthew Honnibal 9ab9dd2bf7 * Clean up unused orig_arc_eager and tree_arc_eager modules, which were only added for EMNLP experiments 2015-06-23 17:17:33 +02:00
Matthew Honnibal 7ebfe4b983 * Fixes to edge features 2015-06-23 16:32:54 +02:00
Matthew Honnibal 7b125f5a86 * Fixes to edge features 2015-06-23 16:31:01 +02:00
Matthew Honnibal 8d4bbacfc5 * Fix edge navigation in Token objects 2015-06-23 16:07:34 +02:00
Matthew Honnibal 35c290bee4 * Fix edge features 2015-06-23 15:50:56 +02:00
Matthew Honnibal 221e2e485f * Assign 'ROOT' as label, not 'root' 2015-06-23 15:09:54 +02:00
Matthew Honnibal a7bf7b0626 * Rename sent_start to sent_end, to reflect its new usage in the Break transition 2015-06-23 05:39:43 +02:00