Commit Graph

365 Commits

Author SHA1 Message Date
Matthew Honnibal 327383e38a * Remove unused code in tagger.pyx 2014-12-07 22:16:17 +11:00
Matthew Honnibal 8f2f319c57 * Add a couple more contractions tests 2014-12-07 22:08:04 +11:00
Matthew Honnibal 9f17467c2e * Fix EMPTY_TOKEN 2014-12-07 22:07:41 +11:00
Matthew Honnibal 3819a88e1b * Add support for tag dictionary, and fix error-code for predict method 2014-12-07 22:07:16 +11:00
Matthew Honnibal f00afe12c4 * Load POS tagger in load() function if path exists 2014-12-07 22:05:57 +11:00
Matthew Honnibal 677e111ee7 * Revise tokenization rules to match PTB. Rules are pretty messy around periods, need better support for these. 2014-12-07 22:04:47 +11:00
Matthew Honnibal 5fe5e6e66b * Move context functions to header, inlining them. 2014-12-07 21:59:04 +11:00
Matthew Honnibal 91e8d9ea1c * Compile context.pyx and tagger.pyx modules 2014-12-07 15:29:54 +11:00
Matthew Honnibal 5caabec789 * Link in tagger, to work on integrating POS tagging 2014-12-07 15:29:41 +11:00
Matthew Honnibal 0c7aeb9de7 * Begin revising tagger, focussing on POS tagging 2014-12-07 15:29:04 +11:00
Matthew Honnibal f5c4f2eb52 * Revise context, focussing on POS tagging for now 2014-12-07 15:28:22 +11:00
Matthew Honnibal e27b912ef9 * Remove need for confusing _data pointer to be stored on Tokens 2014-12-05 16:31:30 +11:00
Matthew Honnibal 1c9253701d * Introduce a TokenC struct, to handle token indices, pos tags and sense tags 2014-12-05 15:56:14 +11:00
Matthew Honnibal 187372c7f3 * Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached 2014-12-05 03:29:50 +11:00
Matthew Honnibal 75b8dfb348 * Remove upper_pc from lexeme.pyx 2014-12-04 22:14:34 +11:00
Matthew Honnibal a14f9eaf63 * Add index.pyx to setup 2014-12-04 22:14:11 +11:00
Matthew Honnibal 49f3780ff5 * Fiddle with lexeme attrs 2014-12-04 21:22:38 +11:00
Matthew Honnibal 564082e48e * Hack Token class to take lex.dense in place of the old lex.norm. This needs to be fixed... 2014-12-04 20:51:29 +11:00
Matthew Honnibal 69bb022204 * Add as_array and count_by method 2014-12-04 20:46:55 +11:00
Matthew Honnibal e1b1f45cc9 * Add STEM attribute to lexeme 2014-12-04 20:46:20 +11:00
Matthew Honnibal d7952634ca * Make the string-store serve const pointers to Utf8Str 2014-12-03 16:01:47 +11:00
Matthew Honnibal 7e04c22f8f * const added to Lexicon interface. Seems to work. 2014-12-03 15:58:17 +11:00
Matthew Honnibal d70d31aa45 * Introduce first attempt at const-ness 2014-12-03 15:44:25 +11:00
Matthew Honnibal d0d812c548 * Hack setup.py to exclude tagger stuff 2014-12-03 11:06:57 +11:00
Matthew Honnibal 4560ada85b * Add typedef for attr_t. Change flag_t to flags_t 2014-12-03 11:06:31 +11:00
Matthew Honnibal e600f7b327 * Move String struct stuff into the utf8string module, from spacy.lang 2014-12-03 11:06:00 +11:00
Matthew Honnibal e170faf5b0 * Hack Tokens to work without tagger.pyx 2014-12-03 11:05:15 +11:00
Matthew Honnibal b463a7eb86 * Make flag-setting a language-specific thing 2014-12-03 11:04:32 +11:00
Matthew Honnibal 71b009e323 * Fix bug in refactored StringStore.__getitem__ 2014-12-03 11:02:24 +11:00
Matthew Honnibal 14097311ae * Make StringStore.__getitem__ accept unicode-typed keys. 2014-12-03 01:33:20 +11:00
Matthew Honnibal 522bb0346e * Work on get_array method of Tokens 2014-12-02 23:48:05 +11:00
Matthew Honnibal 8c2938fe01 * Rename Lexicon._dict to Lexicon._map 2014-12-02 23:46:59 +11:00
Matthew Honnibal 2ee8a1e61f * Make intro chattier, explain philosophy better 2014-12-02 15:20:18 +11:00
Matthew Honnibal ea19850a69 * Add tokenizer section 2014-12-02 04:39:12 +11:00
Matthew Honnibal 3430d5f629 * Revise intro copy. Add NLTK comparison 2014-12-01 22:55:13 +11:00
Matthew Honnibal 33dfb4933c * Remove taggers from Language class. Work on doc strings 2014-11-26 19:53:55 +11:00
Matthew Honnibal cf55b48ba6 * Switch to predict label on shift. Big increase in accuracy. 2014-11-12 23:50:12 +11:00
Matthew Honnibal 8f84e8a78b * Neaten oracle 2014-11-12 23:38:07 +11:00
Matthew Honnibal 66cb4f96e1 * Upd gitignore 2014-11-12 23:25:27 +11:00
Matthew Honnibal 60c1e78596 * Commit outstanding tests 2014-11-12 23:24:32 +11:00
Matthew Honnibal 7e0a9077dd * Add context files 2014-11-12 23:22:36 +11:00
Matthew Honnibal 9b13392ac7 * Add conll experiments 2014-11-12 23:22:05 +11:00
Matthew Honnibal b934bf1c69 * Compile IOB 2014-11-12 23:21:40 +11:00
Matthew Honnibal 3b0b902384 * IOB-style parsing working. Accuracy down from BILOU, from 87-88 to 85-86 2014-11-12 23:21:09 +11:00
Matthew Honnibal e6bb8aa3a9 * Move moves to bilou_moves. Refactor context, returning to the simpler giant-enum style 2014-11-12 00:54:50 +11:00
Matthew Honnibal c788633429 * Add tokens_from_list method to Language 2014-11-11 23:43:14 +11:00
Matthew Honnibal da70b6bd60 * Upd tokenization special-cases 2014-11-11 22:10:15 +11:00
Matthew Honnibal 95282d4993 * Use the dynamic oracle 'follow' strategy 2014-11-11 21:11:17 +11:00
Matthew Honnibal 60ffdc2eb7 * Upd fabfile 2014-11-11 21:10:40 +11:00
Matthew Honnibal d5e9dce039 * Compile ner NER code 2014-11-11 21:10:22 +11:00