Commit Graph

46 Commits

Author SHA1 Message Date
Matthew Honnibal 9959a64f7b * Working morphology and lemmatisation. POS tagging quite fast. 2014-12-10 08:09:32 +11:00
Matthew Honnibal ef4398b204 * Rearrange POS stuff, so that language-specific stuff can live in language-specific modules 2014-12-07 23:52:41 +11:00
Matthew Honnibal 49f3780ff5 * Fiddle with lexeme attrs 2014-12-04 21:22:38 +11:00
Matthew Honnibal e1b1f45cc9 * Add STEM attribute to lexeme 2014-12-04 20:46:20 +11:00
Matthew Honnibal d70d31aa45 * Introduce first attempt at const-ness 2014-12-03 15:44:25 +11:00
Matthew Honnibal b463a7eb86 * Make flag-setting a language-specific thing 2014-12-03 11:04:32 +11:00
Matthew Honnibal 50309e6e49 * Fix context vector, importing all features 2014-11-05 22:11:39 +11:00
Matthew Honnibal 70ea862703 * Remove vocab10k field, and add flags for gazetteers 2014-11-03 00:13:51 +11:00
Matthew Honnibal 8335706321 * Add LIKE_URL and LIKE_NUMBER flag features 2014-11-02 13:19:23 +11:00
Matthew Honnibal 6c807aa45f * Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries 2014-10-31 17:43:00 +11:00
Matthew Honnibal 87c2418a89 * Fiddle with data types on Lexeme, to compress them to a much smaller size. 2014-10-30 15:42:15 +11:00
Matthew Honnibal e6b87766fe * Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme 2014-10-30 15:21:38 +11:00
Matthew Honnibal 13909a2e24 * Rewriting Lexeme serialization. 2014-10-29 23:19:38 +11:00
Matthew Honnibal 08ce602243 * Large refactor, particularly to Python API 2014-10-24 00:59:17 +11:00
Matthew Honnibal e5e951ae67 * Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding. 2014-10-23 01:57:59 +11:00
Matthew Honnibal 0a0e41f6c8 * Add prefix and suffix features 2014-10-22 12:56:09 +11:00
Matthew Honnibal 65d3ead4fd * Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id 2014-10-14 15:19:07 +11:00
Matthew Honnibal 71ee921055 * Slight cleaning of tokenizer code 2014-10-10 19:17:22 +11:00
Matthew Honnibal 59b41a9fd3 * Switch to new data model, tests passing 2014-10-10 08:11:31 +11:00
Matthew Honnibal 1b0e01d3d8 * Revising data model of lexeme. Compiles. 2014-10-09 19:53:30 +11:00
Matthew Honnibal e40caae51f * Update Lexicon class to expect a list of lexeme dict descriptions 2014-10-09 14:51:35 +11:00
Matthew Honnibal 51d75b244b * Add serialize/deserialize functions for lexeme, transport to/from python dict. 2014-10-09 14:10:46 +11:00
Matthew Honnibal d73d89a2de * Add i attribute to lexeme, giving lexemes sequential IDs. 2014-10-09 13:50:05 +11:00
Matthew Honnibal ac522e2553 * Switch from own memory class to cymem, in pip 2014-09-17 23:09:24 +02:00
Matthew Honnibal 6266cac593 * Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks 2014-09-17 20:02:26 +02:00
Matthew Honnibal f77b7098c0 * Upd Tokens to use vector, with bounds checking. 2014-09-15 03:22:40 +02:00
Matthew Honnibal b488224c09 * Restoring Lexeme-as-struct 2014-09-10 20:41:37 +02:00
Matthew Honnibal 88095666dc * Remove Lexeme struct, preparing to rename Word to Lexeme. 2014-08-24 19:24:42 +02:00
Matthew Honnibal e289896603 * Fix ptb3 module 2014-08-22 16:36:17 +02:00
Matthew Honnibal 811b7a6b91 * Struggling with arbitrary attr access... 2014-08-21 23:49:14 +02:00
Matthew Honnibal d10993f41a * More docs work 2014-08-21 16:37:13 +02:00
Matthew Honnibal a78ad4152d * Broken version being refactored for docs 2014-08-20 13:39:39 +02:00
Matthew Honnibal 5fddb8d165 * Working refactor, with updated data model for Lexemes 2014-08-19 04:21:20 +02:00
Matthew Honnibal 3379d7a571 * Reforming data model for lexemes 2014-08-19 02:40:37 +02:00
Matthew Honnibal 01469b0888 * Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word. 2014-08-18 19:14:00 +02:00
Matthew Honnibal 515d41d325 * Restore string saving to spacy 2014-08-16 16:09:24 +02:00
Matthew Honnibal a225ca5b0d * Refactoring tokenizer 2014-08-16 03:22:03 +02:00
Matthew Honnibal d6e07aa922 * Switch to 32bit hash for strings 2014-08-02 21:51:52 +01:00
Matthew Honnibal 6319ff0f22 * Add length property 2014-08-02 21:26:44 +01:00
Matthew Honnibal 571808a274 Group-by seems to be working 2014-07-07 20:27:02 +02:00
Matthew Honnibal 80b36f9f27 * 710k words per second for counts 2014-07-07 19:12:19 +02:00
Matthew Honnibal 057c21969b * Refactor for string view features. Working on setting up flags and enums. 2014-07-07 16:58:48 +02:00
Matthew Honnibal f1bcbd4c4e * Reorganized code to accommodate Tokens class. Need string views before group_by and count_by can be done well. 2014-07-07 12:47:21 +02:00
Matthew Honnibal ff1869ff07 * Fixed major efficiency problem, from not quite grokking pass by reference in cython c++ 2014-07-07 07:36:43 +02:00
Matthew Honnibal d5bef02c72 * Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals 2014-07-07 04:21:06 +02:00
Matthew Honnibal 556f6a18ca * Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc. 2014-07-05 20:51:42 +02:00