Commit Graph

50 Commits

Author SHA1 Message Date
Matthew Honnibal 59b41a9fd3 * Switch to new data model, tests passing 2014-10-10 08:11:31 +11:00
Matthew Honnibal b15619e170 * Use PointerHash instead of locally provided _hashing module 2014-09-25 18:23:35 +02:00
Matthew Honnibal 6266cac593 * Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks 2014-09-17 20:02:26 +02:00
Matthew Honnibal 0152831c89 * Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 18:01:46 +02:00
Matthew Honnibal 143e51ec73 * Refactor tokenization, splitting it into a clearer life-cycle. 2014-09-16 13:16:02 +02:00
Matthew Honnibal 7959141d36 * Add a few abbreviations, to get tests to pass 2014-09-15 06:32:18 +02:00
Matthew Honnibal df24e3708c * Move EnglishTokens stuff to Tokens 2014-09-15 01:31:44 +02:00
Matthew Honnibal 0447279c57 * PointerHash working, efficiency is good. 6-7 mins 2014-09-13 16:43:59 +02:00
Matthew Honnibal 85d68e8e95 * Replaced cache with own hash table. Similar timing 2014-09-13 03:14:43 +02:00
Matthew Honnibal 126a8453a5 * Fix performance issues by implementing a better cache. Add own String struct to help 2014-09-12 23:50:37 +02:00
Matthew Honnibal 9298e36b36 * Move special tokenization into its own lookup table, away from the cache. 2014-09-12 19:43:14 +02:00
Matthew Honnibal 985bc68327 * Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation. 2014-09-12 18:26:26 +02:00
Matthew Honnibal 5aa591106b * Fiddle with token features 2014-09-12 15:49:36 +02:00
Matthew Honnibal 1533041885 * Update the split_one method, so that it doesn't need to cast back to a Python object 2014-09-12 05:10:59 +02:00
Matthew Honnibal 4817277d66 * Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery. 2014-09-12 04:29:09 +02:00
Matthew Honnibal 073ee0de63 * Restore dense_hash_map for cache dictionary. Seems to double efficiency 2014-09-12 02:23:51 +02:00
Matthew Honnibal 1a3222af4b * Moving tokens to use an array internally, instead of a list of Lexeme objects. 2014-09-11 16:57:08 +02:00
Matthew Honnibal 7c09c73a14 * Refactor to use tokens class. 2014-09-10 18:27:44 +02:00
Matthew Honnibal cf412adba8 * Refactoring to use Tokens object 2014-09-10 18:11:13 +02:00
Matthew Honnibal dcab14ede2 * Begin testing more functionality 2014-08-30 19:01:15 +02:00
Matthew Honnibal 45a22d6b2c * Docs coming together 2014-08-29 01:59:23 +02:00
Matthew Honnibal c282e6d5fb * Redesign proceeding 2014-08-28 19:45:09 +02:00
Matthew Honnibal fdaf24604a * Basic punct tests updated and passing 2014-08-27 19:38:57 +02:00
Matthew Honnibal 8d20617dfd * Whitespace 2014-08-27 17:16:16 +02:00
Matthew Honnibal e9a62b6eba * Refactoring with Lexeme as a class now compiles. Basic design seems to work 2014-08-27 17:15:39 +02:00
Matthew Honnibal 68bae2fec6 * More refactoring 2014-08-25 16:42:22 +02:00
Matthew Honnibal 88095666dc * Remove Lexeme struct, preparing to rename Word to Lexeme. 2014-08-24 19:24:42 +02:00
Matthew Honnibal 3b793cf4f7 * Tests passing for new Word object version 2014-08-24 18:13:53 +02:00
Matthew Honnibal 782806df08 * Moving to Word objects in place of the Lexeme struct. 2014-08-22 17:28:23 +02:00
Matthew Honnibal e289896603 * Fix ptb3 module 2014-08-22 16:36:17 +02:00
Matthew Honnibal 07ecf5d2f4 * Fixed group_by, removed idea of general attr_of function. 2014-08-22 00:02:37 +02:00
Matthew Honnibal 811b7a6b91 * Struggling with arbitrary attr access... 2014-08-21 23:49:14 +02:00
Matthew Honnibal 314658b31c * Improve module docstring 2014-08-21 18:42:47 +02:00
Matthew Honnibal 248cbb6d07 * Update doc strings 2014-08-21 03:29:15 +02:00
Matthew Honnibal a78ad4152d * Broken version being refactored for docs 2014-08-20 13:39:39 +02:00
Matthew Honnibal 5fddb8d165 * Working refactor, with updated data model for Lexemes 2014-08-19 04:21:20 +02:00
Matthew Honnibal 3379d7a571 * Reforming data model for lexemes 2014-08-19 02:40:37 +02:00
Matthew Honnibal 01469b0888 * Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word. 2014-08-18 19:14:00 +02:00
Matthew Honnibal a225ca5b0d * Refactoring tokenizer 2014-08-16 03:22:03 +02:00
Matthew Honnibal a895fe5ddb * Upd from spacy 2014-07-23 17:35:18 +01:00
Matthew Honnibal 87bf205b82 * Fix open apostrophe bug 2014-07-07 23:26:01 +02:00
Matthew Honnibal 057c21969b * Refactor for string view features. Working on setting up flags and enums. 2014-07-07 16:58:48 +02:00
Matthew Honnibal f1bcbd4c4e * Reorganized code to accommodate Tokens class. Need string views before group_by and count_by can be done well. 2014-07-07 12:47:21 +02:00
Matthew Honnibal ff1869ff07 * Fixed major efficiency problem, from not quite grokking pass by reference in cython c++ 2014-07-07 07:36:43 +02:00
Matthew Honnibal d5bef02c72 * Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals 2014-07-07 04:21:06 +02:00
Matthew Honnibal a62c38e1ef * Working tokenization. en doesn't match PTB perfectly. Need to reorganize before adding more schemes. 2014-07-07 01:15:59 +02:00
Matthew Honnibal 4e79446dc2 * Reading in tokenization rules correctly. Passing tests. 2014-07-07 00:02:55 +02:00
Matthew Honnibal 72159e7011 * Fixes to tokenization. Now segment sequences of the same punctuation. 2014-07-06 19:28:42 +02:00
Matthew Honnibal e98e97d483 * Possessive test passing 2014-07-06 18:35:55 +02:00
Matthew Honnibal 556f6a18ca * Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc. 2014-07-05 20:51:42 +02:00