Commit Graph

50 Commits

Author SHA1 Message Date
Matthew Honnibal d70d31aa45 * Introduce first attempt at const-ness 2014-12-03 15:44:25 +11:00
Matthew Honnibal b463a7eb86 * Make flag-setting a language-specific thing 2014-12-03 11:04:32 +11:00
Matthew Honnibal 8c2938fe01 * Rename Lexicon._dict to Lexicon._map 2014-12-02 23:46:59 +11:00
Matthew Honnibal c788633429 * Add tokens_from_list method to Language 2014-11-11 23:43:14 +11:00
Matthew Honnibal ff8989b63c * Use greedy NER parser 2014-11-11 21:08:35 +11:00
Matthew Honnibal 4ecbe8c893 * Complete refactor of Tagger features, to use a generic list of context names. 2014-11-05 20:45:29 +11:00
Matthew Honnibal 3733444101 * Generalize tagger code, in preparation for NER and supersense tagging. 2014-11-05 03:42:14 +11:00
Matthew Honnibal fcd9490d56 * Add pos_tag method to Language 2014-11-02 14:21:43 +11:00
Matthew Honnibal a8ca078b24 * Restore lexemes field to lexicon 2014-10-31 17:43:25 +11:00
Matthew Honnibal ea8f1e7053 * Tighten interfaces 2014-10-30 18:14:42 +11:00
Matthew Honnibal ea85bf3a0a * Tighten the interface to Language 2014-10-30 18:01:27 +11:00
Matthew Honnibal 87c2418a89 * Fiddle with data types on Lexeme, to compress them to a much smaller size. 2014-10-30 15:42:15 +11:00
Matthew Honnibal e6b87766fe * Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme 2014-10-30 15:21:38 +11:00
Matthew Honnibal 08ce602243 * Large refactor, particularly to Python API 2014-10-24 00:59:17 +11:00
Matthew Honnibal e5e951ae67 * Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding. 2014-10-23 01:57:59 +11:00
Matthew Honnibal 43743a5d63 * Work on efficiency 2014-10-14 18:22:41 +11:00
Matthew Honnibal 6fb42c4919 * Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang 2014-10-14 16:17:45 +11:00
Matthew Honnibal 868e558037 * Preparations in place to handle hyphenation etc 2014-10-10 20:23:23 +11:00
Matthew Honnibal 02e948e7d5 * Remove counts stuff from Language class 2014-10-10 19:25:01 +11:00
Matthew Honnibal 71ee921055 * Slight cleaning of tokenizer code 2014-10-10 19:17:22 +11:00
Matthew Honnibal d73d89a2de * Add i attribute to lexeme, giving lexemes sequential IDs. 2014-10-09 13:50:05 +11:00
Matthew Honnibal 096ef2b199 * Rename external hashing lib, from trustyc to preshed 2014-09-26 18:40:03 +02:00
Matthew Honnibal b15619e170 * Use PointerHash instead of locally provided _hashing module 2014-09-25 18:23:35 +02:00
Matthew Honnibal ac522e2553 * Switch from own memory class to cymem, in pip 2014-09-17 23:09:24 +02:00
Matthew Honnibal 6266cac593 * Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks 2014-09-17 20:02:26 +02:00
Matthew Honnibal 0152831c89 * Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 18:01:46 +02:00
Matthew Honnibal 143e51ec73 * Refactor tokenization, splitting it into a clearer life-cycle. 2014-09-16 13:16:02 +02:00
Matthew Honnibal 0bb547ab98 * Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance 2014-09-15 06:34:10 +02:00
Matthew Honnibal e68a431e5e * Pass only the tokens vector to _tokenize, instead of the whole python object. 2014-09-15 04:01:38 +02:00
Matthew Honnibal df24e3708c * Move EnglishTokens stuff to Tokens 2014-09-15 01:31:44 +02:00
Matthew Honnibal f3393cf57c * Improve interface for PointerHash 2014-09-13 17:29:58 +02:00
Matthew Honnibal 0447279c57 * PointerHash working, efficiency is good. 6-7 mins 2014-09-13 16:43:59 +02:00
Matthew Honnibal 85d68e8e95 * Replaced cache with own hash table. Similar timing 2014-09-13 03:14:43 +02:00
Matthew Honnibal a8e7cce30f * Efficiency tweaks 2014-09-13 00:14:05 +02:00
Matthew Honnibal 126a8453a5 * Fix performance issues by implementing a better cache. Add own String struct to help 2014-09-12 23:50:37 +02:00
Matthew Honnibal 9298e36b36 * Move special tokenization into its own lookup table, away from the cache. 2014-09-12 19:43:14 +02:00
Matthew Honnibal 985bc68327 * Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation. 2014-09-12 18:26:26 +02:00
Matthew Honnibal 4817277d66 * Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery. 2014-09-12 04:29:09 +02:00
Matthew Honnibal 8b20e9ad97 * Delete unused _split method 2014-09-12 04:03:52 +02:00
Matthew Honnibal a4863686ec * Changed cache to use a linked-list data structure, to take out Python list code. Taking 6-7 mins for gigaword. 2014-09-12 03:30:50 +02:00
Matthew Honnibal e096f30161 * Tweak signatures and refactor slightly. Processing gigaword taking 8-9 mins. Tests passing, but some sort of memory bug on exit. 2014-09-12 02:43:36 +02:00
Matthew Honnibal 073ee0de63 * Restore dense_hash_map for cache dictionary. Seems to double efficiency 2014-09-12 02:23:51 +02:00
Matthew Honnibal c8f7c8bfde * Moving to storing LexemeC structs internally 2014-09-11 21:54:34 +02:00
Matthew Honnibal 563047e90f * Switch to returning a Tokens object 2014-09-11 21:37:32 +02:00
Matthew Honnibal cf412adba8 * Refactoring to use Tokens object 2014-09-10 18:11:13 +02:00
Matthew Honnibal 45a22d6b2c * Docs coming together 2014-08-29 01:59:23 +02:00
Matthew Honnibal c282e6d5fb * Redesign proceeding 2014-08-28 19:45:09 +02:00
Matthew Honnibal fdaf24604a * Basic punct tests updated and passing 2014-08-27 19:38:57 +02:00
Matthew Honnibal e9a62b6eba * Refactoring with Lexeme as a class now compiles. Basic design seems to work 2014-08-27 17:15:39 +02:00
Matthew Honnibal 68bae2fec6 * More refactoring 2014-08-25 16:42:22 +02:00