Commit Graph

17 Commits

Author SHA1 Message Date
Matthew Honnibal b962fe73d7 * Make suffixes file use full-power regex, so that we can handle periods properly 2014-12-09 19:04:27 +11:00
Matthew Honnibal 302e09018b * Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas 2014-12-09 14:48:01 +11:00
Matthew Honnibal ea8f1e7053 * Tighten interfaces 2014-10-30 18:14:42 +11:00
Matthew Honnibal 67c8c8019f * Update lexeme serialization, using a binary file format 2014-10-30 01:01:00 +11:00
Matthew Honnibal 43d5964e13 * Add function to read detokenization rules 2014-10-22 12:54:59 +11:00
Matthew Honnibal 12742f4f83 * Add detokenize method and test 2014-10-18 18:07:29 +11:00
Matthew Honnibal 6fb42c4919 * Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang 2014-10-14 16:17:45 +11:00
Matthew Honnibal e40caae51f * Update Lexicon class to expect a list of lexeme dict descriptions 2014-10-09 14:51:35 +11:00
Matthew Honnibal 2e44fa7179 * Add util.py 2014-09-25 18:26:22 +02:00
Matthew Honnibal e9a62b6eba * Refactoring with Lexeme as a class now compiles. Basic design seems to work 2014-08-27 17:15:39 +02:00
Matthew Honnibal d10993f41a * More docs work 2014-08-21 16:37:13 +02:00
Matthew Honnibal 3379d7a571 * Reforming data model for lexemes 2014-08-19 02:40:37 +02:00
Matthew Honnibal 01469b0888 * Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word. 2014-08-18 19:14:00 +02:00
Matthew Honnibal ff1869ff07 * Fixed major efficiency problem, from not quite grokking pass by reference in cython c++ 2014-07-07 07:36:43 +02:00
Matthew Honnibal 25849fc926 * Generalize tokenization rules to capitals 2014-07-07 05:07:21 +02:00
Matthew Honnibal 4e79446dc2 * Reading in tokenization rules correctly. Passing tests. 2014-07-07 00:02:55 +02:00
Matthew Honnibal 556f6a18ca * Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc. 2014-07-05 20:51:42 +02:00