Commit Graph

490 Commits

Author SHA1 Message Date
Matthew Honnibal b132b3caa6 * Fix unicode error in lemmatizer 2015-01-05 11:53:54 +11:00
Matthew Honnibal 477e7fbffe * Fix data reading for lemmatizer 2015-01-05 06:01:32 +11:00
Matthew Honnibal 58f75abaca * Fix unicode error in orth 2015-01-05 05:53:08 +11:00
Matthew Honnibal 4e085d5166 * Fix lemmatizer for Python3 2015-01-05 05:51:26 +11:00
Matthew Honnibal ae7c811fd1 * Use Exception instead of StandardError 2015-01-04 01:22:12 +11:00
Matthew Honnibal 0e4c2ba036 * Fix loading of special morph words 2015-01-03 23:13:00 +11:00
Matthew Honnibal f5d41028b5 * Move around data files for test release 2015-01-03 01:59:22 +11:00
Matthew Honnibal a24321b63a * Add downloader 2015-01-02 21:44:41 +11:00
Matthew Honnibal 5d9a096e2f * Some minor clean-up after HastyModel 2014-12-31 19:46:04 +11:00
Matthew Honnibal aafaf58cbe * Refactor _ml.Model, and finish implementing HastyModel. So far not worthwhile. 2014-12-31 19:40:59 +11:00
Matthew Honnibal bcd038e7b6 * Implement HastyModel 2014-12-31 01:16:47 +11:00
Matthew Honnibal 1a075f77ff * Don't over-ride pre-loaded POS tags, if set by special-cases 2014-12-30 23:26:32 +11:00
Matthew Honnibal 785c7ba76a * Embed signature on attrs 2014-12-30 23:25:31 +11:00
Matthew Honnibal 30e5805656 * Lazy-load tagger and parser 2014-12-30 23:25:09 +11:00
Matthew Honnibal 9976aa976e * Messily fix morphology and POS tags on special tokens. 2014-12-30 23:24:37 +11:00
Matthew Honnibal c1ef3febee * Embedsignature in tokens.pyx 2014-12-30 21:22:00 +11:00
Matthew Honnibal aac5028b6e * Move tagger to _ml 2014-12-30 21:21:38 +11:00
Matthew Honnibal 1ffb0229ed * Import tokens in parser.pxd 2014-12-30 21:21:17 +11:00
Matthew Honnibal bb0b00f819 * Repurpose the Tagger class as a generic Model, wrapping thinc's interface 2014-12-30 21:20:15 +11:00
Matthew Honnibal fe2a5e0370 * Work on docstrings 2014-12-27 21:46:04 +11:00
Matthew Honnibal bb80937544 * Upd docstrings 2014-12-27 18:45:16 +11:00
Matthew Honnibal b8b65903fc * Tmp 2014-12-24 17:42:00 +11:00
Matthew Honnibal ab61673edd * Fix api of array method 2014-12-23 15:18:48 +11:00
Matthew Honnibal 7708d0e24a * Move lemmatizer to en dir 2014-12-23 15:16:57 +11:00
Matthew Honnibal 98eb4c0426 * Fix path to parser model 2014-12-23 15:09:09 +11:00
Matthew Honnibal b00bc01d8c * All tests now passing for reorg 2014-12-23 13:18:59 +11:00
Matthew Honnibal 73f200436f * Tests passing except for morphology/lemmatization stuff 2014-12-23 11:40:32 +11:00
Matthew Honnibal cf8d26c3d2 * POS tagger training working after reorg 2014-12-22 08:54:47 +11:00
Matthew Honnibal 4c4aa2c5c9 * Work on train 2014-12-22 07:25:43 +11:00
Matthew Honnibal 61df50b598 * Add English-subclass POS tagger 2014-12-21 20:59:07 +11:00
Matthew Honnibal 9f3f07cab6 * Add attrs file for English 2014-12-21 11:29:11 +11:00
Matthew Honnibal 2a89d70429 * Add vocab.pyx to setup, and ensure we can import spacy.en.lang 2014-12-21 06:03:53 +11:00
Matthew Honnibal b34a1325d3 * Everything compiling after reorg. About to start testing. 2014-12-21 05:42:23 +11:00
Matthew Honnibal e1c1a4b868 * Tmp 2014-12-21 05:36:29 +11:00
Matthew Honnibal d11c1edf8c * Import slice_unicode from strings.pyx 2014-12-20 07:56:26 +11:00
Matthew Honnibal be1bdcbd85 * Move lang.pyx to tokenizer.pyx 2014-12-20 07:55:40 +11:00
Matthew Honnibal 89a1cc1a48 * Move murmurhash to .pxd in strings file 2014-12-20 07:41:08 +11:00
Matthew Honnibal d5a942c4a4 * Rename lang.pyx to tokenizer.pyx 2014-12-20 07:30:39 +11:00
Matthew Honnibal a60ae261ae * Move tokenizer to its own file, and refactor 2014-12-20 07:29:16 +11:00
Matthew Honnibal 867a4a000c * Export set_morph_from_dict function 2014-12-20 07:28:27 +11:00
Matthew Honnibal 4e30195c6d * Refactor morphology.pyx 2014-12-20 07:27:28 +11:00
Matthew Honnibal 4c6ce7ee84 * Update tokens.pyx as part of reorg 2014-12-20 07:03:26 +11:00
Matthew Honnibal 116f7f3bc1 * Rename Lexicon to Vocab, and move it to its own file 2014-12-20 06:54:03 +11:00
Matthew Honnibal 780cbd68b1 * Move all struct definitions to structs.pxd, to avoid circular dependencies 2014-12-20 06:51:33 +11:00
Matthew Honnibal f6556d8e5d * Refactor, move Lexeme struct to structs.pxd 2014-12-20 06:51:03 +11:00
Matthew Honnibal 7d48bba6c4 * Move StringStore class to its own file 2014-12-20 06:42:01 +11:00
Matthew Honnibal b066102d2d * Remove POS cache for now 2014-12-20 03:49:58 +11:00
Matthew Honnibal ff252dd535 * Clean up 'guess_cache' idea, which didn't work well enough 2014-12-20 03:49:11 +11:00
Matthew Honnibal 9d3ca13909 * Start work on parse-tree iteration classes 2014-12-20 03:48:10 +11:00
Matthew Honnibal bed680c632 * Remove commented-out features 2014-12-20 03:47:32 +11:00
Matthew Honnibal 3d178c03ae * Prune the features a bit 2014-12-20 02:46:14 +11:00
Matthew Honnibal a0408e1758 * Working DecisionMemory class 2014-12-20 01:43:26 +11:00
Matthew Honnibal 7920ea72b4 * Working parser with the decision memory idea. Disabling that for now, for simplicity 2014-12-20 01:43:15 +11:00
Matthew Honnibal a2f2a48da9 * Add some extra features 2014-12-20 01:42:24 +11:00
Matthew Honnibal 8fd9762d91 * Start laying out parse tree iteration methods 2014-12-20 01:42:09 +11:00
Matthew Honnibal 53b8bc1f3c * Work on implementing a trainable cache for the parser. So far, doesn't improve efficiency 2014-12-19 09:30:50 +11:00
Matthew Honnibal 033d6c9ac2 * Adapt POS tagger decision-memory for use in parser 2014-12-19 07:23:04 +11:00
Matthew Honnibal 809ddf7887 * Add index.pxd 2014-12-19 07:23:00 +11:00
Matthew Honnibal 1879abd16a * Set const-correctness for tagger 2014-12-18 20:41:52 +11:00
Matthew Honnibal f72243b156 * Set const-correctness for Feature* array 2014-12-18 20:41:32 +11:00
Matthew Honnibal 6ab7e40590 * Add non-monotonic parsing with cost-sensitive update. 92.26 on Y&M set 2014-12-18 11:33:25 +11:00
Matthew Honnibal 7e0c692daf * Automatically push when the stack is empty 2014-12-18 09:16:10 +11:00
Matthew Honnibal 61142a8eff * Tweak features 2014-12-18 09:15:03 +11:00
Matthew Honnibal 8446ebfbbb * Work on parser. Up to 92 UAS on YM labels 2014-12-18 09:05:31 +11:00
Matthew Honnibal 55de747bfc * Remove .cpp files 2014-12-18 02:43:13 +11:00
Matthew Honnibal 4448a840f7 * Work on greedy parsing. Scoring about 91.2 2014-12-18 02:42:55 +11:00
Matthew Honnibal 87e9487d76 * Work on parser 2014-12-17 21:10:12 +11:00
Matthew Honnibal 9d7d97978d * Work on greedy parser 2014-12-17 21:09:29 +11:00
Matthew Honnibal d524dd306a * Work on greedy parser 2014-12-17 03:19:43 +11:00
Matthew Honnibal 95ccea03b2 * Work on greedy parser 2014-12-16 22:46:55 +11:00
Matthew Honnibal a432862fde * Add exception type to _arg_max_among in tagger 2014-12-16 09:44:19 +11:00
Matthew Honnibal 9e00798820 * Work on integrating a greedy dependency parser 2014-12-16 08:06:04 +11:00
Matthew Honnibal 792802b2b9 * POS tag memoisation working, with good speed-up 2014-12-12 14:33:51 +11:00
Matthew Honnibal ca54d58638 * Merge setup.py 2014-12-10 15:21:27 +11:00
Matthew Honnibal 9959a64f7b * Working morphology and lemmatisation. POS tagging quite fast. 2014-12-10 08:09:32 +11:00
Matthew Honnibal df3be14987 * Add pos_type features to POS tagger 2014-12-10 08:08:55 +11:00
Matthew Honnibal 42973c4b37 * Improve efficiency of tagger, and improve morphological processing 2014-12-10 01:02:04 +11:00
Matthew Honnibal 6b34a2f34b * Move morphological analysis into its own module, morphology.pyx 2014-12-09 21:16:17 +11:00
Matthew Honnibal b962fe73d7 * Make suffixes file use full-power regex, so that we can handle periods properly 2014-12-09 19:04:27 +11:00
Matthew Honnibal accdbe989b * Remove Tokens.extend method 2014-12-09 17:09:23 +11:00
Matthew Honnibal 495e1c7366 * Use fused type in Tokens.push_back, simplifying the use of the cache 2014-12-09 16:50:01 +11:00
Matthew Honnibal 302e09018b * Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas 2014-12-09 14:48:01 +11:00
Matthew Honnibal 99bbbb6feb * Work on morphological processing 2014-12-08 21:12:15 +11:00
Matthew Honnibal 7b68f911cf * Add WordNet lemmatizer 2014-12-08 01:39:13 +11:00
Matthew Honnibal c20dd79748 * Fiddle with const correctness and comments 2014-12-08 00:03:55 +11:00
Matthew Honnibal b031c7c430 * Remove language-general context module 2014-12-07 23:53:01 +11:00
Matthew Honnibal ef4398b204 * Rearrange POS stuff, so that language-specific stuff can live in language-specific modules 2014-12-07 23:52:41 +11:00
Matthew Honnibal 327383e38a * Remove unused code in tagger.pyx 2014-12-07 22:16:17 +11:00
Matthew Honnibal 9f17467c2e * Fix EMPTY_TOKEN 2014-12-07 22:07:41 +11:00
Matthew Honnibal 3819a88e1b * Add support for tag dictionary, and fix error-code for predict method 2014-12-07 22:07:16 +11:00
Matthew Honnibal f00afe12c4 * Load POS tagger in load() function if path exists 2014-12-07 22:05:57 +11:00
Matthew Honnibal 5fe5e6e66b * Move context functions to header, inlining them. 2014-12-07 21:59:04 +11:00
Matthew Honnibal 5caabec789 * Link in tagger, to work on integrating POS tagging 2014-12-07 15:29:41 +11:00
Matthew Honnibal 0c7aeb9de7 * Begin revising tagger, focussing on POS tagging 2014-12-07 15:29:04 +11:00
Matthew Honnibal f5c4f2eb52 * Revise context, focussing on POS tagging for now 2014-12-07 15:28:22 +11:00
Matthew Honnibal e27b912ef9 * Remove need for confusing _data pointer to be stored on Tokens 2014-12-05 16:31:30 +11:00
Matthew Honnibal 1c9253701d * Introduce a TokenC struct, to handle token indices, pos tags and sense tags 2014-12-05 15:56:14 +11:00
Matthew Honnibal 187372c7f3 * Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached 2014-12-05 03:29:50 +11:00
Matthew Honnibal 75b8dfb348 * Remove upper_pc from lexeme.pyx 2014-12-04 22:14:34 +11:00
Matthew Honnibal 49f3780ff5 * Fiddle with lexeme attrs 2014-12-04 21:22:38 +11:00
Matthew Honnibal 564082e48e * Hack Token class to take lex.dense in place of the old lex.norm. This needs to be fixed... 2014-12-04 20:51:29 +11:00
Matthew Honnibal 69bb022204 * Add as_array and count_by method 2014-12-04 20:46:55 +11:00
Matthew Honnibal e1b1f45cc9 * Add STEM attribute to lexeme 2014-12-04 20:46:20 +11:00
Matthew Honnibal d7952634ca * Make the string-store serve const pointers to Utf8Str 2014-12-03 16:01:47 +11:00
Matthew Honnibal 7e04c22f8f * const added to Lexicon interface. Seems to work. 2014-12-03 15:58:17 +11:00
Matthew Honnibal d70d31aa45 * Introduce first attempt at const-ness 2014-12-03 15:44:25 +11:00
Matthew Honnibal 4560ada85b * Add typedef for attr_t. Change flag_t to flags_t 2014-12-03 11:06:31 +11:00
Matthew Honnibal e600f7b327 * Move String struct stuff into the utf8string module, from spacy.lang 2014-12-03 11:06:00 +11:00
Matthew Honnibal e170faf5b0 * Hack Tokens to work without tagger.pyx 2014-12-03 11:05:15 +11:00
Matthew Honnibal b463a7eb86 * Make flag-setting a language-specific thing 2014-12-03 11:04:32 +11:00
Matthew Honnibal 71b009e323 * Fix bug in refactored StringStore.__getitem__ 2014-12-03 11:02:24 +11:00
Matthew Honnibal 14097311ae * Make StringStore.__getitem__ accept unicode-typed keys. 2014-12-03 01:33:20 +11:00
Matthew Honnibal 522bb0346e * Work on get_array method of Tokens 2014-12-02 23:48:05 +11:00
Matthew Honnibal 8c2938fe01 * Rename Lexicon._dict to Lexicon._map 2014-12-02 23:46:59 +11:00
Matthew Honnibal 33dfb4933c * Remove taggers from Language class. Work on doc strings 2014-11-26 19:53:55 +11:00
Matthew Honnibal 80baa2e3db * Work on beam parser 2014-11-20 19:49:33 +11:00
Matthew Honnibal 5c3016bac8 * Tmp commit of ner code 2014-11-14 18:27:47 +11:00
Matthew Honnibal 33c421bcf8 * More feature tweaks 2014-11-12 23:59:16 +11:00
Matthew Honnibal 41dedfb14e * Add label features for NER parsing 2014-11-12 23:55:10 +11:00
Matthew Honnibal cf55b48ba6 * Switch to predict label on shift. Big increase in accuracy. 2014-11-12 23:50:12 +11:00
Matthew Honnibal 8f84e8a78b * Neaten oracle 2014-11-12 23:38:07 +11:00
Matthew Honnibal 7e0a9077dd * Add context files 2014-11-12 23:22:36 +11:00
Matthew Honnibal 3b0b902384 * IOB-style parsing working. Accuracy down from BILOU, from 87-88 to 85-86 2014-11-12 23:21:09 +11:00
Matthew Honnibal e6bb8aa3a9 * Move moves to bilou_moves. Refactor context, returning to the simpler giant-enum style 2014-11-12 00:54:50 +11:00
Matthew Honnibal c788633429 * Add tokens_from_list method to Language 2014-11-11 23:43:14 +11:00
Matthew Honnibal 95282d4993 * Use the dynamic oracle 'follow' strategy 2014-11-11 21:11:17 +11:00
Matthew Honnibal 5aaf7a024d * Move ner features to ner subdir 2014-11-11 21:09:03 +11:00
Matthew Honnibal ff8989b63c * Use greedy NER parser 2014-11-11 21:08:35 +11:00
Matthew Honnibal 0d943ab358 * Fixed greedy NER parsing. With static oracle, replicates accuracy from tagger. 2014-11-11 17:17:54 +11:00
Matthew Honnibal 399239760b * Fix moves for new State struct 2014-11-10 22:16:05 +11:00
Matthew Honnibal 82247169f2 * Implement validation and oracle on pystate, for testing 2014-11-10 22:15:32 +11:00
Matthew Honnibal 3709ed9d6d * Add curr field to State, to handle entity being built 2014-11-10 22:14:36 +11:00
Matthew Honnibal af9ed18cf1 * Bug fixes to NER 2014-11-10 17:39:23 +11:00
Matthew Honnibal 9f2587f5ec * Work on shift-reduce NER 2014-11-10 16:28:56 +11:00
Matthew Honnibal f307eb2e36 * Refactor context extraction, and start breaking out gold standards into their own functions 2014-11-09 15:43:07 +11:00
Matthew Honnibal 602f993af9 * Moving tagger to accept multiple correct answers 2014-11-09 15:18:33 +11:00
Matthew Honnibal f37d896a42 * Upd NER feats. With adadelta learner, getting 76.9 on NER 2014-11-07 04:43:54 +11:00
Matthew Honnibal 68d1cdad62 * When encoding POS/NER tags, accept '-' as a missing value 2014-11-07 04:42:31 +11:00
Matthew Honnibal 949a6245f9 * Increase default number of iterations from 5 to 10 2014-11-07 04:42:04 +11:00
Matthew Honnibal 3cab1d9a29 * Refine word_shape feature, by trimming the max sequence length 2014-11-07 04:41:29 +11:00
Matthew Honnibal b4454cf036 * Add extra context tokens 2014-11-07 04:40:36 +11:00
Matthew Honnibal 50309e6e49 * Fix context vector, importing all features 2014-11-05 22:11:39 +11:00
Matthew Honnibal 07a23768de * Play with NER feats a bit. Up to 82.00 training on MUC7. 2014-11-05 21:47:17 +11:00
Matthew Honnibal 4ecbe8c893 * Complete refactor of Tagger features, to use a generic list of context names. 2014-11-05 20:45:29 +11:00
Matthew Honnibal 0a8c84625d * Moving feature context stuff to a generalized place 2014-11-05 19:55:10 +11:00
Matthew Honnibal 3733444101 * Generalize tagger code, in preparation for NER and supersense tagging. 2014-11-05 03:42:14 +11:00
Matthew Honnibal abbe3e44b0 * Move spacy.pos tagger to spacy.tagger, and generalize it so that it can take on other tagging tasks, given a different set of feature templates. 2014-11-05 00:37:59 +11:00
Matthew Honnibal 954c970415 * Add __iter__ method to tokens 2014-11-04 01:07:08 +11:00
Matthew Honnibal f07457a91f * Remove POS alignment stuff. Now use training data based on raw text, instead of clumsy detokenization stuff 2014-11-04 01:06:43 +11:00
Matthew Honnibal ae52f9f38c * Remove vocab10k from tokens 2014-11-03 00:23:20 +11:00
Matthew Honnibal 32fb50dc35 * Remove non_sparse method --- features wanting this can do it easily enough. 2014-11-03 00:15:47 +11:00
Matthew Honnibal b5ae1471db * Fiddle with POS tag features 2014-11-03 00:15:03 +11:00
Matthew Honnibal 70ea862703 * Remove vocab10k field, and add flags for gazetteers 2014-11-03 00:13:51 +11:00
Matthew Honnibal 711ed0f636 * Whitespace 2014-11-02 14:22:32 +11:00
Matthew Honnibal fcd9490d56 * Add pos_tag method to Language 2014-11-02 14:21:43 +11:00
Matthew Honnibal 829bb2bdbe * Add mappings to Twitter POS tag corpus 2014-11-02 13:21:19 +11:00
Matthew Honnibal 437cd2217d * Fix strings i/o, removing use of ujson library in favour of plain text file. Allows better control of codecs. 2014-11-02 13:20:37 +11:00
Matthew Honnibal 3352e89e21 * Use LIKE_URL and LIKE_NUMBER flag features. Seems to improve accuracy on onto web 2014-11-02 13:19:54 +11:00
Matthew Honnibal 8335706321 * Add LIKE_URL and LIKE_NUMBER flag features 2014-11-02 13:19:23 +11:00
Matthew Honnibal 5484fbea69 * Implement is_number 2014-11-01 19:13:24 +11:00
Matthew Honnibal f685218e21 * Add is_urlish function 2014-11-01 17:39:34 +11:00
Matthew Honnibal 09a3e54176 * Delete print statements from stringstore 2014-10-31 17:45:26 +11:00
Matthew Honnibal b186a66bae * Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype 2014-10-31 17:44:39 +11:00
Matthew Honnibal a8ca078b24 * Restore lexemes field to lexicon 2014-10-31 17:43:25 +11:00
Matthew Honnibal 6c807aa45f * Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries 2014-10-31 17:43:00 +11:00
Matthew Honnibal aaf6953fe0 * Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web. 2014-10-31 17:42:15 +11:00
Matthew Honnibal f67cb9a5a3 * Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web. 2014-10-31 17:42:04 +11:00
Matthew Honnibal ea8f1e7053 * Tighten interfaces 2014-10-30 18:14:42 +11:00
Matthew Honnibal ea85bf3a0a * Tighten the interface to Language 2014-10-30 18:01:27 +11:00
Matthew Honnibal c6fcd03692 * Small efficiency tweak to lexeme init 2014-10-30 17:56:11 +11:00
Matthew Honnibal 87c2418a89 * Fiddle with data types on Lexeme, to compress them to a much smaller size. 2014-10-30 15:42:15 +11:00
Matthew Honnibal ac88893232 * Fix Token after lexeme changes 2014-10-30 15:30:52 +11:00
Matthew Honnibal e6b87766fe * Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme 2014-10-30 15:21:38 +11:00
Matthew Honnibal 889b7b48b4 * Fix POS tagger, so that it loads correctly. Lexemes are being read in. 2014-10-30 13:38:55 +11:00
Matthew Honnibal 67c8c8019f * Update lexeme serialization, using a binary file format 2014-10-30 01:01:00 +11:00
Matthew Honnibal 13909a2e24 * Rewriting Lexeme serialization. 2014-10-29 23:19:38 +11:00
Matthew Honnibal 234d49bf4d * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
Matthew Honnibal 08ce602243 * Large refactor, particularly to Python API 2014-10-24 00:59:17 +11:00
Matthew Honnibal 7baef5b7ff * Fix padding on tokens 2014-10-23 04:01:17 +11:00
Matthew Honnibal 96b835a3d4 * Upd for refactored Tokens class. Now gets 95.74, 185ms training on swbd_wsj_ewtb, eval on onto_web, Google POS tags. 2014-10-23 03:20:02 +11:00
Matthew Honnibal e5e951ae67 * Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding. 2014-10-23 01:57:59 +11:00
Matthew Honnibal ea1d4a81eb * Refactoring get_atoms, improving tokens API 2014-10-22 13:10:56 +11:00
Matthew Honnibal ad49e2482e * Tagger now gets 97pc on wsj, parsing 19-21 in 500ms. Gets 92.7 on web text. 2014-10-22 12:57:06 +11:00
Matthew Honnibal 0a0e41f6c8 * Add prefix and suffix features 2014-10-22 12:56:09 +11:00
Matthew Honnibal 7018b53d3a * Improve array features in tokens 2014-10-22 12:55:42 +11:00
Matthew Honnibal 43d5964e13 * Add function to read detokenization rules 2014-10-22 12:54:59 +11:00
Matthew Honnibal 224bdae996 * Add POS utilities 2014-10-22 10:17:57 +11:00
Matthew Honnibal 5ebe14f353 * Add greedy pos tagger 2014-10-22 10:17:26 +11:00
Matthew Honnibal 12742f4f83 * Add detokenize method and test 2014-10-18 18:07:29 +11:00
Matthew Honnibal 99f5e59286 * Have tokenizer emit tokens for whitespace other than single spaces 2014-10-14 20:25:57 +11:00
Matthew Honnibal 43743a5d63 * Work on efficiency 2014-10-14 18:22:41 +11:00
Matthew Honnibal 6fb42c4919 * Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang 2014-10-14 16:17:45 +11:00
Matthew Honnibal 2805068ca8 * Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer 2014-10-14 15:21:03 +11:00
Matthew Honnibal 65d3ead4fd * Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id 2014-10-14 15:19:07 +11:00
Matthew Honnibal 868e558037 * Preparations in place to handle hyphenation etc 2014-10-10 20:23:23 +11:00
Matthew Honnibal ff79dbac2e * More slight cleaning for lang.pyx 2014-10-10 20:11:22 +11:00
Matthew Honnibal 3d82ed1e5e * More slight cleaning for lang.pyx 2014-10-10 19:50:07 +11:00
Matthew Honnibal 02e948e7d5 * Remove counts stuff from Language class 2014-10-10 19:25:01 +11:00
Matthew Honnibal 71ee921055 * Slight cleaning of tokenizer code 2014-10-10 19:17:22 +11:00
Matthew Honnibal 59b41a9fd3 * Switch to new data model, tests passing 2014-10-10 08:11:31 +11:00