Commit Graph

34 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Chris DuBois | dac8fe7bdb | Add `__reduce__` to `Tokenizer` so that English pickles; add tests to `test_pickle` and `test_tokenizer` that save to tempfiles. | 2015-10-23 22:24:03 -07:00 |
| Matthew Honnibal | 3ba66f2dc7 | * Add string length cap in `Tokenizer.__call__` | 2015-10-16 04:54:16 +11:00 |
| Matthew Honnibal | c2307fa9ee | * More work on language-generic parsing | 2015-08-28 02:02:33 +02:00 |
| Matthew Honnibal | 119c0f8c3f | * Hack out morphology stuff from tokenizer, while morphology being reimplemented. | 2015-08-26 19:20:11 +02:00 |
| Matthew Honnibal | 9c4d0aae62 | * Switch to better Python2/3 compatible unicode handling | 2015-07-28 14:45:37 +02:00 |
| Matthew Honnibal | 0c507bd80a | * Fix tokenizer | 2015-07-22 14:10:30 +02:00 |
| Matthew Honnibal | 2fc66e3723 | * Use `Py_UNICODE` in tokenizer for now, while sort out `Py_UCS4` stuff | 2015-07-22 13:38:45 +02:00 |
| Matthew Honnibal | 109106a949 | * Replace `UniStr`, using unicode objects instead | 2015-07-22 04:52:05 +02:00 |
| Matthew Honnibal | e49c7f1478 | * Update oov check in tokenizer | 2015-07-18 22:45:28 +02:00 |
| Matthew Honnibal | cfd842769e | * Allow infix tokens to be variable length | 2015-07-18 22:45:00 +02:00 |
| Matthew Honnibal | 3b5baa660f | * Fix tokenizer | 2015-07-14 00:10:51 +02:00 |
| Matthew Honnibal | 24d6ce99ec | * Add comment to tokenizer, explaining the `spacy` attr | 2015-07-13 22:29:13 +02:00 |
| Matthew Honnibal | 67641f3b58 | * Refactor tokenizer, to set the `spacy` field on `TokenC` instead of passing a string | 2015-07-13 21:46:02 +02:00 |
| Matthew Honnibal | 6eef0bf9ab | * Break up `tokens.pyx` into `tokens/doc.pyx`, `tokens/token.pyx`, `tokens/spans.pyx` | 2015-07-13 20:20:58 +02:00 |
| Matthew Honnibal | bb522496dd | * Rename `Tokens` to `Doc` | 2015-07-08 18:53:00 +02:00 |
| Matthew Honnibal | 935bcdf3e5 | * Remove redundant `tag_names` argument to `Tokenizer` | 2015-07-08 12:36:04 +02:00 |
| Matthew Honnibal | 2d0e99a096 | * Pass `pos_tags` into `Tokenizer.from_dir` | 2015-07-07 14:23:08 +02:00 |
| Matthew Honnibal | 6788c86b2f | * Begin refactor | 2015-07-07 14:00:07 +02:00 |
| Matthew Honnibal | 98cfd84123 | * Remove hyphenation from main tokenizer loop: do it in `infix.txt` instead. This lets emoticons work | 2015-06-06 05:57:03 +02:00 |
| Matthew Honnibal | 20f1d868a3 | * Tmp commit. Working on whole document parsing | 2015-05-24 02:49:56 +02:00 |
| Jordan Suchow | 3a8d9b37a6 | Remove trailing whitespace | 2015-04-19 13:01:38 -07:00 |
| Matthew Honnibal | f02c39dfaf | * Compare to `is not None`, for more robustness | 2015-03-26 16:44:48 +01:00 |
| Matthew Honnibal | 7237c805c7 | * Load tag for `specials.json` token | 2015-03-26 16:44:46 +01:00 |
| Matthew Honnibal | 0492cee8b4 | * Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens | 2015-02-08 18:30:30 -05:00 |
| Matthew Honnibal | 4ff180db74 | * Fix off-by-one error in commit 0a7fceb | 2015-01-30 12:49:33 +11:00 |
| Matthew Honnibal | 0a7fcebdf7 | * Fix Issue #12: Incorrect `token.idx` calculations for some punctuation, in the presence of token cache | 2015-01-30 12:33:38 +11:00 |
| Matthew Honnibal | 5928d158ce | * Pass the string to `Tokens` | 2015-01-22 02:04:58 +11:00 |
| Matthew Honnibal | 6c7e44140b | * Work on word vectors, and other stuff | 2015-01-17 16:21:17 +11:00 |
| Matthew Honnibal | ce2edd6312 | * Tmp commit. Refactoring to create a Python `Lexeme` class. | 2015-01-12 10:26:22 +11:00 |
| Matthew Honnibal | 3f1944d688 | * Make PyPy work | 2015-01-05 17:54:38 +11:00 |
| Matthew Honnibal | 9976aa976e | * Messily fix morphology and POS tags on special tokens. | 2014-12-30 23:24:37 +11:00 |
| Matthew Honnibal | 4c4aa2c5c9 | * Work on train | 2014-12-22 07:25:43 +11:00 |
| Matthew Honnibal | e1c1a4b868 | * Tmp | 2014-12-21 05:36:29 +11:00 |
| Matthew Honnibal | be1bdcbd85 | * Move `lang.pyx` to `tokenizer.pyx` | 2014-12-20 07:55:40 +11:00 |
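
The most recent commit in the table makes the tokenizer picklable by defining `__reduce__`. As a minimal sketch of that standard Python protocol (using a hypothetical stand-in class, not spaCy's actual `Tokenizer`), `__reduce__` returns a callable plus the arguments pickle should call it with to rebuild the object:

```python
import pickle
import tempfile

class Tokenizer:
    # Hypothetical stand-in for a class whose instances pickle can't
    # reconstruct by default (e.g. a Cython extension type).
    def __init__(self, vocab, rules):
        self.vocab = vocab
        self.rules = rules

    def __reduce__(self):
        # Tell pickle how to rebuild this object: call the class
        # again with the same constructor arguments.
        return (self.__class__, (self.vocab, self.rules))

tok = Tokenizer(vocab={"do": 1, "n't": 2}, rules=["don't"])

# Round-trip through a tempfile, as the commit's tests do.
with tempfile.NamedTemporaryFile(suffix=".pickle", delete=False) as f:
    pickle.dump(tok, f)
with open(f.name, "rb") as f:
    restored = pickle.load(f)
assert restored.rules == tok.rules
```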