Commit Graph

452 Commits

Author SHA1 Message Date
Matthew Honnibal 0a7fcebdf7 * Fix Issue #12: Incorrect token.idx calculations for some punctuation, in the presence of token cache 2015-01-30 12:33:38 +11:00
Matthew Honnibal ebf7d2fab1 * Use non-joint sbd, for more simplicity and fewer classes 2015-01-29 06:22:03 +11:00
Matthew Honnibal d05c5bf141 * Remove comment 2015-01-29 05:19:27 +11:00
Matthew Honnibal 320b045daa * Oracle now consistent over gold standard derivation 2015-01-29 03:41:58 +11:00
Matthew Honnibal f590382134 * Work on sbd 2015-01-29 03:18:29 +11:00
Matthew Honnibal 1884a7a0be * Attach comment with paper 2015-01-28 03:18:43 +11:00
Matthew Honnibal a2d6b195db * Add messy Break transitions, carefully following the scheme of Dd Zhang et al (2013) 2015-01-28 03:09:45 +11:00
Matthew Honnibal f9ee5d9934 * Build a python list of word strings, for debugging 2015-01-28 01:06:13 +11:00
Matthew Honnibal d819101571 * Improve error message on oracle failure 2015-01-28 00:58:03 +11:00
Matthew Honnibal e6c3d3471f * Tweak documentation for Tokens, and hide constructor as __cinit__ 2015-01-27 18:57:52 +11:00
Matthew Honnibal c38c62d4a3 * Add docstring to English class 2015-01-27 02:45:21 +11:00
Matthew Honnibal d4c99f7dec * Add attrs.pxd 2015-01-26 22:22:09 +11:00
Matthew Honnibal d4a493855e * Fix error msg 2015-01-25 23:01:30 +11:00
Matthew Honnibal 7f87716cf7 * Fix download script 2015-01-25 23:01:10 +11:00
Matthew Honnibal 92fb9257dd * Add parts-of-speech file 2015-01-25 22:00:39 +11:00
Matthew Honnibal c1c3dba4cb * Check whether vector files are present before trying to load them. 2015-01-25 18:16:48 +11:00
Matthew Honnibal 5049d4c2e6 * Add parts_of_speech.pyx 2015-01-25 16:32:26 +11:00
Matthew Honnibal 12b034e3ef * Move POS tag definitions to parts_of_speech.pxd 2015-01-25 16:31:07 +11:00
Matthew Honnibal 7431c133d8 * Add error if try to access head and not is_parsed 2015-01-25 15:33:54 +11:00
Matthew Honnibal 951d06c824 * Silently don't parse if data is not present 2015-01-25 14:47:38 +11:00
Matthew Honnibal 4e857ab7a6 * Fix bug in POS tagger feature 2015-01-25 02:20:15 +11:00
Matthew Honnibal dd56e298e2 * Ensure tagging is applied if parse=True 2015-01-25 02:19:44 +11:00
Matthew Honnibal 94750819cd * Set parse=True by default --- i.e. parse unless told not to. 2015-01-25 01:28:28 +11:00
Matthew Honnibal 71b95202eb * Add docstring to StringStore 2015-01-24 20:49:15 +11:00
Matthew Honnibal 6d1c08dafd * Add docstring to Lexeme 2015-01-24 20:48:34 +11:00
Matthew Honnibal a97bed9359 * Fix POS and dependency label tag names. Add parse and string navigation functions. 2015-01-24 17:29:04 +11:00
Matthew Honnibal 76cd024095 * Add whitespace property to Token 2015-01-24 07:41:21 +11:00
Matthew Honnibal 5fd72bc220 * Have 'string' refer to the whitespace-padded string 2015-01-24 07:32:38 +11:00
Matthew Honnibal fda94271af * Rename NORM1 and NORM2 attrs to lower and norm 2015-01-24 06:17:03 +11:00
Matthew Honnibal 5ed8b2b98f * Rename sic to orth 2015-01-23 02:08:25 +11:00
Matthew Honnibal a27b23cc8f * Have SBD return start/end indices 2015-01-22 22:24:44 +11:00
Matthew Honnibal d460c28838 * Rename vec to repvec 2015-01-22 02:06:22 +11:00
Matthew Honnibal 8b9d913d97 * Rename vec to repvec 2015-01-22 02:05:58 +11:00
Matthew Honnibal 9cd0b6b3e9 * Various tweaks to Tokens class 2015-01-22 02:05:37 +11:00
Matthew Honnibal 5928d158ce * Pass the string to Tokens 2015-01-22 02:04:58 +11:00
Matthew Honnibal 45264e356b * Rename vec to repvec 2015-01-22 02:04:24 +11:00
Matthew Honnibal 5e63c606ad * Rename vec to repvec 2015-01-22 02:03:54 +11:00
Matthew Honnibal 56e6cf0672 * Add _string attr to Tokens object 2015-01-21 18:57:09 +11:00
Matthew Honnibal d6ac60e91c * Bug fixes to sentences method, and improved vector transport for tokens 2015-01-21 18:56:32 +11:00
Matthew Honnibal f2a229136c * Fix data_dir=None argument to English class 2015-01-21 18:27:31 +11:00
Matthew Honnibal ef49b8c179 * Add stop-word flag 2015-01-21 18:22:31 +11:00
Matthew Honnibal 6646bfc5df * Add LOWER attr 2015-01-21 18:19:08 +11:00
Matthew Honnibal f149259bf5 * Fix negative indices in tokens 2015-01-20 01:16:29 +11:00
Matthew Honnibal b65b0c07bf * Messily hook up vector in tokens 2015-01-19 19:59:55 +11:00
Matthew Honnibal 8ff5b8bd84 * Add attribute for POS scheme 2015-01-17 17:33:16 +11:00
Matthew Honnibal 6c7e44140b * Work on word vectors, and other stuff 2015-01-17 16:21:17 +11:00
Matthew Honnibal 802867e96a * Revise interface to Token. Strings now have attribute names like norm1_ 2015-01-15 03:51:47 +11:00
Matthew Honnibal 7d3c40de7d * Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme 2015-01-15 00:33:16 +11:00
Matthew Honnibal 0930892fc1 * Tmp. Working on refactor. Compiles, must hook up lexical feats. 2015-01-14 00:03:48 +11:00
Matthew Honnibal 46da3d74d2 * Tmp. Refactoring, introducing a Lexeme PyObject. 2015-01-12 11:23:44 +11:00
Matthew Honnibal ce2edd6312 * Tmp commit. Refactoring to create a Python Lexeme class. 2015-01-12 10:26:22 +11:00
Matthew Honnibal aacaf1a0f0 * Fix parser 2015-01-08 01:19:23 +11:00
Matthew Honnibal 9a21127bf7 * Fix parser, which was importing the wrong model 2015-01-08 00:10:15 +11:00
Matthew Honnibal 6a3e39cdd1 * Add typedefs.pyx 2015-01-06 04:51:40 +11:00
Matthew Honnibal a58920cc5e * Import orth.word_shape as a C module 2015-01-06 03:18:22 +11:00
Matthew Honnibal 6b68f7ef75 * Finally get string types right for orth function 2015-01-06 03:17:39 +11:00
Matthew Honnibal 90c143bd85 * Fix orth import 2015-01-05 18:49:19 +11:00
Matthew Honnibal 7689dccd0f * Remove unused import 2015-01-05 18:48:48 +11:00
Matthew Honnibal 3f1944d688 * Make PyPy work 2015-01-05 17:54:38 +11:00
Matthew Honnibal a510d9f677 * Another assertion removed 2015-01-05 13:01:40 +11:00
Matthew Honnibal 2856946a66 * Remove assertion that doesn't work on Python 3 2015-01-05 12:51:16 +11:00
Matthew Honnibal 94034f1112 * Fix encoding in lemmatization 2015-01-05 11:54:29 +11:00
Matthew Honnibal b132b3caa6 * Fix unicode error in lemmatizer 2015-01-05 11:53:54 +11:00
Matthew Honnibal 477e7fbffe * Fix data reading for lemmatizer 2015-01-05 06:01:32 +11:00
Matthew Honnibal 58f75abaca * Fix unicode error in orth 2015-01-05 05:53:08 +11:00
Matthew Honnibal 4e085d5166 * Fix lemmatizer for Python3 2015-01-05 05:51:26 +11:00
Matthew Honnibal ae7c811fd1 * Use Exception instead of StandardError 2015-01-04 01:22:12 +11:00
Matthew Honnibal 0e4c2ba036 * Fix loading of special morph words 2015-01-03 23:13:00 +11:00
Matthew Honnibal f5d41028b5 * Move around data files for test release 2015-01-03 01:59:22 +11:00
Matthew Honnibal a24321b63a * Add downloader 2015-01-02 21:44:41 +11:00
Matthew Honnibal 5d9a096e2f * Some minor clean-up after HastyModel 2014-12-31 19:46:04 +11:00
Matthew Honnibal aafaf58cbe * Refactor _ml.Model, and finish implementing HastyModel so far not worthwhile. 2014-12-31 19:40:59 +11:00
Matthew Honnibal bcd038e7b6 * Implement HastyModel 2014-12-31 01:16:47 +11:00
Matthew Honnibal 1a075f77ff * Don't over-ride pre-loaded POS tags, if set by special-cases 2014-12-30 23:26:32 +11:00
Matthew Honnibal 785c7ba76a * Embed signature on attrs 2014-12-30 23:25:31 +11:00
Matthew Honnibal 30e5805656 * Lazy-load tagger and parser 2014-12-30 23:25:09 +11:00
Matthew Honnibal 9976aa976e * Messily fix morphology and POS tags on special tokens. 2014-12-30 23:24:37 +11:00
Matthew Honnibal c1ef3febee * Embedsignature in tokens.pyx 2014-12-30 21:22:00 +11:00
Matthew Honnibal aac5028b6e * Move tagger to _ml 2014-12-30 21:21:38 +11:00
Matthew Honnibal 1ffb0229ed * Import tokens in parser.pxd 2014-12-30 21:21:17 +11:00
Matthew Honnibal bb0b00f819 * Repurpose the Tagger class as a generic Model, wrapping thinc's interface 2014-12-30 21:20:15 +11:00
Matthew Honnibal fe2a5e0370 * Work on docstrings 2014-12-27 21:46:04 +11:00
Matthew Honnibal bb80937544 * Upd docstrings 2014-12-27 18:45:16 +11:00
Matthew Honnibal b8b65903fc * Tmp 2014-12-24 17:42:00 +11:00
Matthew Honnibal ab61673edd * Fix api of array method 2014-12-23 15:18:48 +11:00
Matthew Honnibal 7708d0e24a * Move lemmatizer to en dir 2014-12-23 15:16:57 +11:00
Matthew Honnibal 98eb4c0426 * Fix path to parser model 2014-12-23 15:09:09 +11:00
Matthew Honnibal b00bc01d8c * All tests now passing for reorg 2014-12-23 13:18:59 +11:00
Matthew Honnibal 73f200436f * Tests passing except for morphology/lemmatization stuff 2014-12-23 11:40:32 +11:00
Matthew Honnibal cf8d26c3d2 * POS tagger training working after reorg 2014-12-22 08:54:47 +11:00
Matthew Honnibal 4c4aa2c5c9 * Work on train 2014-12-22 07:25:43 +11:00
Matthew Honnibal 61df50b598 * Add English-subclass POS tagger 2014-12-21 20:59:07 +11:00
Matthew Honnibal 9f3f07cab6 * Add attrs file for English 2014-12-21 11:29:11 +11:00
Matthew Honnibal 2a89d70429 * Add vocab.pyx to setup, and ensure we can import spacy.en.lang 2014-12-21 06:03:53 +11:00
Matthew Honnibal b34a1325d3 * Everything compiling after reorg. About to start testing. 2014-12-21 05:42:23 +11:00
Matthew Honnibal e1c1a4b868 * Tmp 2014-12-21 05:36:29 +11:00
Matthew Honnibal d11c1edf8c * Import slice_unicode from strings.pyx 2014-12-20 07:56:26 +11:00
Matthew Honnibal be1bdcbd85 * Move lang.pyx to tokenizer.pyx 2014-12-20 07:55:40 +11:00
Matthew Honnibal 89a1cc1a48 * Move murmurhash to .pxd in strings file 2014-12-20 07:41:08 +11:00
Matthew Honnibal d5a942c4a4 * Rename lang.pyx to tokenizer.pyx 2014-12-20 07:30:39 +11:00