490 Commits

Author SHA1 Message Date
Matthew Honnibal ab8bb047d0 * Fix negative index for __getitem__ 2015-02-07 12:58:46 -05:00
Matthew Honnibal 44c7eafe44 * Fix download.py 2015-02-07 12:00:36 -05:00
Matthew Honnibal 6ca7f2eedc * Upd download script 2015-02-07 11:32:33 -05:00
Matthew Honnibal f0e0588833 * Fill L2 norm attribute on LexemeC struct 2015-02-07 08:44:42 -05:00
Matthew Honnibal 75f9b7d6bf * Add L2 norm field to LexemeC struct 2015-02-07 08:43:17 -05:00
Matthew Honnibal 51b618d646 * Add a has_repvec property to Lexeme, and a check function to check flags 2015-02-07 08:42:44 -05:00
Matthew Honnibal 321b402739 * Store the l2 norm of the word's vector 2015-02-07 08:42:16 -05:00
Matthew Honnibal c7d8644149 * Fix regression on 'prob' attr of Token. 2015-02-03 03:32:18 +11:00
Matthew Honnibal c55a33d045 * Catch oracle errors 2015-02-02 23:02:04 +11:00
Matthew Honnibal de772088e6 * Use parse tree for sbd in Tokens.sents 2015-02-02 12:17:32 +11:00
Matthew Honnibal 56c2ef2982 * Tweak POS features for web text 2015-02-02 11:59:36 +11:00
Matthew Honnibal d68678a93e * Add Exception class, OracleError 2015-02-02 11:57:32 +11:00
Matthew Honnibal a20fdbd8ee * Upd download script 2015-02-01 13:22:23 +11:00
Matthew Honnibal 76d9394cb4 * Fix vocab.pyx for Python3 2015-02-01 13:14:04 +11:00
Matthew Honnibal 63abdf154c * Hastily hack download file 2015-01-31 22:48:32 +11:00
Matthew Honnibal 7de00c5a79 * Try not holding a reference to Pool, since that seems to confuse the GC 2015-01-31 22:10:22 +11:00
Matthew Honnibal ce3ae8b5d9 * Fix platform-specific lexicon bug. 2015-01-31 16:38:58 +11:00
Matthew Honnibal a1ed574b7b * Fix default model path for English 2015-01-31 16:38:27 +11:00
Matthew Honnibal 018e0bfa24 * Bug fixes to parse navigation 2015-01-31 16:37:13 +11:00
Matthew Honnibal e013555b25 * Add option to download script 2015-01-31 13:51:56 +11:00
Matthew Honnibal 08ca5c8970 * Add sent_end flag to TokenC struct 2015-01-31 13:44:16 +11:00
Matthew Honnibal 024cfd485c * Pass tag_strings as a tuple, to support new Tokens API 2015-01-31 13:43:37 +11:00
Matthew Honnibal 77d62d0179 * Large refactor of Token objects, making them much thinner. This is to support fast parse-tree navigation. 2015-01-31 13:42:58 +11:00
Matthew Honnibal 88170e6295 * Supply dep_strings as a tuple, for the changed API on Tokens 2015-01-31 13:42:09 +11:00
Matthew Honnibal 0981d68022 * Set a sent_end flag during parsing, for later use 2015-01-31 13:41:46 +11:00
Matthew Honnibal 251dbf24d7 * Fix unintialised variable error 2015-01-30 20:46:34 +11:00
Matthew Honnibal 83a4df5a1a * Fix download script 2015-01-30 20:40:42 +11:00
Matthew Honnibal 6f9ebc2f34 * Fix download script 2015-01-30 20:33:19 +11:00
Matthew Honnibal 8b85d0bb8a * Only download small data if no data dir exists 2015-01-30 20:27:14 +11:00
Matthew Honnibal 1a7a1c2771 * Fix Issue #16: tokens recurse when printing 2015-01-30 19:47:50 +11:00
Matthew Honnibal cb95ef6934 * Fix download script 2015-01-30 19:28:43 +11:00
Matthew Honnibal e578bd37bd * Fix download script 2015-01-30 18:59:31 +11:00
Matthew Honnibal df52014d12 * Fix download script 2015-01-30 18:36:24 +11:00
Matthew Honnibal 0f95712189 * Improve accuracy reporting during training 2015-01-30 18:05:06 +11:00
Matthew Honnibal b68f563c2f * Fix Issue #14: Improve parsing API 2015-01-30 18:04:41 +11:00
Matthew Honnibal 998b607f65 * Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source 2015-01-30 18:04:01 +11:00
Matthew Honnibal 67d6e53a69 * Ensure parser and tagger function correctly when training from missing values, indicated by -1 2015-01-30 14:08:56 +11:00
Matthew Honnibal 4ff180db74 * Fix off-by-one error in commit 0a7fceb 2015-01-30 12:49:33 +11:00
Matthew Honnibal 0a7fcebdf7 * Fix Issue #12: Incorrect token.idx calculations for some punctuation, in the presence of token cache 2015-01-30 12:33:38 +11:00
Matthew Honnibal ebf7d2fab1 * Use non-joint sbd, for more simplicity and fewer classes 2015-01-29 06:22:03 +11:00
Matthew Honnibal d05c5bf141 * Remove comment 2015-01-29 05:19:27 +11:00
Matthew Honnibal 320b045daa * Oracle now consistent over gold standard derivation 2015-01-29 03:41:58 +11:00
Matthew Honnibal f590382134 * Work on sbd 2015-01-29 03:18:29 +11:00
Matthew Honnibal 1884a7a0be * Attach comment with paper 2015-01-28 03:18:43 +11:00
Matthew Honnibal a2d6b195db * Add messy Break transitions, carefully following the scheme of Dd Zhang et al (2013) 2015-01-28 03:09:45 +11:00
Matthew Honnibal f9ee5d9934 * Build a python list of word strings, for debugging 2015-01-28 01:06:13 +11:00
Matthew Honnibal d819101571 * Improve error message on oracle failure 2015-01-28 00:58:03 +11:00
Matthew Honnibal e6c3d3471f * Tweak documentation for Tokens, and hide constructor as __cinit__ 2015-01-27 18:57:52 +11:00
Matthew Honnibal c38c62d4a3 * Add docstring to English class 2015-01-27 02:45:21 +11:00
Matthew Honnibal d4c99f7dec * Add attrs.pxd 2015-01-26 22:22:09 +11:00
Matthew Honnibal d4a493855e * Fix error msg 2015-01-25 23:01:30 +11:00
Matthew Honnibal 7f87716cf7 * Fix download script 2015-01-25 23:01:10 +11:00
Matthew Honnibal 92fb9257dd * Add parts-of-speech file 2015-01-25 22:00:39 +11:00
Matthew Honnibal c1c3dba4cb * Check whether vector files are present before trying to load them. 2015-01-25 18:16:48 +11:00
Matthew Honnibal 5049d4c2e6 * Add parts_of_speech.pyx 2015-01-25 16:32:26 +11:00
Matthew Honnibal 12b034e3ef * Move POS tag definitions to parts_of_speech.pxd 2015-01-25 16:31:07 +11:00
Matthew Honnibal 7431c133d8 * Add error if try to access head and not is_parsed 2015-01-25 15:33:54 +11:00
Matthew Honnibal 951d06c824 * Silently don't parse if data is not present 2015-01-25 14:47:38 +11:00
Matthew Honnibal 4e857ab7a6 * Fix bug in POS tagger feature 2015-01-25 02:20:15 +11:00
Matthew Honnibal dd56e298e2 * Ensure tagging is applied if parse=True 2015-01-25 02:19:44 +11:00
Matthew Honnibal 94750819cd * Set parse=True by default --- i.e. parse unless told not to. 2015-01-25 01:28:28 +11:00
Matthew Honnibal 71b95202eb * Add docstring to StringStore 2015-01-24 20:49:15 +11:00
Matthew Honnibal 6d1c08dafd * Add docstring to Lexeme 2015-01-24 20:48:34 +11:00
Matthew Honnibal a97bed9359 * Fix POS and dependency label tag names. Add parse and string navigation functions. 2015-01-24 17:29:04 +11:00
Matthew Honnibal 76cd024095 * Add whitespace property to Token 2015-01-24 07:41:21 +11:00
Matthew Honnibal 5fd72bc220 * Have 'string' refer to the whitespace-padded string 2015-01-24 07:32:38 +11:00
Matthew Honnibal fda94271af * Rename NORM1 and NORM2 attrs to lower and norm 2015-01-24 06:17:03 +11:00
Matthew Honnibal 5ed8b2b98f * Rename sic to orth 2015-01-23 02:08:25 +11:00
Matthew Honnibal a27b23cc8f * Have SBD return start/end indices 2015-01-22 22:24:44 +11:00
Matthew Honnibal d460c28838 * Rename vec to repvec 2015-01-22 02:06:22 +11:00
Matthew Honnibal 8b9d913d97 * Rename vec to repvec 2015-01-22 02:05:58 +11:00
Matthew Honnibal 9cd0b6b3e9 * Various tweaks to Tokens class 2015-01-22 02:05:37 +11:00
Matthew Honnibal 5928d158ce * Pass the string to Tokens 2015-01-22 02:04:58 +11:00
Matthew Honnibal 45264e356b * Rename vec to repvec 2015-01-22 02:04:24 +11:00
Matthew Honnibal 5e63c606ad * Rename vec to repvec 2015-01-22 02:03:54 +11:00
Matthew Honnibal 56e6cf0672 * Add _string attr to Tokens object 2015-01-21 18:57:09 +11:00
Matthew Honnibal d6ac60e91c * Bug fixes to sentences method, and improved vector transport for tokens 2015-01-21 18:56:32 +11:00
Matthew Honnibal f2a229136c * Fix data_dir=None argument to English class 2015-01-21 18:27:31 +11:00
Matthew Honnibal ef49b8c179 * Add stop-word flag 2015-01-21 18:22:31 +11:00
Matthew Honnibal 6646bfc5df * Add LOWER attr 2015-01-21 18:19:08 +11:00
Matthew Honnibal f149259bf5 * Fix negative indices in tokens 2015-01-20 01:16:29 +11:00
Matthew Honnibal b65b0c07bf * Messily hook up vector in tokens 2015-01-19 19:59:55 +11:00
Matthew Honnibal 8ff5b8bd84 * Add attribute for POS scheme 2015-01-17 17:33:16 +11:00
Matthew Honnibal 6c7e44140b * Work on word vectors, and other stuff 2015-01-17 16:21:17 +11:00
Matthew Honnibal 802867e96a * Revise interface to Token. Strings now have attribute names like norm1_ 2015-01-15 03:51:47 +11:00
Matthew Honnibal 7d3c40de7d * Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme 2015-01-15 00:33:16 +11:00
Matthew Honnibal 0930892fc1 * Tmp. Working on refactor. Compiles, must hook up lexical feats. 2015-01-14 00:03:48 +11:00
Matthew Honnibal 46da3d74d2 * Tmp. Refactoring, introducing a Lexeme PyObject. 2015-01-12 11:23:44 +11:00
Matthew Honnibal ce2edd6312 * Tmp commit. Refactoring to create a Python Lexeme class. 2015-01-12 10:26:22 +11:00
Matthew Honnibal aacaf1a0f0 * Fix parser 2015-01-08 01:19:23 +11:00
Matthew Honnibal 9a21127bf7 * Fix parser, which was importing the wrong model 2015-01-08 00:10:15 +11:00
Matthew Honnibal 6a3e39cdd1 * Add typedefs.pyx 2015-01-06 04:51:40 +11:00
Matthew Honnibal a58920cc5e * Import orth.word_shape as a C module 2015-01-06 03:18:22 +11:00
Matthew Honnibal 6b68f7ef75 * Finally get string types right for orth function 2015-01-06 03:17:39 +11:00
Matthew Honnibal 90c143bd85 * Fix orth import 2015-01-05 18:49:19 +11:00
Matthew Honnibal 7689dccd0f * Remove unused import 2015-01-05 18:48:48 +11:00
Matthew Honnibal 3f1944d688 * Make PyPy work 2015-01-05 17:54:38 +11:00
Matthew Honnibal a510d9f677 * Another assertion removed 2015-01-05 13:01:40 +11:00
Matthew Honnibal 2856946a66 * Remove assertion that doesn't work on Python 3 2015-01-05 12:51:16 +11:00
Matthew Honnibal 94034f1112 * Fix encoding in lemmatization 2015-01-05 11:54:29 +11:00