Matthew Honnibal
|
2a0615104b
|
* Upd download script
|
2015-02-09 10:22:59 -05:00 |
Matthew Honnibal
|
5c3513583d
|
* Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens.
|
2015-02-09 03:57:10 -05:00 |
Matthew Honnibal
|
be5536d239
|
* Fix Issue #22: PRP and PRP$ were mapped to NOUN. Should be PRON.
|
2015-02-08 18:36:18 -05:00 |
Matthew Honnibal
|
0492cee8b4
|
* Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens
|
2015-02-08 18:30:30 -05:00 |
Matthew Honnibal
|
d229fbd228
|
* Give better error on out-of-bounds array access
|
2015-02-07 12:59:12 -05:00 |
Matthew Honnibal
|
ab8bb047d0
|
* Fix negative index for __getitem__
|
2015-02-07 12:58:46 -05:00 |
Matthew Honnibal
|
44c7eafe44
|
* Fix download.py
|
2015-02-07 12:00:36 -05:00 |
Matthew Honnibal
|
6ca7f2eedc
|
* Upd download script
|
2015-02-07 11:32:33 -05:00 |
Matthew Honnibal
|
f0e0588833
|
* Fill L2 norm attribute on LexemeC struct
|
2015-02-07 08:44:42 -05:00 |
Matthew Honnibal
|
75f9b7d6bf
|
* Add L2 norm field to LexemeC struct
|
2015-02-07 08:43:17 -05:00 |
Matthew Honnibal
|
51b618d646
|
* Add a has_repvec property to Lexeme, and a check function to check flags
|
2015-02-07 08:42:44 -05:00 |
Matthew Honnibal
|
321b402739
|
* Store the l2 norm of the word's vector
|
2015-02-07 08:42:16 -05:00 |
Matthew Honnibal
|
c7d8644149
|
* Fix regression on 'prob' attr of Token.
|
2015-02-03 03:32:18 +11:00 |
Matthew Honnibal
|
c55a33d045
|
* Catch oracle errors
|
2015-02-02 23:02:04 +11:00 |
Matthew Honnibal
|
de772088e6
|
* Use parse tree for sbd in Tokens.sents
|
2015-02-02 12:17:32 +11:00 |
Matthew Honnibal
|
56c2ef2982
|
* Tweak POS features for web text
|
2015-02-02 11:59:36 +11:00 |
Matthew Honnibal
|
d68678a93e
|
* Add Exception class, OracleError
|
2015-02-02 11:57:32 +11:00 |
Matthew Honnibal
|
a20fdbd8ee
|
* Upd download script
|
2015-02-01 13:22:23 +11:00 |
Matthew Honnibal
|
76d9394cb4
|
* Fix vocab.pyx for Python3
|
2015-02-01 13:14:04 +11:00 |
Matthew Honnibal
|
63abdf154c
|
* Hastily hack download file
|
2015-01-31 22:48:32 +11:00 |
Matthew Honnibal
|
7de00c5a79
|
* Try not holding a reference to Pool, since that seems to confuse the GC
|
2015-01-31 22:10:22 +11:00 |
Matthew Honnibal
|
ce3ae8b5d9
|
* Fix platform-specific lexicon bug.
|
2015-01-31 16:38:58 +11:00 |
Matthew Honnibal
|
a1ed574b7b
|
* Fix default model path for English
|
2015-01-31 16:38:27 +11:00 |
Matthew Honnibal
|
018e0bfa24
|
* Bug fixes to parse navigation
|
2015-01-31 16:37:13 +11:00 |
Matthew Honnibal
|
e013555b25
|
* Add option to download script
|
2015-01-31 13:51:56 +11:00 |
Matthew Honnibal
|
08ca5c8970
|
* Add sent_end flag to TokenC struct
|
2015-01-31 13:44:16 +11:00 |
Matthew Honnibal
|
024cfd485c
|
* Pass tag_strings as a tuple, to support new Tokens API
|
2015-01-31 13:43:37 +11:00 |
Matthew Honnibal
|
77d62d0179
|
* Large refactor of Token objects, making them much thinner. This is to support fast parse-tree navigation.
|
2015-01-31 13:42:58 +11:00 |
Matthew Honnibal
|
88170e6295
|
* Supply dep_strings as a tuple, for the changed API on Tokens
|
2015-01-31 13:42:09 +11:00 |
Matthew Honnibal
|
0981d68022
|
* Set a sent_end flag during parsing, for later use
|
2015-01-31 13:41:46 +11:00 |
Matthew Honnibal
|
251dbf24d7
|
* Fix uninitialised variable error
|
2015-01-30 20:46:34 +11:00 |
Matthew Honnibal
|
83a4df5a1a
|
* Fix download script
|
2015-01-30 20:40:42 +11:00 |
Matthew Honnibal
|
6f9ebc2f34
|
* Fix download script
|
2015-01-30 20:33:19 +11:00 |
Matthew Honnibal
|
8b85d0bb8a
|
* Only download small data if no data dir exists
|
2015-01-30 20:27:14 +11:00 |
Matthew Honnibal
|
1a7a1c2771
|
* Fix Issue #16: tokens recurse when printing
|
2015-01-30 19:47:50 +11:00 |
Matthew Honnibal
|
cb95ef6934
|
* Fix download script
|
2015-01-30 19:28:43 +11:00 |
Matthew Honnibal
|
e578bd37bd
|
* Fix download script
|
2015-01-30 18:59:31 +11:00 |
Matthew Honnibal
|
df52014d12
|
* Fix download script
|
2015-01-30 18:36:24 +11:00 |
Matthew Honnibal
|
0f95712189
|
* Improve accuracy reporting during training
|
2015-01-30 18:05:06 +11:00 |
Matthew Honnibal
|
b68f563c2f
|
* Fix Issue #14: Improve parsing API
|
2015-01-30 18:04:41 +11:00 |
Matthew Honnibal
|
998b607f65
|
* Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source
|
2015-01-30 18:04:01 +11:00 |
Matthew Honnibal
|
67d6e53a69
|
* Ensure parser and tagger function correctly when training from missing values, indicated by -1
|
2015-01-30 14:08:56 +11:00 |
Matthew Honnibal
|
4ff180db74
|
* Fix off-by-one error in commit 0a7fceb
|
2015-01-30 12:49:33 +11:00 |
Matthew Honnibal
|
0a7fcebdf7
|
* Fix Issue #12: Incorrect token.idx calculations for some punctuation, in the presence of token cache
|
2015-01-30 12:33:38 +11:00 |
Matthew Honnibal
|
ebf7d2fab1
|
* Use non-joint sbd, for more simplicity and fewer classes
|
2015-01-29 06:22:03 +11:00 |
Matthew Honnibal
|
d05c5bf141
|
* Remove comment
|
2015-01-29 05:19:27 +11:00 |
Matthew Honnibal
|
320b045daa
|
* Oracle now consistent over gold standard derivation
|
2015-01-29 03:41:58 +11:00 |
Matthew Honnibal
|
f590382134
|
* Work on sbd
|
2015-01-29 03:18:29 +11:00 |
Matthew Honnibal
|
1884a7a0be
|
* Attach comment with paper
|
2015-01-28 03:18:43 +11:00 |
Matthew Honnibal
|
a2d6b195db
|
* Add messy Break transitions, carefully following the scheme of Dd Zhang et al (2013)
|
2015-01-28 03:09:45 +11:00 |
Matthew Honnibal
|
f9ee5d9934
|
* Build a python list of word strings, for debugging
|
2015-01-28 01:06:13 +11:00 |
Matthew Honnibal
|
d819101571
|
* Improve error message on oracle failure
|
2015-01-28 00:58:03 +11:00 |
Matthew Honnibal
|
e6c3d3471f
|
* Tweak documentation for Tokens, and hide constructor as __cinit__
|
2015-01-27 18:57:52 +11:00 |
Matthew Honnibal
|
c38c62d4a3
|
* Add docstring to English class
|
2015-01-27 02:45:21 +11:00 |
Matthew Honnibal
|
d4c99f7dec
|
* Add attrs.pxd
|
2015-01-26 22:22:09 +11:00 |
Matthew Honnibal
|
d4a493855e
|
* Fix error msg
|
2015-01-25 23:01:30 +11:00 |
Matthew Honnibal
|
7f87716cf7
|
* Fix download script
|
2015-01-25 23:01:10 +11:00 |
Matthew Honnibal
|
92fb9257dd
|
* Add parts-of-speech file
|
2015-01-25 22:00:39 +11:00 |
Matthew Honnibal
|
c1c3dba4cb
|
* Check whether vector files are present before trying to load them.
|
2015-01-25 18:16:48 +11:00 |
Matthew Honnibal
|
5049d4c2e6
|
* Add parts_of_speech.pyx
|
2015-01-25 16:32:26 +11:00 |
Matthew Honnibal
|
12b034e3ef
|
* Move POS tag definitions to parts_of_speech.pxd
|
2015-01-25 16:31:07 +11:00 |
Matthew Honnibal
|
7431c133d8
|
* Add error if try to access head and not is_parsed
|
2015-01-25 15:33:54 +11:00 |
Matthew Honnibal
|
951d06c824
|
* Silently don't parse if data is not present
|
2015-01-25 14:47:38 +11:00 |
Matthew Honnibal
|
4e857ab7a6
|
* Fix bug in POS tagger feature
|
2015-01-25 02:20:15 +11:00 |
Matthew Honnibal
|
dd56e298e2
|
* Ensure tagging is applied if parse=True
|
2015-01-25 02:19:44 +11:00 |
Matthew Honnibal
|
94750819cd
|
* Set parse=True by default --- i.e. parse unless told not to.
|
2015-01-25 01:28:28 +11:00 |
Matthew Honnibal
|
71b95202eb
|
* Add docstring to StringStore
|
2015-01-24 20:49:15 +11:00 |
Matthew Honnibal
|
6d1c08dafd
|
* Add docstring to Lexeme
|
2015-01-24 20:48:34 +11:00 |
Matthew Honnibal
|
a97bed9359
|
* Fix POS and dependency label tag names. Add parse and string navigation functions.
|
2015-01-24 17:29:04 +11:00 |
Matthew Honnibal
|
76cd024095
|
* Add whitespace property to Token
|
2015-01-24 07:41:21 +11:00 |
Matthew Honnibal
|
5fd72bc220
|
* Have 'string' refer to the whitespace-padded string
|
2015-01-24 07:32:38 +11:00 |
Matthew Honnibal
|
fda94271af
|
* Rename NORM1 and NORM2 attrs to lower and norm
|
2015-01-24 06:17:03 +11:00 |
Matthew Honnibal
|
5ed8b2b98f
|
* Rename sic to orth
|
2015-01-23 02:08:25 +11:00 |
Matthew Honnibal
|
a27b23cc8f
|
* Have SBD return start/end indices
|
2015-01-22 22:24:44 +11:00 |
Matthew Honnibal
|
d460c28838
|
* Rename vec to repvec
|
2015-01-22 02:06:22 +11:00 |
Matthew Honnibal
|
8b9d913d97
|
* Rename vec to repvec
|
2015-01-22 02:05:58 +11:00 |
Matthew Honnibal
|
9cd0b6b3e9
|
* Various tweaks to Tokens class
|
2015-01-22 02:05:37 +11:00 |
Matthew Honnibal
|
5928d158ce
|
* Pass the string to Tokens
|
2015-01-22 02:04:58 +11:00 |
Matthew Honnibal
|
45264e356b
|
* Rename vec to repvec
|
2015-01-22 02:04:24 +11:00 |
Matthew Honnibal
|
5e63c606ad
|
* Rename vec to repvec
|
2015-01-22 02:03:54 +11:00 |
Matthew Honnibal
|
56e6cf0672
|
* Add _string attr to Tokens object
|
2015-01-21 18:57:09 +11:00 |
Matthew Honnibal
|
d6ac60e91c
|
* Bug fixes to sentences method, and improved vector transport for tokens
|
2015-01-21 18:56:32 +11:00 |
Matthew Honnibal
|
f2a229136c
|
* Fix data_dir=None argument to English class
|
2015-01-21 18:27:31 +11:00 |
Matthew Honnibal
|
ef49b8c179
|
* Add stop-word flag
|
2015-01-21 18:22:31 +11:00 |
Matthew Honnibal
|
6646bfc5df
|
* Add LOWER attr
|
2015-01-21 18:19:08 +11:00 |
Matthew Honnibal
|
f149259bf5
|
* Fix negative indices in tokens
|
2015-01-20 01:16:29 +11:00 |
Matthew Honnibal
|
b65b0c07bf
|
* Messily hook up vector in tokens
|
2015-01-19 19:59:55 +11:00 |
Matthew Honnibal
|
8ff5b8bd84
|
* Add attribute for POS scheme
|
2015-01-17 17:33:16 +11:00 |
Matthew Honnibal
|
6c7e44140b
|
* Work on word vectors, and other stuff
|
2015-01-17 16:21:17 +11:00 |
Matthew Honnibal
|
802867e96a
|
* Revise interface to Token. Strings now have attribute names like norm1_
|
2015-01-15 03:51:47 +11:00 |
Matthew Honnibal
|
7d3c40de7d
|
* Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme
|
2015-01-15 00:33:16 +11:00 |
Matthew Honnibal
|
0930892fc1
|
* Tmp. Working on refactor. Compiles, must hook up lexical feats.
|
2015-01-14 00:03:48 +11:00 |
Matthew Honnibal
|
46da3d74d2
|
* Tmp. Refactoring, introducing a Lexeme PyObject.
|
2015-01-12 11:23:44 +11:00 |
Matthew Honnibal
|
ce2edd6312
|
* Tmp commit. Refactoring to create a Python Lexeme class.
|
2015-01-12 10:26:22 +11:00 |
Matthew Honnibal
|
aacaf1a0f0
|
* Fix parser
|
2015-01-08 01:19:23 +11:00 |
Matthew Honnibal
|
9a21127bf7
|
* Fix parser, which was importing the wrong model
|
2015-01-08 00:10:15 +11:00 |
Matthew Honnibal
|
6a3e39cdd1
|
* Add typedefs.pyx
|
2015-01-06 04:51:40 +11:00 |
Matthew Honnibal
|
a58920cc5e
|
* Import orth.word_shape as a C module
|
2015-01-06 03:18:22 +11:00 |
Matthew Honnibal
|
6b68f7ef75
|
* Finally get string types right for orth function
|
2015-01-06 03:17:39 +11:00 |
Matthew Honnibal
|
90c143bd85
|
* Fix orth import
|
2015-01-05 18:49:19 +11:00 |
Matthew Honnibal
|
7689dccd0f
|
* Remove unused import
|
2015-01-05 18:48:48 +11:00 |
Matthew Honnibal
|
3f1944d688
|
* Make PyPy work
|
2015-01-05 17:54:38 +11:00 |
Matthew Honnibal
|
a510d9f677
|
* Another assertion removed
|
2015-01-05 13:01:40 +11:00 |
Matthew Honnibal
|
2856946a66
|
* Remove assertion that doesn't work on Python 3
|
2015-01-05 12:51:16 +11:00 |
Matthew Honnibal
|
94034f1112
|
* Fix encoding in lemmatization
|
2015-01-05 11:54:29 +11:00 |
Matthew Honnibal
|
b132b3caa6
|
* Fix unicode error in lemmatizer
|
2015-01-05 11:53:54 +11:00 |
Matthew Honnibal
|
477e7fbffe
|
* Fix data reading for lemmatizer
|
2015-01-05 06:01:32 +11:00 |
Matthew Honnibal
|
58f75abaca
|
* Fix unicode error in orth
|
2015-01-05 05:53:08 +11:00 |
Matthew Honnibal
|
4e085d5166
|
* Fix lemmatizer for Python3
|
2015-01-05 05:51:26 +11:00 |
Matthew Honnibal
|
ae7c811fd1
|
* Use Exception instead of StandardError
|
2015-01-04 01:22:12 +11:00 |
Matthew Honnibal
|
0e4c2ba036
|
* Fix loading of special morph words
|
2015-01-03 23:13:00 +11:00 |
Matthew Honnibal
|
f5d41028b5
|
* Move around data files for test release
|
2015-01-03 01:59:22 +11:00 |
Matthew Honnibal
|
a24321b63a
|
* Add downloader
|
2015-01-02 21:44:41 +11:00 |
Matthew Honnibal
|
5d9a096e2f
|
* Some minor clean-up after HastyModel
|
2014-12-31 19:46:04 +11:00 |
Matthew Honnibal
|
aafaf58cbe
|
* Refactor _ml.Model, and finish implementing HastyModel. So far not worthwhile.
|
2014-12-31 19:40:59 +11:00 |
Matthew Honnibal
|
bcd038e7b6
|
* Implement HastyModel
|
2014-12-31 01:16:47 +11:00 |
Matthew Honnibal
|
1a075f77ff
|
* Don't over-ride pre-loaded POS tags, if set by special-cases
|
2014-12-30 23:26:32 +11:00 |
Matthew Honnibal
|
785c7ba76a
|
* Embed signature on attrs
|
2014-12-30 23:25:31 +11:00 |
Matthew Honnibal
|
30e5805656
|
* Lazy-load tagger and parser
|
2014-12-30 23:25:09 +11:00 |
Matthew Honnibal
|
9976aa976e
|
* Messily fix morphology and POS tags on special tokens.
|
2014-12-30 23:24:37 +11:00 |
Matthew Honnibal
|
c1ef3febee
|
* Embedsignature in tokens.pyx
|
2014-12-30 21:22:00 +11:00 |
Matthew Honnibal
|
aac5028b6e
|
* Move tagger to _ml
|
2014-12-30 21:21:38 +11:00 |
Matthew Honnibal
|
1ffb0229ed
|
* Import tokens in parser.pxd
|
2014-12-30 21:21:17 +11:00 |
Matthew Honnibal
|
bb0b00f819
|
* Repurpose the Tagger class as a generic Model, wrapping thinc's interface
|
2014-12-30 21:20:15 +11:00 |
Matthew Honnibal
|
fe2a5e0370
|
* Work on docstrings
|
2014-12-27 21:46:04 +11:00 |
Matthew Honnibal
|
bb80937544
|
* Upd docstrings
|
2014-12-27 18:45:16 +11:00 |
Matthew Honnibal
|
b8b65903fc
|
* Tmp
|
2014-12-24 17:42:00 +11:00 |
Matthew Honnibal
|
ab61673edd
|
* Fix api of array method
|
2014-12-23 15:18:48 +11:00 |
Matthew Honnibal
|
7708d0e24a
|
* Move lemmatizer to en dir
|
2014-12-23 15:16:57 +11:00 |
Matthew Honnibal
|
98eb4c0426
|
* Fix path to parser model
|
2014-12-23 15:09:09 +11:00 |
Matthew Honnibal
|
b00bc01d8c
|
* All tests now passing for reorg
|
2014-12-23 13:18:59 +11:00 |
Matthew Honnibal
|
73f200436f
|
* Tests passing except for morphology/lemmatization stuff
|
2014-12-23 11:40:32 +11:00 |
Matthew Honnibal
|
cf8d26c3d2
|
* POS tagger training working after reorg
|
2014-12-22 08:54:47 +11:00 |
Matthew Honnibal
|
4c4aa2c5c9
|
* Work on train
|
2014-12-22 07:25:43 +11:00 |
Matthew Honnibal
|
61df50b598
|
* Add English-subclass POS tagger
|
2014-12-21 20:59:07 +11:00 |
Matthew Honnibal
|
9f3f07cab6
|
* Add attrs file for English
|
2014-12-21 11:29:11 +11:00 |
Matthew Honnibal
|
2a89d70429
|
* Add vocab.pyx to setup, and ensure we can import spacy.en.lang
|
2014-12-21 06:03:53 +11:00 |
Matthew Honnibal
|
b34a1325d3
|
* Everything compiling after reorg. About to start testing.
|
2014-12-21 05:42:23 +11:00 |
Matthew Honnibal
|
e1c1a4b868
|
* Tmp
|
2014-12-21 05:36:29 +11:00 |
Matthew Honnibal
|
d11c1edf8c
|
* Import slice_unicode from strings.pyx
|
2014-12-20 07:56:26 +11:00 |
Matthew Honnibal
|
be1bdcbd85
|
* Move lang.pyx to tokenizer.pyx
|
2014-12-20 07:55:40 +11:00 |
Matthew Honnibal
|
89a1cc1a48
|
* Move murmurhash to .pxd in strings file
|
2014-12-20 07:41:08 +11:00 |
Matthew Honnibal
|
d5a942c4a4
|
* Rename lang.pyx to tokenizer.pyx
|
2014-12-20 07:30:39 +11:00 |
Matthew Honnibal
|
a60ae261ae
|
* Move tokenizer to its own file, and refactor
|
2014-12-20 07:29:16 +11:00 |
Matthew Honnibal
|
867a4a000c
|
* Export set_morph_from_dict function
|
2014-12-20 07:28:27 +11:00 |
Matthew Honnibal
|
4e30195c6d
|
* Refactor morphology.pyx
|
2014-12-20 07:27:28 +11:00 |
Matthew Honnibal
|
4c6ce7ee84
|
* Update tokens.pyx as part of reorg
|
2014-12-20 07:03:26 +11:00 |
Matthew Honnibal
|
116f7f3bc1
|
* Rename Lexicon to Vocab, and move it to its own file
|
2014-12-20 06:54:03 +11:00 |
Matthew Honnibal
|
780cbd68b1
|
* Move all struct definitions to structs.pxd, to avoid circular dependencies
|
2014-12-20 06:51:33 +11:00 |
Matthew Honnibal
|
f6556d8e5d
|
* Refactor, move Lexeme struct to structs.pxd
|
2014-12-20 06:51:03 +11:00 |