spaCy

Commit Graph

Author	SHA1	Message	Date
Matthew Honnibal	2a0615104b	* Upd download script	2015-02-09 10:22:59 -05:00
Matthew Honnibal	5c3513583d	* Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens.	2015-02-09 03:57:10 -05:00
Matthew Honnibal	be5536d239	* Fix Issue #22 : PRP and PRP$ were mapped to NOUN. Should be PRON.	2015-02-08 18:36:18 -05:00
Matthew Honnibal	0492cee8b4	* Fix Issue #24 : Lemmas are empty when the L field is missing for special-cased tokens	2015-02-08 18:30:30 -05:00
Matthew Honnibal	d229fbd228	* Give better error on out-of-bounds array access	2015-02-07 12:59:12 -05:00
Matthew Honnibal	ab8bb047d0	* Fix negative index for __getitem__	2015-02-07 12:58:46 -05:00
Matthew Honnibal	44c7eafe44	* Fix download.py	2015-02-07 12:00:36 -05:00
Matthew Honnibal	6ca7f2eedc	* Upd download script	2015-02-07 11:32:33 -05:00
Matthew Honnibal	f0e0588833	* Fill L2 norm attribute on LexemeC struct	2015-02-07 08:44:42 -05:00
Matthew Honnibal	75f9b7d6bf	* Add L2 norm field to LexemeC struct	2015-02-07 08:43:17 -05:00
Matthew Honnibal	51b618d646	* Add a has_repvec property to Lexeme, and a check function to check flags	2015-02-07 08:42:44 -05:00
Matthew Honnibal	321b402739	* Store the l2 norm of the word's vector	2015-02-07 08:42:16 -05:00
Matthew Honnibal	c7d8644149	* Fix regression on 'prob' attr of Token.	2015-02-03 03:32:18 +11:00
Matthew Honnibal	c55a33d045	* Catch oracle errors	2015-02-02 23:02:04 +11:00
Matthew Honnibal	de772088e6	* Use parse tree for sbd in Tokens.sents	2015-02-02 12:17:32 +11:00
Matthew Honnibal	56c2ef2982	* Tweak POS features for web text	2015-02-02 11:59:36 +11:00
Matthew Honnibal	d68678a93e	* Add Exception class, OracleError	2015-02-02 11:57:32 +11:00
Matthew Honnibal	a20fdbd8ee	* Upd download script	2015-02-01 13:22:23 +11:00
Matthew Honnibal	76d9394cb4	* Fix vocab.pyx for Python3	2015-02-01 13:14:04 +11:00
Matthew Honnibal	63abdf154c	* Hastily hack download file	2015-01-31 22:48:32 +11:00
Matthew Honnibal	7de00c5a79	* Try not holding a reference to Pool, since that seems to confuse the GC	2015-01-31 22:10:22 +11:00
Matthew Honnibal	ce3ae8b5d9	* Fix platform-specific lexicon bug.	2015-01-31 16:38:58 +11:00
Matthew Honnibal	a1ed574b7b	* Fix default model path for English	2015-01-31 16:38:27 +11:00
Matthew Honnibal	018e0bfa24	* Bug fixes to parse navigation	2015-01-31 16:37:13 +11:00
Matthew Honnibal	e013555b25	* Add option to download script	2015-01-31 13:51:56 +11:00
Matthew Honnibal	08ca5c8970	* Add sent_end flag to TokenC struct	2015-01-31 13:44:16 +11:00
Matthew Honnibal	024cfd485c	* Pass tag_strings as a tuple, to support new Tokens API	2015-01-31 13:43:37 +11:00
Matthew Honnibal	77d62d0179	* Large refactor of Token objects, making them much thinner. This is to support fast parse-tree navigation.	2015-01-31 13:42:58 +11:00
Matthew Honnibal	88170e6295	* Supply dep_strings as a tuple, for the changed API on Tokens	2015-01-31 13:42:09 +11:00
Matthew Honnibal	0981d68022	* Set a sent_end flag during parsing, for later use	2015-01-31 13:41:46 +11:00
Matthew Honnibal	251dbf24d7	* Fix unintialised variable error	2015-01-30 20:46:34 +11:00
Matthew Honnibal	83a4df5a1a	* Fix download script	2015-01-30 20:40:42 +11:00
Matthew Honnibal	6f9ebc2f34	* Fix download script	2015-01-30 20:33:19 +11:00
Matthew Honnibal	8b85d0bb8a	* Only download small data if no data dir exists	2015-01-30 20:27:14 +11:00
Matthew Honnibal	1a7a1c2771	* Fix Issue #16 : tokens recurse when printing	2015-01-30 19:47:50 +11:00
Matthew Honnibal	cb95ef6934	* Fix download script	2015-01-30 19:28:43 +11:00
Matthew Honnibal	e578bd37bd	* Fix download script	2015-01-30 18:59:31 +11:00
Matthew Honnibal	df52014d12	* Fix download script	2015-01-30 18:36:24 +11:00
Matthew Honnibal	0f95712189	* Improve accuracy reporting during training	2015-01-30 18:05:06 +11:00
Matthew Honnibal	b68f563c2f	* Fix Issue #14 : Improve parsing API	2015-01-30 18:04:41 +11:00
Matthew Honnibal	998b607f65	* Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source	2015-01-30 18:04:01 +11:00
Matthew Honnibal	67d6e53a69	* Ensure parser and tagger function correctly when training from missing values, indicated by -1	2015-01-30 14:08:56 +11:00
Matthew Honnibal	4ff180db74	* Fix off-by-one error in commit `0a7fceb`	2015-01-30 12:49:33 +11:00
Matthew Honnibal	0a7fcebdf7	* Fix Issue #12 : Incorrect token.idx calculations for some punctuation, in the presence of token cache	2015-01-30 12:33:38 +11:00
Matthew Honnibal	ebf7d2fab1	* Use non-joint sbd, for more simplicity and fewer classes	2015-01-29 06:22:03 +11:00
Matthew Honnibal	d05c5bf141	* Remove comment	2015-01-29 05:19:27 +11:00
Matthew Honnibal	320b045daa	* Oracle now consistent over gold standard derivation	2015-01-29 03:41:58 +11:00
Matthew Honnibal	f590382134	* Work on sbd	2015-01-29 03:18:29 +11:00
Matthew Honnibal	1884a7a0be	* Attach comment with paper	2015-01-28 03:18:43 +11:00
Matthew Honnibal	a2d6b195db	* Add messy Break transitions, carefully following the scheme of Dd Zhang et al (2013)	2015-01-28 03:09:45 +11:00
Matthew Honnibal	f9ee5d9934	* Build a python list of word strings, for debugging	2015-01-28 01:06:13 +11:00
Matthew Honnibal	d819101571	* Improve error message on oracle failure	2015-01-28 00:58:03 +11:00
Matthew Honnibal	e6c3d3471f	* Tweak documentation for Tokens, and hide constructor as __cinit__	2015-01-27 18:57:52 +11:00
Matthew Honnibal	c38c62d4a3	* Add docstring to English class	2015-01-27 02:45:21 +11:00
Matthew Honnibal	d4c99f7dec	* Add attrs.pxd	2015-01-26 22:22:09 +11:00
Matthew Honnibal	d4a493855e	* Fix error msg	2015-01-25 23:01:30 +11:00
Matthew Honnibal	7f87716cf7	* Fix download script	2015-01-25 23:01:10 +11:00
Matthew Honnibal	92fb9257dd	* Add parts-of-speech file	2015-01-25 22:00:39 +11:00
Matthew Honnibal	c1c3dba4cb	* Check whether vector files are present before trying to load them.	2015-01-25 18:16:48 +11:00
Matthew Honnibal	5049d4c2e6	* Add parts_of_speech.pyx	2015-01-25 16:32:26 +11:00
Matthew Honnibal	12b034e3ef	* Move POS tag definitions to parts_of_speech.pxd	2015-01-25 16:31:07 +11:00
Matthew Honnibal	7431c133d8	* Add error if try to access head and not is_parsed	2015-01-25 15:33:54 +11:00
Matthew Honnibal	951d06c824	* Silently don't parse if data is not present	2015-01-25 14:47:38 +11:00
Matthew Honnibal	4e857ab7a6	* Fix bug in POS tagger feature	2015-01-25 02:20:15 +11:00
Matthew Honnibal	dd56e298e2	* Ensure tagging is applied if parse=True	2015-01-25 02:19:44 +11:00
Matthew Honnibal	94750819cd	* Set parse=True by default --- i.e. parse unless told not to.	2015-01-25 01:28:28 +11:00
Matthew Honnibal	71b95202eb	* Add docstring to StringStore	2015-01-24 20:49:15 +11:00
Matthew Honnibal	6d1c08dafd	* Add docstring to Lexeme	2015-01-24 20:48:34 +11:00
Matthew Honnibal	a97bed9359	* Fix POS and dependency label tag names. Add parse and string navigation functions.	2015-01-24 17:29:04 +11:00
Matthew Honnibal	76cd024095	* Add whitespace property to Token	2015-01-24 07:41:21 +11:00
Matthew Honnibal	5fd72bc220	* Have 'string' refer to the whitespace-padded string	2015-01-24 07:32:38 +11:00
Matthew Honnibal	fda94271af	* Rename NORM1 and NORM2 attrs to lower and norm	2015-01-24 06:17:03 +11:00
Matthew Honnibal	5ed8b2b98f	* Rename sic to orth	2015-01-23 02:08:25 +11:00
Matthew Honnibal	a27b23cc8f	* Have SBD return start/end indices	2015-01-22 22:24:44 +11:00
Matthew Honnibal	d460c28838	* Rename vec to repvec	2015-01-22 02:06:22 +11:00
Matthew Honnibal	8b9d913d97	* Rename vec to repvec	2015-01-22 02:05:58 +11:00
Matthew Honnibal	9cd0b6b3e9	* Various tweaks to Tokens class	2015-01-22 02:05:37 +11:00
Matthew Honnibal	5928d158ce	* Pass the string to Tokens	2015-01-22 02:04:58 +11:00
Matthew Honnibal	45264e356b	* Rename vec to repvec	2015-01-22 02:04:24 +11:00
Matthew Honnibal	5e63c606ad	* Rename vec to repvec	2015-01-22 02:03:54 +11:00
Matthew Honnibal	56e6cf0672	* Add _string attr to Tokens object	2015-01-21 18:57:09 +11:00
Matthew Honnibal	d6ac60e91c	* Bug fixes to sentences method, and improved vector transport for tokens	2015-01-21 18:56:32 +11:00
Matthew Honnibal	f2a229136c	* Fix data_dir=None argument to English class	2015-01-21 18:27:31 +11:00
Matthew Honnibal	ef49b8c179	* Add stop-word flag	2015-01-21 18:22:31 +11:00
Matthew Honnibal	6646bfc5df	* Add LOWER attr	2015-01-21 18:19:08 +11:00
Matthew Honnibal	f149259bf5	* Fix negative indices in tokens	2015-01-20 01:16:29 +11:00
Matthew Honnibal	b65b0c07bf	* Messily hook up vector in tokens	2015-01-19 19:59:55 +11:00
Matthew Honnibal	8ff5b8bd84	* Add attribute for POS scheme	2015-01-17 17:33:16 +11:00
Matthew Honnibal	6c7e44140b	* Work on word vectors, and other stuff	2015-01-17 16:21:17 +11:00
Matthew Honnibal	802867e96a	* Revise interface to Token. Strings now have attribute names like norm1_	2015-01-15 03:51:47 +11:00
Matthew Honnibal	7d3c40de7d	* Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme	2015-01-15 00:33:16 +11:00
Matthew Honnibal	0930892fc1	* Tmp. Working on refactor. Compiles, must hook up lexical feats.	2015-01-14 00:03:48 +11:00
Matthew Honnibal	46da3d74d2	* Tmp. Refactoring, introducing a Lexeme PyObject.	2015-01-12 11:23:44 +11:00
Matthew Honnibal	ce2edd6312	* Tmp commit. Refactoring to create a Python Lexeme class.	2015-01-12 10:26:22 +11:00
Matthew Honnibal	aacaf1a0f0	* Fix parser	2015-01-08 01:19:23 +11:00
Matthew Honnibal	9a21127bf7	* Fix parser, which was importing the wrong model	2015-01-08 00:10:15 +11:00
Matthew Honnibal	6a3e39cdd1	* Add typedefs.pyx	2015-01-06 04:51:40 +11:00
Matthew Honnibal	a58920cc5e	* Import orth.word_shape as a C module	2015-01-06 03:18:22 +11:00
Matthew Honnibal	6b68f7ef75	* Finally get string types right for orth function	2015-01-06 03:17:39 +11:00
Matthew Honnibal	90c143bd85	* Fix orth import	2015-01-05 18:49:19 +11:00
Matthew Honnibal	7689dccd0f	* Remove unused import	2015-01-05 18:48:48 +11:00
Matthew Honnibal	3f1944d688	* Make PyPy work	2015-01-05 17:54:38 +11:00
Matthew Honnibal	a510d9f677	* Another assertion removed	2015-01-05 13:01:40 +11:00
Matthew Honnibal	2856946a66	* Remove assertion that doesn't work on Python 3	2015-01-05 12:51:16 +11:00
Matthew Honnibal	94034f1112	* Fix encoding in lemmatization	2015-01-05 11:54:29 +11:00
Matthew Honnibal	b132b3caa6	* Fix unicode error in lemmatizer	2015-01-05 11:53:54 +11:00
Matthew Honnibal	477e7fbffe	* Fix data reading for lemmatizer	2015-01-05 06:01:32 +11:00
Matthew Honnibal	58f75abaca	* Fix unicode error in orth	2015-01-05 05:53:08 +11:00
Matthew Honnibal	4e085d5166	* Fix lemmatizer for Python3	2015-01-05 05:51:26 +11:00
Matthew Honnibal	ae7c811fd1	* Use Exception instead of StandardError	2015-01-04 01:22:12 +11:00
Matthew Honnibal	0e4c2ba036	* Fix loading of special morph words	2015-01-03 23:13:00 +11:00
Matthew Honnibal	f5d41028b5	* Move around data files for test release	2015-01-03 01:59:22 +11:00
Matthew Honnibal	a24321b63a	* Add downloader	2015-01-02 21:44:41 +11:00
Matthew Honnibal	5d9a096e2f	* Some minor clean-up after HastyModel	2014-12-31 19:46:04 +11:00
Matthew Honnibal	aafaf58cbe	* Refactor _ml.Model, and finish implementing HastyModel so far not worthwhile.	2014-12-31 19:40:59 +11:00
Matthew Honnibal	bcd038e7b6	* Implement HastyModel	2014-12-31 01:16:47 +11:00
Matthew Honnibal	1a075f77ff	* Don't over-ride pre-loaded POS tags, if set by special-cases	2014-12-30 23:26:32 +11:00
Matthew Honnibal	785c7ba76a	* Embed signature on attrs	2014-12-30 23:25:31 +11:00
Matthew Honnibal	30e5805656	* Lazy-load tagger and parser	2014-12-30 23:25:09 +11:00
Matthew Honnibal	9976aa976e	* Messily fix morphology and POS tags on special tokens.	2014-12-30 23:24:37 +11:00
Matthew Honnibal	c1ef3febee	* Embedsignature in tokens.pyx	2014-12-30 21:22:00 +11:00
Matthew Honnibal	aac5028b6e	* Move tagger to _ml	2014-12-30 21:21:38 +11:00
Matthew Honnibal	1ffb0229ed	* Import tokens in parser.pxd	2014-12-30 21:21:17 +11:00
Matthew Honnibal	bb0b00f819	* Repurporse the Tagger class as a generic Model, wrapping thinc's interface	2014-12-30 21:20:15 +11:00
Matthew Honnibal	fe2a5e0370	* Work on docstrings	2014-12-27 21:46:04 +11:00
Matthew Honnibal	bb80937544	* Upd docstrings	2014-12-27 18:45:16 +11:00
Matthew Honnibal	b8b65903fc	* Tmp	2014-12-24 17:42:00 +11:00
Matthew Honnibal	ab61673edd	* Fix api of array method	2014-12-23 15:18:48 +11:00
Matthew Honnibal	7708d0e24a	* Move lemmatizer to en dir	2014-12-23 15:16:57 +11:00
Matthew Honnibal	98eb4c0426	* Fix path to parser model	2014-12-23 15:09:09 +11:00
Matthew Honnibal	b00bc01d8c	* All tests now passing for reorg	2014-12-23 13:18:59 +11:00
Matthew Honnibal	73f200436f	* Tests passing except for morphology/lemmatization stuff	2014-12-23 11:40:32 +11:00
Matthew Honnibal	cf8d26c3d2	* POS tagger training working after reorg	2014-12-22 08:54:47 +11:00
Matthew Honnibal	4c4aa2c5c9	* Work on train	2014-12-22 07:25:43 +11:00
Matthew Honnibal	61df50b598	* Add English-subclass POS tagger	2014-12-21 20:59:07 +11:00
Matthew Honnibal	9f3f07cab6	* Add attrs file for English	2014-12-21 11:29:11 +11:00
Matthew Honnibal	2a89d70429	* Add vocab.pyx to setup, and ensure we can import spacy.en.lang	2014-12-21 06:03:53 +11:00
Matthew Honnibal	b34a1325d3	* Everything compiling after reorg. About to start testing.	2014-12-21 05:42:23 +11:00
Matthew Honnibal	e1c1a4b868	* Tmp	2014-12-21 05:36:29 +11:00
Matthew Honnibal	d11c1edf8c	* Import slice_unicode from strings.pyx	2014-12-20 07:56:26 +11:00
Matthew Honnibal	be1bdcbd85	* Move lang.pyx to tokenizer.pyx	2014-12-20 07:55:40 +11:00
Matthew Honnibal	89a1cc1a48	* Move murmurhash to .pxd in strings file	2014-12-20 07:41:08 +11:00
Matthew Honnibal	d5a942c4a4	* Rename lang.pyx to tokenizer.pyx	2014-12-20 07:30:39 +11:00
Matthew Honnibal	a60ae261ae	* Move tokenizer to its own file, and refactor	2014-12-20 07:29:16 +11:00
Matthew Honnibal	867a4a000c	* Export set_morph_from_dict function	2014-12-20 07:28:27 +11:00
Matthew Honnibal	4e30195c6d	* Refactor morphology.pyx	2014-12-20 07:27:28 +11:00
Matthew Honnibal	4c6ce7ee84	* Update tokens.pyx as part of reorg	2014-12-20 07:03:26 +11:00
Matthew Honnibal	116f7f3bc1	* Rename Lexicon to Vocab, and move it to its own file	2014-12-20 06:54:03 +11:00
Matthew Honnibal	780cbd68b1	* Move all struct definitions to structs.pxd, to avoid circular dependencies	2014-12-20 06:51:33 +11:00
Matthew Honnibal	f6556d8e5d	* Refactor, move Lexeme struct to structs.pxd	2014-12-20 06:51:03 +11:00

545 Commits