spaCy

Commit Graph

Author	SHA1	Message	Date
ines	7c919aeb09	Make sure serializers and deserializers are ordered	2017-06-03 17:05:09 +02:00
ines	0153b66a86	Return self in Tokenizer.from_bytes	2017-06-03 13:26:13 +02:00
Matthew Honnibal	0561df2a9d	Fix tokenizer serialization	2017-05-31 14:12:38 +02:00
Matthew Honnibal	e9419072e7	Fix tokenizer serialisation	2017-05-31 13:43:31 +02:00
Matthew Honnibal	66af019d5d	Fix serialization of tokenizer	2017-05-31 11:43:40 +02:00
Matthew Honnibal	a318f0cae1	Add to/from disk/bytes methods for tokenizer	2017-05-29 12:24:41 +02:00
ines	c5a653fa48	Update docstrings and API docs for Tokenizer	2017-05-21 13:18:14 +02:00
ines	f216422ac5	Remove deprecated load classmethod	2017-05-21 13:18:01 +02:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network * Integrate models into pipeline * Add basic serialization (maybe incorrect) * Fix pickle on vocab	2017-05-17 12:04:50 +02:00
ines	e1efd589c3	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00
ines	c05ec4b89a	Add compat functions and remove old workarounds Add ensure_path util function to handle checking instance of path	2017-04-15 12:11:16 +02:00
ines	d24589aa72	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
ines	561f2a3eb4	Use consistent formatting for docstrings	2017-04-15 11:59:21 +02:00
Raphaël Bournhonesque	f332bf05be	Remove unused import statements	2017-03-21 21:08:54 +01:00
Matthew Honnibal	0ac3d27689	Fix handling of trailing whitespace Fix off-by-one error that meant trailing spaces were being dropped. Closes #792	2017-03-08 15:01:40 +01:00
Matthew Honnibal	0a6d7ca200	Fix spacing after token_match The boolean flag indicating a space after the token was being set incorrectly after the token_match regex was applied. Fixes #859.	2017-03-08 14:33:32 +01:00
Raphaël Bournhonesque	dce8f5515e	Allow zero-width 'infix' token	2017-01-23 18:28:01 +01:00
Ines Montani	aa876884f0	Revert "Revert "Merge remote-tracking branch 'origin/master'"" This reverts commit `fb9d3bb022`.	2017-01-09 13:28:13 +01:00
Matthew Honnibal	a36353df47	Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.	2016-11-04 19:18:07 +01:00
Matthew Honnibal	e0c9695615	Fix doc strings for tokenizer	2016-11-02 23:15:39 +01:00
Matthew Honnibal	e9e6fce576	Handle null prefix/suffix/infix search in tokenizer	2016-11-02 20:35:48 +01:00
Matthew Honnibal	8ce8803824	Fix JSON in tokenizer	2016-10-21 01:44:20 +02:00
Matthew Honnibal	95aaea0d3f	Refactor so that the tokenizer data is read from Python data, rather than from disk	2016-09-25 14:49:53 +02:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	83e364188c	Mostly finished loading refactoring. Design is in place, but doesn't work yet.	2016-09-24 15:42:01 +02:00
Matthew Honnibal	cc8bf62208	* Fix Issue #360 : Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.	2016-05-09 13:23:47 +02:00
Matthew Honnibal	519366f677	* Fix Issue #351 : Indices off when leading whitespace	2016-05-04 15:53:36 +02:00
Matthew Honnibal	04d0209be9	* Recognise multiple infixes in a token.	2016-04-13 18:38:26 +10:00
Henning Peters	b8f63071eb	add lang registration facility	2016-03-25 18:54:45 +01:00
Matthew Honnibal	141639ea3a	* Fix bug in tokenizer that caused new tokens to be added for affixes	2016-02-21 23:17:47 +00:00
Matthew Honnibal	f9e765cae7	* Add pipe() method to tokenizer	2016-02-03 02:32:37 +01:00
Matthew Honnibal	3e9961d2c4	* If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154	2016-01-16 17:08:59 +01:00
Henning Peters	235f094534	untangle data_path/via	2016-01-16 12:23:45 +01:00
Henning Peters	846fa49b2a	distinct load() and from_package() methods	2016-01-16 10:00:57 +01:00
Henning Peters	788f734513	refactored data_dir->via, add zip_safe, add spacy.load()	2016-01-15 18:01:02 +01:00
Henning Peters	bc229790ac	integrate with sputnik	2016-01-13 19:46:17 +01:00
Matthew Honnibal	a6ba43ecaf	* Fix errors in packaging revision	2015-12-29 18:37:26 +01:00
Matthew Honnibal	aec130af56	Use util.Package class for io Previous Sputnik integration caused API change: Vocab, Tagger, etc were loaded via a from_package classmethod, that required a sputnik.Package instance. This forced users to first create a sputnik.Sputnik() instance, in order to acquire a Package via sp.pool(). Instead I've created a small file-system shim, util.Package, which allows classes to have a .load() classmethod, that accepts either util.Package objects, or strings. We can later gut the internals of this and make it a proxy for Sputnik if we need more functionality that should live in the Sputnik library. Sputnik is now only used to download and install the data, in spacy.en.download	2015-12-29 18:00:48 +01:00
Henning Peters	9027cef3bc	access model via sputnik	2015-12-07 06:01:28 +01:00
Matthew Honnibal	68f479e821	* Rename Doc.data to Doc.c	2015-11-04 00:15:14 +11:00
Chris DuBois	dac8fe7bdb	Add __reduce__ to Tokenizer so that English pickles. - Add tests to test_pickle and test_tokenizer that save to tempfiles.	2015-10-23 22:24:03 -07:00
Matthew Honnibal	3ba66f2dc7	* Add string length cap in Tokenizer.__call__	2015-10-16 04:54:16 +11:00
Matthew Honnibal	c2307fa9ee	* More work on language-generic parsing	2015-08-28 02:02:33 +02:00
Matthew Honnibal	119c0f8c3f	* Hack out morphology stuff from tokenizer, while morphology being reimplemented.	2015-08-26 19:20:11 +02:00
Matthew Honnibal	9c4d0aae62	* Switch to better Python2/3 compatible unicode handling	2015-07-28 14:45:37 +02:00
Matthew Honnibal	0c507bd80a	* Fix tokenizer	2015-07-22 14:10:30 +02:00
Matthew Honnibal	2fc66e3723	* Use Py_UNICODE in tokenizer for now, while sort out Py_UCS4 stuff	2015-07-22 13:38:45 +02:00
Matthew Honnibal	109106a949	* Replace UniStr, using unicode objects instead	2015-07-22 04:52:05 +02:00
Matthew Honnibal	e49c7f1478	* Update oov check in tokenizer	2015-07-18 22:45:28 +02:00
Matthew Honnibal	cfd842769e	* Allow infix tokens to be variable length	2015-07-18 22:45:00 +02:00

74 Commits