spaCy

Commit Graph

Author	SHA1	Message	Date
ines	91899d337b	Tidy up language, lemmatizer and scorer	2017-10-27 14:40:14 +02:00
ines	8492d5be6d	Always make lemmatizer return a list of lemmas, not a set	2017-10-24 16:00:56 +02:00
ines	95f866f99f	Add lookup argument to Lemmatizer.load	2017-10-24 16:00:56 +02:00
ines	3516aa0cea	Port over changes from #1389	2017-10-14 13:32:55 +02:00
Matthew Honnibal	9b90d235d1	Fix tag check in lemmatizer	2017-10-12 22:50:43 +02:00
ines	9fd471372a	Add lookup lemmatizer to lemmatizer as lookup() method	2017-10-11 13:25:51 +02:00
Matthew Honnibal	a6ac4699eb	Allow Morphology class to setup tokens. Add Morphology.assign_untagged() C-method, and call it from Doc.push_back() when a token is created. This gives a place to allow the Morphology class to initialize token data.	2017-10-11 03:24:14 +02:00
Matthew Honnibal	c15d8278cb	Avoid lemmatizing inappropriate tags in English lemmatizer	2017-10-11 03:23:23 +02:00
ines	820bf85075	Move LookupLemmatizer to spacy.lemmatizer	2017-10-11 02:25:13 +02:00
Matthew Honnibal	9cb2aef587	Remove print statement	2017-09-14 13:38:28 +02:00
Matthew Honnibal	5c3ff06924	Fix lemmatizer rules	2017-09-06 19:13:24 +02:00
Matthew Honnibal	bfddf50081	Fix #1296 : Incorrect lemmatization of base form verbs	2017-09-04 15:18:41 +02:00
ines	d24589aa72	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
ines	561f2a3eb4	Use consistent formatting for docstrings	2017-04-15 11:59:21 +02:00
Matthew Honnibal	ed2b106f4d	Fix circular import in lemmatizer	2017-03-26 07:17:07 -05:00
Matthew Honnibal	c748907a66	Fix errors in previous commit	2017-03-25 22:25:01 +01:00
Matthew Honnibal	4f400fa486	Prevent lemmatization of base nouns. Update lemmatizer's base-form check, for change in morphology class. Closes #903.	2017-03-25 21:51:12 +01:00
Matthew Honnibal	4454c1b23f	Block lemmatization of base-form adjectives. Fixes check that an adjective is a base form (as opposed to a comparative or superlative), so that it's not lemmatized. e.g. inner -!> inn. Closes #912.	2017-03-25 21:29:57 +01:00
Matthew Honnibal	413138de79	Fix #719 : Lemmatizer can no longer output empty string	2017-03-18 16:02:06 +01:00
Matthew Honnibal	c4351e1165	Update base-form check in lemmatizer, for UD 2.0 morphology	2017-03-16 17:59:31 -05:00
Matthew Honnibal	fea9fe08af	Merge pull request #866 from juanmirocks/master: Fix lemmatization of OOV words	2017-03-16 23:37:36 +01:00
ines	1da29a7146	Use new Lemmatizer data and remove file import. Since there's currently only an English lemmatizer, the global Lemmatizer imports from spacy.en. This is unideal and still needs to be fixed.	2017-03-12 13:58:22 +01:00
Juan Miguel Cejuela	25c29f072d	apply patch	2017-03-01 21:44:17 +01:00
Matthew Honnibal	44f4f008bd	Wire up lemmatizer rules for English	2016-12-18 15:50:09 +01:00
Matthew Honnibal	a4eb5c2bff	Check POS key in lemmatizer, to update it for new data format	2016-12-18 13:28:20 +01:00
Ines Montani	8350d65695	Change morphology and lemmatizer API. Take morphology features as object instead of keyword arguments	2016-12-07 21:12:49 +01:00
Matthew Honnibal	e30348b331	Prefer to import from symbols instead of parts_of_speech	2016-11-04 00:27:55 +01:00
Matthew Honnibal	f5fe4f595b	Fix json loading, for Python 3.	2016-10-20 21:23:26 +02:00
Matthew Honnibal	2e92c6fb3a	Fix JSON encoding issue on load	2016-10-20 21:06:48 +02:00
Matthew Honnibal	f189a3cb00	Fix encoding when opening files in Python 2.7, re Issue #539	2016-10-20 14:42:56 +02:00
Matthew Honnibal	a2f3510d6d	Fix lemmatizer	2016-09-27 17:47:05 +02:00
Matthew Honnibal	35cd953f9e	Fix pos name conflict with morphology	2016-09-27 14:16:22 +02:00
Matthew Honnibal	40509e8bca	Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed.	2016-09-27 14:01:16 +02:00
Matthew Honnibal	3cb4d455d2	Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435	2016-09-27 13:52:11 +02:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	83e364188c	Mostly finished loading refactoring. Design is in place, but doesn't work yet.	2016-09-24 15:42:01 +02:00
Henning Peters	846fa49b2a	distinct load() and from_package() methods	2016-01-16 10:00:57 +01:00
Henning Peters	788f734513	refactored data_dir->via, add zip_safe, add spacy.load()	2016-01-15 18:01:02 +01:00
Henning Peters	bc229790ac	integrate with sputnik	2016-01-13 19:46:17 +01:00
Matthew Honnibal	eaf2ad59f1	* Fix use of mock Package object	2015-12-31 04:13:15 +01:00
Matthew Honnibal	55bcdf8bdd	* Fix errors	2015-12-29 22:32:03 +01:00
Matthew Honnibal	aec130af56	Use util.Package class for io. Previous Sputnik integration caused API change: Vocab, Tagger, etc were loaded via a from_package classmethod, that required a sputnik.Package instance. This forced users to first create a sputnik.Sputnik() instance, in order to acquire a Package via sp.pool(). Instead I've created a small file-system shim, util.Package, which allows classes to have a .load() classmethod, that accepts either util.Package objects, or strings. We can later gut the internals of this and make it a proxy for Sputnik if we need more functionality that should live in the Sputnik library. Sputnik is now only used to download and install the data, in spacy.en.download	2015-12-29 18:00:48 +01:00
Matthew Honnibal	c5902f2b4b	* Upd Lemmatizer to use MockPackage. Replace from_package with load() classmethod	2015-12-29 16:56:02 +01:00
Henning Peters	8359bd4d93	strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible	2015-12-18 09:52:55 +01:00
Henning Peters	9027cef3bc	access model via sputnik	2015-12-07 06:01:28 +01:00
maxirmx	f07e4accd7	Fixing encoding issue #4	2015-10-21 20:45:56 +03:00
maxirmx	fcbfff043f	Fixing encoding issue #3	2015-10-21 15:52:34 +03:00
maxirmx	fe9d2e2c4e	Fixing encode issue #2	2015-10-21 15:36:21 +03:00
maxirmx	e4a1726f77	Fixing encoding issue UTF-8	2015-10-21 14:16:37 +03:00
Matthew Honnibal	5332c0b697	* Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130	2015-10-09 18:54:40 +11:00

60 Commits