Commit Graph

60 Commits

Author SHA1 Message Date
ines 91899d337b Tidy up language, lemmatizer and scorer 2017-10-27 14:40:14 +02:00
ines 8492d5be6d Always make lemmatizer return a list of lemmas, not a set 2017-10-24 16:00:56 +02:00
ines 95f866f99f Add lookup argument to Lemmatizer.load 2017-10-24 16:00:56 +02:00
ines 3516aa0cea Port over changes from #1389 2017-10-14 13:32:55 +02:00
Matthew Honnibal 9b90d235d1 Fix tag check in lemmatizer 2017-10-12 22:50:43 +02:00
ines 9fd471372a Add lookup lemmatizer to lemmatizer as lookup() method 2017-10-11 13:25:51 +02:00
Matthew Honnibal a6ac4699eb Allow Morphology class to setup tokens
Add Morphology.assign_untagged() C-method, and call it from
Doc.push_back() when a token is created. This gives a place
to allow the Morphology class to initialize token data.
2017-10-11 03:24:14 +02:00
Matthew Honnibal c15d8278cb Avoid lemmatizing inappropriate tags in English lemmatizer 2017-10-11 03:23:23 +02:00
ines 820bf85075 Move LookupLemmatizer to spacy.lemmatizer 2017-10-11 02:25:13 +02:00
Matthew Honnibal 9cb2aef587 Remove print statement 2017-09-14 13:38:28 +02:00
Matthew Honnibal 5c3ff06924 Fix lemmatizer rules 2017-09-06 19:13:24 +02:00
Matthew Honnibal bfddf50081 Fix #1296: Incorrect lemmatization of base form verbs 2017-09-04 15:18:41 +02:00
ines d24589aa72 Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
ines 561f2a3eb4 Use consistent formatting for docstrings 2017-04-15 11:59:21 +02:00
Matthew Honnibal ed2b106f4d Fix circular import in lemmatizer 2017-03-26 07:17:07 -05:00
Matthew Honnibal c748907a66 Fix errors in previous commit 2017-03-25 22:25:01 +01:00
Matthew Honnibal 4f400fa486 Prevent lemmatization of base nouns
Update lemmatizer's base-form check, for change in morphology class.
Closes #903.
2017-03-25 21:51:12 +01:00
Matthew Honnibal 4454c1b23f Block lemmatization of base-form adjectives
Fixes check that an adjective is a base form (as opposed to a
comparative or superlative), so that it's not lemmatized.
e.g. inner -!> inn. Closes #912.
2017-03-25 21:29:57 +01:00
Matthew Honnibal 413138de79 Fix #719: Lemmatizer can no longer output empty string 2017-03-18 16:02:06 +01:00
Matthew Honnibal c4351e1165 Update base-form check in lemmatizer, for UD 2.0 morphology 2017-03-16 17:59:31 -05:00
Matthew Honnibal fea9fe08af Merge pull request #866 from juanmirocks/master
Fix lemmatization of OOV words
2017-03-16 23:37:36 +01:00
ines 1da29a7146 Use new Lemmatizer data and remove file import
Since there's currently only an English lemmatizer, the global
Lemmatizer imports from spacy.en. This is unideal and still needs to be
fixed.
2017-03-12 13:58:22 +01:00
Juan Miguel Cejuela 25c29f072d apply patch 2017-03-01 21:44:17 +01:00
Matthew Honnibal 44f4f008bd Wire up lemmatizer rules for English 2016-12-18 15:50:09 +01:00
Matthew Honnibal a4eb5c2bff Check POS key in lemmatizer, to update it for new data format 2016-12-18 13:28:20 +01:00
Ines Montani 8350d65695 Change morphology and lemmatizer API
Take morphology features as object instead of keyword arguments
2016-12-07 21:12:49 +01:00
Matthew Honnibal e30348b331 Prefer to import from symbols instead of parts_of_speech 2016-11-04 00:27:55 +01:00
Matthew Honnibal f5fe4f595b Fix json loading, for Python 3. 2016-10-20 21:23:26 +02:00
Matthew Honnibal 2e92c6fb3a Fix JSON encoding issue on load 2016-10-20 21:06:48 +02:00
Matthew Honnibal f189a3cb00 Fix encoding when opening files in Python 2.7, re Issue #539 2016-10-20 14:42:56 +02:00
Matthew Honnibal a2f3510d6d Fix lemmatizer 2016-09-27 17:47:05 +02:00
Matthew Honnibal 35cd953f9e Fix pos name conflict with morphology 2016-09-27 14:16:22 +02:00
Matthew Honnibal 40509e8bca Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed. 2016-09-27 14:01:16 +02:00
Matthew Honnibal 3cb4d455d2 Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435 2016-09-27 13:52:11 +02:00
Matthew Honnibal fd65cf6cbb Finish refactoring data loading 2016-09-24 20:26:17 +02:00
Matthew Honnibal 83e364188c Mostly finished loading refactoring. Design is in place, but doesn't work yet. 2016-09-24 15:42:01 +02:00
Henning Peters 846fa49b2a distinct load() and from_package() methods 2016-01-16 10:00:57 +01:00
Henning Peters 788f734513 refactored data_dir->via, add zip_safe, add spacy.load() 2016-01-15 18:01:02 +01:00
Henning Peters bc229790ac integrate with sputnik 2016-01-13 19:46:17 +01:00
Matthew Honnibal eaf2ad59f1 * Fix use of mock Package object 2015-12-31 04:13:15 +01:00
Matthew Honnibal 55bcdf8bdd * Fix errors 2015-12-29 22:32:03 +01:00
Matthew Honnibal aec130af56 Use util.Package class for io
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().

Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.

Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
Matthew Honnibal c5902f2b4b * Upd Lemmatizer to use MockPackage. Replace from_package with load() classmethod 2015-12-29 16:56:02 +01:00
Henning Peters 8359bd4d93 strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible 2015-12-18 09:52:55 +01:00
Henning Peters 9027cef3bc access model via sputnik 2015-12-07 06:01:28 +01:00
maxirmx f07e4accd7 Fixing encoding issue #4 2015-10-21 20:45:56 +03:00
maxirmx fcbfff043f Fixing encoding issue #3 2015-10-21 15:52:34 +03:00
maxirmx fe9d2e2c4e Fixing encode issue #2 2015-10-21 15:36:21 +03:00
maxirmx e4a1726f77 Fixing encoding issue
UTF-8
2015-10-21 14:16:37 +03:00
Matthew Honnibal 5332c0b697 * Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130 2015-10-09 18:54:40 +11:00
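
Note on the base-form commits (4454c1b23f, 4f400fa486, 413138de79, fea9fe08af, 8492d5be6d): the behaviour those messages describe can be sketched roughly as below. This is a minimal, self-contained illustration and not the project's actual code. The function names, the argument order, the toy rules and index, and the use of the UD features `Degree=Pos`, `Number=Sing` and `VerbForm=Inf` as the base-form signal are all assumptions made for the example.

```python
# Hypothetical sketch of a rule-based lemmatizer with a base-form check.
# All names and data here are illustrative assumptions, not the library's API.

def is_base_form(pos, morphology):
    """Return True if the features mark the word as already uninflected."""
    if pos == "adj" and morphology.get("Degree") == "Pos":
        return True
    if pos == "noun" and morphology.get("Number") == "Sing":
        return True
    if pos == "verb" and morphology.get("VerbForm") == "Inf":
        return True
    return False


def lemmatize(string, pos, morphology, index, exceptions, rules):
    """Return a list of candidate lemmas (a list, never a set, never empty)."""
    string = string.lower()
    # Base forms are returned untouched, so suffix rules cannot damage them.
    if is_base_form(pos, morphology):
        return [string]
    forms = list(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if not string.endswith(old):
            continue
        form = string[: len(string) - len(old)] + new
        if not form:
            continue  # never emit an empty string
        if form in index:
            if form not in forms:
                forms.append(form)
        elif form not in oov_forms:
            oov_forms.append(form)
    # Fall back to rule output for out-of-vocabulary words, then to the input.
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return forms


# Toy data for the example only.
ADJ_RULES = [("er", ""), ("est", "")]
TOY_INDEX = {"inn", "tall"}

# With Degree=Pos the adjective is treated as a base form and kept as is.
print(lemmatize("inner", "adj", {"Degree": "Pos"}, TOY_INDEX, {}, ADJ_RULES))
# Without morphological features the "er" rule fires and yields "inn".
print(lemmatize("inner", "adj", {}, TOY_INDEX, {}, ADJ_RULES))
# A superlative is not a base form, so the "est" rule applies.
print(lemmatize("tallest", "adj", {"Degree": "Sup"}, TOY_INDEX, {}, ADJ_RULES))
```

The second call reproduces the kind of error that the message for 4454c1b23f writes as "inner -!> inn": without the base-form check, a suffix rule strips "er" from a word that was never a comparative.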