Commit Graph

109 Commits

Author SHA1 Message Date
Ines Montani bab9976d9a 💫 Adjust Table API and add docs (#4289)
* Adjust Table API and add docs

* Add attributes and update description [ci skip]

* Use strings.get_string_id instead of hash_string

* Fix table method calls

* Make orth arg in Lemmatizer.lookup optional

Fall back to string, which is now handled by Table.__contains__ out-of-the-box

* Fix method name

* Auto-format
2019-09-15 22:08:13 +02:00
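To illustrate the #4289 entry above, here is a minimal sketch of a table whose `__contains__` accepts either a string or an integer key, so a `lookup` helper can treat `orth` as optional and fall back to the string. This is not spaCy's code: `string_id` is a hypothetical stand-in for `strings.get_string_id`, and the toy `Table` and `lookup` below only mirror the behaviour the commit message describes.

```python
import hashlib
from collections import OrderedDict


def string_id(key):
    """Hypothetical stand-in for a string hasher: ints pass through, strings are hashed."""
    if isinstance(key, int):
        return key
    digest = hashlib.blake2b(key.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class Table(OrderedDict):
    """Toy table stored under integer IDs that also accepts string keys."""

    def __setitem__(self, key, value):
        super().__setitem__(string_id(key), value)

    def __getitem__(self, key):
        return super().__getitem__(string_id(key))

    def __contains__(self, key):
        return super().__contains__(string_id(key))

    def get(self, key, default=None):
        key = string_id(key)
        if super().__contains__(key):
            return super().__getitem__(key)
        return default


def lookup(table, string, orth=None):
    """Return the lemma for `string`; use the integer key `orth` when it is given."""
    key = orth if orth is not None else string
    if key in table:
        return table[key]
    # Unknown words are returned unchanged.
    return string


table = Table()
table["going"] = "go"
print(lookup(table, "going"))
print(lookup(table, "going", orth=string_id("going")))
```

Because the key is normalised in one place, callers that only have the string do not need to hash it themselves.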
Paul O'Leary McCann 7d8df69158 Bloom-filter backed Lookup Tables (#4268)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Lookups / Tables now work

This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.

* Add lookups to setup.py

* Actually add lookups pyx

The previous commit added the old py file...

* Lookups work-in-progress

* Move from pyx back to py

* Add string based lookups, fix serialization

* Update tests, language/lemmatizer to work with string lookups

There are some outstanding issues here:

- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues

More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.

* Change lemmatizer lookup method to pass (orth, string)

* Fix token lookup

* Fix French lookup

* Fix lt lemmatizer test

* Fix Dutch lemmatizer

* Fix lemmatizer lookup test

This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.

* Make uk/nl/ru lemmatizer lookup methods consistent

The mentioned lemmatizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed, so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specifically (orth/id,
string)).

Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should probably be added.

Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.

* Make recently added Greek method compatible

* Remove redundant class/method

Leftovers from a merge not cleaned up adequately.
2019-09-12 17:26:11 +02:00
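The idea behind a Bloom-filter backed lookup table (#4268 above) can be sketched as follows: a small bit set answers "definitely absent" cheaply, and only keys that might be present reach the underlying dict. This is a toy filter written from scratch for illustration; the class names, sizes and hashing scheme are assumptions and do not reflect the implementation in the commit.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter storing its bits in a single Python int."""

    def __init__(self, size=1024, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = 0

    def _positions(self, key):
        # Derive `hash_count` bit positions by salting the key with a seed.
        for seed in range(self.hash_count):
            data = f"{seed}:{key}".encode("utf8")
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class BloomTable:
    """Hypothetical table: a dict guarded by a Bloom filter."""

    def __init__(self):
        self.data = {}
        self.bloom = BloomFilter()

    def set(self, key, value):
        self.bloom.add(key)
        self.data[key] = value

    def __contains__(self, key):
        # The dict check removes the filter's false positives.
        return key in self.bloom and key in self.data

    def get(self, key, default=None):
        if key not in self.bloom:
            return default
        return self.data.get(key, default)


table = BloomTable()
table.set(42, "lemma")
print(42 in table, table.get(7, "fallback"))
```

A Bloom filter can report false positives but never false negatives, which is why the sketch still consults the dict before answering `True`.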
Matthew Honnibal f7a096b462 Update morphology 2019-09-11 18:06:43 +02:00
Matthew Honnibal c47c0269b1 Update morphology features 2019-09-11 15:16:53 +02:00
Matthew Honnibal 67c3d03905 Revert morphology serialisation 2019-08-30 13:13:07 +02:00
Adriane Boyd 893f11a9e3 Serialize tag_map directly
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00
Matthew Honnibal fc0a3c8c38 Add morphology serialization 2019-08-29 21:17:34 +02:00
Matthew Honnibal 188a1cf297 Fix morphology for | features 2019-08-25 21:57:02 +02:00
Ines Montani 278e9d2eb0 Merge branch 'master' into feature/lemmatizer 2019-03-16 13:44:22 +01:00
Matthew Honnibal 80b94313b6 💫 Fix interaction of lemmatizer and tokenizer exceptions (#3388)
Closes #2203. Closes #3268.

Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object.

This PR applies two fixes:

1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite.
2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`).

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-11 01:31:21 +01:00
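The two fixes described in #3388 above can be sketched with plain dicts standing in for tokens. This is a simplified illustration, not spaCy's code: `TAG_IMPLIED_LEMMAS`, `assign_lemma`, `order_attrs` and `apply_row` are hypothetical names, and in the real library the tag-implied lemma comes from the morphology and lemmatizer rather than a fixed mapping.

```python
# Hypothetical mapping from a tag to the lemma it implies.
TAG_IMPLIED_LEMMAS = {"VBZ": "be"}


def assign_lemma(token, lemma):
    """Fix 1: set a tag-implied lemma only if no lemma is set yet."""
    if not token.get("LEMMA"):
        token["LEMMA"] = lemma


def order_attrs(attrs):
    """Fix 2: move TAG to the front, keeping the other attributes in order."""
    return sorted(attrs, key=lambda attr: attr != "TAG")


def apply_row(token, attrs, values):
    """Apply one row of attribute values to a token, TAG first."""
    pairs = dict(zip(attrs, values))
    for attr in order_attrs(attrs):
        value = pairs[attr]
        if attr == "TAG":
            token["TAG"] = value
            implied = TAG_IMPLIED_LEMMAS.get(value)
            if implied:
                assign_lemma(token, implied)
        else:
            # Explicit fields overwrite anything the tag implied.
            token[attr] = value
    return token


explicit = apply_row({}, ["LEMMA", "TAG"], ["custom", "VBZ"])
implied = apply_row({}, ["TAG"], ["VBZ"])
preset = apply_row({"LEMMA": "preset"}, ["TAG"], ["VBZ"])
print(explicit["LEMMA"], implied["LEMMA"], preset["LEMMA"])
```

Applying `TAG` first means the column order in the array no longer decides whether an explicit `LEMMA` survives.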
Matthew Honnibal 5431c47b91 Refactor morphology slightly 2019-03-10 00:59:51 +00:00
Matthew Honnibal 0f12082465 Refactor morphologizer 2019-03-09 22:54:59 +00:00
Matthew Honnibal 41a3016019 Refactor morphologizer class map 2019-03-09 20:55:33 +01:00
Matthew Honnibal eae384ebb2 Add POS to morphological fields 2019-03-09 11:49:44 +00:00
Matthew Honnibal 42bc3ad73b Fix class mapping for morphologizer 2019-03-09 00:20:29 +00:00
Matthew Honnibal 09b26f5e2e Fix compile error 2019-03-08 18:58:26 +01:00
Matthew Honnibal d7ec1d62cb Fix Morphologizer 2019-03-08 18:54:25 +01:00
Matthew Honnibal 322b64dca0 Allow lookup of morphology by attribute name 2019-03-08 01:38:15 +01:00
Matthew Honnibal b5f2b7b454 Add list_features() helper, clean up 2019-03-08 00:08:35 +01:00
Matthew Honnibal 987ee6e884 Fix data reading in morphology 2019-03-07 21:58:43 +01:00
Matthew Honnibal 2669190b85 Normalize props for morph exceptions 2019-03-07 18:32:36 +01:00
Matthew Honnibal fed0371db7 Remove enums from morphology 2019-03-07 17:14:57 +01:00
Matthew Honnibal b9ade7d4e0 Add MorphAnalysisC struct 2019-03-07 14:03:07 +01:00
Matthew Honnibal b69013e2d7 Fix passing of morphological features to lemmatizer 2019-03-07 13:11:38 +01:00
Matthew Honnibal 6734cfec88 Add comment 2019-03-07 12:14:37 +01:00
Matthew Honnibal ae7c728c5f Fix json dependency 2019-03-07 01:17:19 +01:00
Matthew Honnibal 2b8a53ebdc Fix morphology functions 2018-09-26 21:03:57 +02:00
Matthew Honnibal 2be15fa7d2 Fix Python feature enum in morphology 2018-09-25 23:03:43 +02:00
Matthew Honnibal a4fc397880 Add helper to parse features into field and column IDs 2018-09-25 22:13:10 +02:00
Matthew Honnibal 51a297f934 Fix morphology add and update 2018-09-25 21:07:08 +02:00
Matthew Honnibal 34cab8cc49 Update morphology API 2018-09-25 20:53:24 +02:00
Matthew Honnibal 4b7e772f5d Implement the is_animacy_feature etc functions 2018-09-25 17:28:34 +02:00
Matthew Honnibal 8308c1525e Fix exception loading 2018-09-25 15:18:21 +02:00
Matthew Honnibal be8cf39e16 Fix morphology 2018-09-25 10:57:33 +02:00
Matthew Honnibal a3d2e616d5 Restore previous morphology stuff 2018-09-25 00:35:59 +02:00
Matthew Honnibal 6ae645c4ef WIP on supporting morphology features 2018-09-24 23:57:41 +02:00
Matthew Honnibal 7b09a4ca49 Fix lemmatization 2018-07-05 13:56:02 +02:00
Matthew Honnibal 2c4a6d66fa Merge master into develop. Big merge, many conflicts -- need to review 2018-04-29 14:49:26 +02:00
Ines Montani 3141e04822 💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
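The decorator mentioned in #2163 above, which prefixes the error code only when a message is retrieved from its containing class, could look roughly like the sketch below. The names `add_codes`, `Errors` and `E001`, and the message text, are illustrative assumptions rather than a copy of what the commit added.

```python
def add_codes(err_cls):
    """Wrap a class of message templates so attribute access prepends the code."""

    class ErrorsWithCodes:
        def __getattribute__(self, code):
            # Read the raw template from the original class, then format on
            # the way out; the stored string is never modified.
            msg = getattr(err_cls, code)
            return f"[{code}] {msg}"

    return ErrorsWithCodes()


@add_codes
class Errors:
    E001 = "No component '{name}' found in pipeline."


message = Errors.E001.format(name="ner")
print(message)
```

Since the prefix is added at lookup time, call sites can keep using `Errors.E001.format(...)` and tracebacks need no special handling.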
Matthew Honnibal 1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal 31babe3c3f Fix non-clobbering lemmatization 2017-11-06 12:36:05 +01:00
Matthew Honnibal 134d3b8143 Fix morphology 2017-11-05 22:18:22 +01:00
Matthew Honnibal bb25cb0f76 Avoid clobbering preset lemmas 2017-11-05 19:39:38 +01:00
Matthew Honnibal bd2cbdfa85 Make Morphology not fail on unknown tags 2017-11-03 13:29:09 +01:00
ines d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
Matthew Honnibal 66766c1454 Restore SP tag to English tag_map, until models migrate 2017-10-24 17:05:00 +02:00
ines 8492d5be6d Always make lemmatizer return a list of lemmas, not a set 2017-10-24 16:00:56 +02:00
Matthew Honnibal 49895fbef6 Rename 'SP' special tag to '_SP'
Renaming the tag with an underscore lets us add it to the tag map
without worrying that we'll change the sequence of tags, which throws
off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag,
the "VERB" tag is pushed to a different class ID, and the model is all
messed up.
2017-10-20 14:01:12 +02:00
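The reasoning in the '_SP' rename above can be checked with a small sketch. It assumes, purely for illustration, that class IDs are assigned by enumerating the sorted tag names; the commit message only says that inserting 'SP' changes the sequence of tags. The helper `tag_ids` and the three-tag inventory are hypothetical.

```python
def tag_ids(tags):
    """Assign class IDs by position in the sorted tag list (an assumption)."""
    return {tag: i for i, tag in enumerate(sorted(tags))}


base_tags = ["ADJ", "NOUN", "VERB"]

before = tag_ids(base_tags)
with_sp = tag_ids(base_tags + ["SP"])
with_underscore = tag_ids(base_tags + ["_SP"])

# 'SP' sorts before 'VERB', so VERB moves to a new ID.
print(before["VERB"], with_sp["VERB"])
# '_' sorts after the uppercase ASCII letters, so existing IDs are unchanged.
print(before["VERB"], with_underscore["VERB"])
```

Under that assumption, the underscore prefix pushes the new tag to the end of the sequence, leaving the IDs a trained model already relies on untouched.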
Matthew Honnibal 506cf2eb13 Remove cpdef enum, to avoid too much code generation 2017-10-20 14:00:23 +02:00
ines 6dd14dc342 Add lookup lemmas to tokens without POS tags 2017-10-11 13:27:10 +02:00