Commit Graph

133 Commits

Author SHA1 Message Date
Adriane Boyd e962784531
Add Lemmatizer and simplify related components (#5848)
* Add Lemmatizer and simplify related components

* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests

Differences:

* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map

* Fix test

* Initial fix for Lemmatizer config/serialization

* Adjust init test to be more generic

* Adjust init test to force empty Lookups

* Add simple cache to rule-based lemmatizer

* Convert language-specific lemmatizers

Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.

* Fix French and Polish lemmatizers

* Remove outdated UPOS conversions

* Update Russian lemmatizer init in tests

* Add minimal init/run tests for custom lemmatizers

* Add option to overwrite existing lemmas

* Update mode setting, lookup loading, and caching

* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods

* Implement strict when lang is not found in lookups

* Fix tables/lookups in make_lemmatizer

* Reallow provided lookups and allow for stricter checks

* Add lookups asset to all Lemmatizer pipe tests

* Rename lookups in lemmatizer init test

* Clean up merge

* Refactor lookup table loading

* Add helper from `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.

Additional slight refactor of lookups:

* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`

* Move registry assets into test methods

* Refactor lookups tables config

Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.

* Add pipe and score to lemmatizer

* Simplify Tagger.score

* Add missing import

* Clean up imports and auto-format

* Remove unused kwarg

* Tidy up and auto-format

* Update docstrings for Lemmatizer

Update docstrings for Lemmatizer.

Additionally modify `is_base_form` API to take `Token` instead of
individual features.

* Update docstrings

* Remove tag map values from Tagger.add_label

* Update API docs

* Fix relative link in Lemmatizer API docs
2020-08-07 15:27:13 +02:00
Adriane Boyd fdb8815ef5
Minor refactor for Morphology and MorphAnalysis (#5804)
* `MorphAnalysis.get` returns only the field values
* Move `_normalize_props` inside `Morphology` as
`Morphology.normalize_attrs` and simplify
  * Simplify POS field detection/conversion
  * Convert all non-POS features to strings
* `Morphology` returns an empty string for a missing morph to align
with the FEATS string returned for an existing morph
* Remove unused `list_to_feats`
2020-07-24 09:28:06 +02:00
Adriane Boyd b84fd70cc3
Fix exceptions for Morphology.__reduce__ (#5792)
Pickle exceptions in the MORPH_RULES format instead of the internal
format after the recent `Morphology.__init__` changes.
2020-07-22 15:00:25 +02:00
Ines Montani 796f6c52d1 Merge branch 'develop' into pr/5767 2020-07-19 13:37:46 +02:00
Adriane Boyd 9ee1c54f40
Improve tag map initialization and updating (#5764)
* Improve tag map initialization and updating

Generalize tag map initialization and updating so that the tag map can
be loaded correctly prior to loading a `Corpus` with `spacy debug-data`
and `spacy train`.

* normalize provided tag map as necessary
* use the same method for initializing and updating the tag map

* Replace rather than update tag map

Replace rather than update tag map when loading a custom tag map.
Updating the tag map is problematic due to the sorted list of tag names
and the fact that the tag map will contain lingering/unwanted tags from
the default tag map.

* Update CLI scripts

* Reinitialize cache after loading new tag map

Reinitialize the cache with the right size after loading a new tag map.
2020-07-19 13:13:57 +02:00
Adriane Boyd b81a89f0a9
Update morphologizer (#5766)
* update `Morphologizer.begin_training` for use with `Example`

* make init and begin_training more consistent

* add `Morphology.normalize_features` to normalize outside of
`Morphology.add`

* make sure `get_loss` doesn't create unknown labels when the POS and
morph alignments differ
2020-07-19 11:10:51 +02:00
Adriane Boyd d106cf66dd Update Morphology to load exceptions as MORPH_RULES
Update `Morphology` to load exceptions in `Morphology.__init__` and
`Morphology.load_morph_exceptions` from the format used in `MORPH_RULES`
rather than the internal format with tuple keys.

* Rename to `Morphology.exc` to `Morphology._exc` for internal use with
tuple keys
* Add `Morphology.exc` as a property that converts the internal `_exc`
back to `MORPH_RULES` format, primarily for serialization
2020-07-16 21:16:49 +02:00
Adriane Boyd c9f0f75778
Update get_loss for senter and morphologizer (#5724)
* Update get_loss for senter

Update `SentenceRecognizer.get_loss` to keep it similar to `Tagger`.

* Update get_loss for morphologizer

Update `Morphologizer.get_loss` to keep it similar to `Tagger`.
2020-07-08 13:59:28 +02:00
Sofie Van Landeghem 8d3c0306e1
refactor fixes (#5664)
* fixes in ud_train, UX for morphs

* update pyproject with new version of thinc

* fixes in debug_data script

* cleanup of old unused error messages

* remove obsolete TempErrors

* move error messages to errors.py

* add ENT_KB_ID to default DocBin serialization

* few fixes to simple_ner

* fix tags
2020-06-29 14:33:00 +02:00
Ines Montani 52728d8fa3 Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
Ines Montani a8875d4a4b Fix typo 2020-06-03 14:42:39 +02:00
Ines Montani 4e0610d0d4 Update warning codes 2020-06-03 14:37:09 +02:00
Ines Montani 810fce3bb1 Merge branch 'develop' into master-tmp 2020-06-03 14:36:59 +02:00
Adriane Boyd b6b5908f5e Prefer _SP over SP for default tag map space attrs
If `_SP` is already in the tag map, use the mapping from `_SP` instead
of `SP` so that `SP` can be a valid non-space tag. (Chinese has a
non-space tag `SP` which was overriding the mapping of `_SP` to
`SPACE`.)
2020-05-26 14:57:13 +02:00
Ines Montani 5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Ines Montani f44897e4c6 Update warning IDs 2020-05-21 18:39:11 +02:00
Ines Montani 70ee4ef4fd Fix small errors 2020-03-26 13:47:31 +01:00
Ines Montani b0cfab317f Merge branch 'develop' into refactor/simplify-warnings 2020-03-04 16:38:55 +01:00
Ines Montani 648f61d077
Tidy up compiler flags and imports (#5071) 2020-03-02 11:48:10 +01:00
Ines Montani 37691e6d5d Simplify warnings 2020-02-28 12:20:23 +01:00
adrianeboyd adc9745718 Modify morphology to support arbitrary features (#4932)
* Restructure tag maps for MorphAnalysis changes

Prepare tag maps for upcoming MorphAnalysis changes that allow
arbritrary features.

* Use default tag map rather than duplicating for ca / uk / vi

* Import tag map into defaults for ga

* Modify tag maps so all morphological fields and features are strings
  * Move features from `"Other"` to the top level
  * Rewrite tuples as strings separated by `","`

* Rewrite morph symbols for fr lemmatizer as strings

* Export MorphAnalysis under spacy.tokens

* Modify morphology to support arbitrary features

Modify `Morphology` and `MorphAnalysis` so that arbitrary features are
supported.

* Modify `MorphAnalysisC` so that it can support arbitrary features and
multiple values per field. `MorphAnalysisC` is redesigned to contain:
  * key: hash of UD FEATS string of morphological features
  * array of `MorphFeatureC` structs that each contain a hash of `Field`
and `Field=Value` for a given morphological feature, which makes it
possible to:
    * find features by field
    * represent multiple values for a given field

* `get_field()` is renamed to `get_by_field()` and is no longer `nogil`.
Instead a new helper function `get_n_by_field()` is `nogil` and returns
`n` features by field.

* `MorphAnalysis.get()` returns all possible values for a field as a
list of individual features such as `["Tense=Pres", "Tense=Past"]`.

* `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string.

* `Morphology.feats_to_dict()` converts a UD FEATS string to a dict
where:
  * Each field has one entry in the dict
  * Multiple values remain separated by a separator in the value string

* `Token.morph_` returns the UD FEATS string and you can set
`Token.morph_` with a UD FEATS string or with a tag map dict.

* Modify get_by_field to use np.ndarray

Modify `get_by_field()` to use np.ndarray. Remove `max_results` from
`get_n_by_field()` and always iterate over all the fields.

* Rewrite without MorphFeatureC

* Add shortcut for existing feats strings as keys

Add shortcut for existing feats strings as keys in `Morphology.add()`.

* Check for '_' as empty analysis when adding morphs

* Extend helper converters in Morphology

Add and extend helper converters that convert and normalize between:

* UD FEATS strings (`"Case=dat,gen|Number=sing"`)
* per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`)
* list of individual features (`["Case=dat", "Case=gen",
"Number=sing"]`)

All converters sort fields and values where applicable.
2020-01-23 22:01:54 +01:00
Ines Montani a892821c51 More formatting changes 2019-12-25 17:59:52 +01:00
Ines Montani db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani 16aa092fb5 Improve Morphology errors (#4314)
* Improve Morphology errors

* Also clean up some other errors

* Update errors.py
2019-09-21 14:37:06 +02:00
Ines Montani bab9976d9a
💫 Adjust Table API and add docs (#4289)
* Adjust Table API and add docs

* Add attributes and update description [ci skip]

* Use strings.get_string_id instead of hash_string

* Fix table method calls

* Make orth arg in Lemmatizer.lookup optional

Fall back to string, which is now handled by Table.__contains__ out-of-the-box

* Fix method name

* Auto-format
2019-09-15 22:08:13 +02:00
Paul O'Leary McCann 7d8df69158 Bloom-filter backed Lookup Tables (#4268)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Lookups / Tables now work

This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.

* Add lookups to setup.py

* Actually add lookups pyx

The previous commit added the old py file...

* Lookups work-in-progress

* Move from pyx back to py

* Add string based lookups, fix serialization

* Update tests, language/lemmatizer to work with string lookups

There are some outstanding issues here:

- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues

More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.

* Change lemmatizer lookup method to pass (orth, string)

* Fix token lookup

* Fix French lookup

* Fix lt lemmatizer test

* Fix Dutch lemmatizer

* Fix lemmatizer lookup test

This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.

* Make uk/nl/ru lemmatizer lookup methods consistent

The mentioned tokenizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specificially (orth/id,
string)).

Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should proably be added.

Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.

* Make recently added Greek method compatible

* Remove redundant class/method

Leftovers from a merge not cleaned up adequately.
2019-09-12 17:26:11 +02:00
Matthew Honnibal f7a096b462 Update morphology 2019-09-11 18:06:43 +02:00
Matthew Honnibal c47c0269b1 Update morphology features 2019-09-11 15:16:53 +02:00
Matthew Honnibal 67c3d03905 Revert morphology serialisation 2019-08-30 13:13:07 +02:00
Adriane Boyd 893f11a9e3 Serialize tag_map directly
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00
Matthew Honnibal fc0a3c8c38 Add morphology serialization 2019-08-29 21:17:34 +02:00
Matthew Honnibal 188a1cf297 Fix morphology for | features 2019-08-25 21:57:02 +02:00
Ines Montani 278e9d2eb0 Merge branch 'master' into feature/lemmatizer 2019-03-16 13:44:22 +01:00
Matthew Honnibal 80b94313b6 💫 Fix interaction of lemmatizer and tokenizer exceptions (#3388)
Closes #2203. Closes #3268.

Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object.

This PR applies two fixes:

1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite.
2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`).

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-11 01:31:21 +01:00
Matthew Honnibal 5431c47b91 Refactor morphology slightly 2019-03-10 00:59:51 +00:00
Matthew Honnibal 0f12082465 Refactor morphologizer 2019-03-09 22:54:59 +00:00
Matthew Honnibal 41a3016019 Refactor morphologizer class map 2019-03-09 20:55:33 +01:00
Matthew Honnibal eae384ebb2 Add POS to morphological fields 2019-03-09 11:49:44 +00:00
Matthew Honnibal 42bc3ad73b Fix class mapping for morphologizer 2019-03-09 00:20:29 +00:00
Matthew Honnibal 09b26f5e2e Fix compile error 2019-03-08 18:58:26 +01:00
Matthew Honnibal d7ec1d62cb Fix Morphologizer 2019-03-08 18:54:25 +01:00
Matthew Honnibal 322b64dca0 Allow lookup of morphology by attribute name 2019-03-08 01:38:15 +01:00
Matthew Honnibal b5f2b7b454 Add list_features() helper, clean up 2019-03-08 00:08:35 +01:00
Matthew Honnibal 987ee6e884 Fix data reading in morphology 2019-03-07 21:58:43 +01:00
Matthew Honnibal 2669190b85 Normalize props for morph exceptions 2019-03-07 18:32:36 +01:00
Matthew Honnibal fed0371db7 Remove enums from morphology 2019-03-07 17:14:57 +01:00
Matthew Honnibal b9ade7d4e0 Add MorphAnalysisC struct 2019-03-07 14:03:07 +01:00
Matthew Honnibal b69013e2d7 Fix passing of morphological features to lemmatizer 2019-03-07 13:11:38 +01:00
Matthew Honnibal 6734cfec88 Add comment 2019-03-07 12:14:37 +01:00
Matthew Honnibal ae7c728c5f Fix json dependency 2019-03-07 01:17:19 +01:00