40 Commits

Author SHA1 Message Date
Adriane Boyd e962784531
Add Lemmatizer and simplify related components (#5848)
* Add Lemmatizer and simplify related components

* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests

Differences:

* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map

* Fix test

* Initial fix for Lemmatizer config/serialization

* Adjust init test to be more generic

* Adjust init test to force empty Lookups

* Add simple cache to rule-based lemmatizer

* Convert language-specific lemmatizers

Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.

* Fix French and Polish lemmatizers

* Remove outdated UPOS conversions

* Update Russian lemmatizer init in tests

* Add minimal init/run tests for custom lemmatizers

* Add option to overwrite existing lemmas

* Update mode setting, lookup loading, and caching

* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods

* Implement strict when lang is not found in lookups

* Fix tables/lookups in make_lemmatizer

* Reallow provided lookups and allow for stricter checks

* Add lookups asset to all Lemmatizer pipe tests

* Rename lookups in lemmatizer init test

* Clean up merge

* Refactor lookup table loading

* Add helper `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.

Additional slight refactor of lookups:

* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`

* Move registry assets into test methods

* Refactor lookups tables config

Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.

* Add pipe and score to lemmatizer

* Simplify Tagger.score

* Add missing import

* Clean up imports and auto-format

* Remove unused kwarg

* Tidy up and auto-format

* Update docstrings for Lemmatizer

Update docstrings for Lemmatizer.

Additionally modify `is_base_form` API to take `Token` instead of
individual features.

* Update docstrings

* Remove tag map values from Tagger.add_label

* Update API docs

* Fix relative link in Lemmatizer API docs
2020-08-07 15:27:13 +02:00
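The `lookup` and `rule` modes, the immutable `mode` property, and the per-method caching described in the commit above can be pictured with a small standalone sketch. This is not spaCy's implementation: the class name `SketchLemmatizer`, the toy tables, and the suffix-rule logic are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy tables standing in for real lookups tables.
LOOKUP_TABLE: Dict[str, str] = {"mice": "mouse", "was": "be"}
RULES: Dict[str, List[Tuple[str, str]]] = {
    "noun": [("ies", "y"), ("s", "")],
    "verb": [("ing", ""), ("ed", "")],
}
EXCEPTIONS: Dict[str, Dict[str, str]] = {"noun": {"mice": "mouse"}}
INDEX: Dict[str, Set[str]] = {"noun": {"pony", "cat"}, "verb": {"walk"}}


class SketchLemmatizer:
    """Illustrative only; names and behavior are assumptions."""

    def __init__(self, mode: str = "lookup") -> None:
        if mode not in ("lookup", "rule"):
            raise ValueError(f"unsupported mode: {mode}")
        self._mode = mode
        self.cache: Dict[Tuple[str, str], str] = {}

    @property
    def mode(self) -> str:
        # Read-only: there is no setter, so the mode cannot change later.
        return self._mode

    def lemmatize(self, text: str, pos: str = "") -> str:
        if self._mode == "lookup":
            return self._lookup_lemmatize(text)
        return self._rule_lemmatize(text, pos)

    def _lookup_lemmatize(self, text: str) -> str:
        # No default lemmas: unknown forms are returned unchanged.
        return LOOKUP_TABLE.get(text, text)

    def _rule_lemmatize(self, text: str, pos: str) -> str:
        # Caching lives inside the mode-specific method.
        key = (text, pos)
        if key in self.cache:
            return self.cache[key]
        string = text.lower()
        lemma = EXCEPTIONS.get(pos, {}).get(string)
        if lemma is None:
            lemma = string
            for old, new in RULES.get(pos, []):
                if string.endswith(old):
                    candidate = string[: len(string) - len(old)] + new
                    # Accept a rewritten form only if it is a known base form.
                    if candidate in INDEX.get(pos, set()):
                        lemma = candidate
                        break
        self.cache[key] = lemma
        return lemma
```

With these toy tables, `SketchLemmatizer("rule").lemmatize("ponies", "noun")` gives `"pony"`, while `SketchLemmatizer("lookup").lemmatize("cats")` returns `"cats"` unchanged because the lookup table has no entry for it.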
Adriane Boyd d106cf66dd Update Morphology to load exceptions as MORPH_RULES
Update `Morphology` to load exceptions in `Morphology.__init__` and
`Morphology.load_morph_exceptions` from the format used in `MORPH_RULES`
rather than the internal format with tuple keys.

* Rename `Morphology.exc` to `Morphology._exc` for internal use with
tuple keys
* Add `Morphology.exc` as a property that converts the internal `_exc`
back to `MORPH_RULES` format, primarily for serialization
2020-07-16 21:16:49 +02:00
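The split between an internal tuple-keyed `_exc` and a public `exc` property in the commit above might look roughly like the following. This is a hedged sketch: the class `ExceptionStore` is hypothetical, and it assumes that the `MORPH_RULES` format is a nested dict of tag, then form, then attributes, and that the internal keys are `(tag, form)` string tuples (the real internal keys may use hashes instead of strings).

```python
from typing import Dict, Optional, Tuple

# Assumed shape of the MORPH_RULES format: tag -> form -> attributes.
MorphRules = Dict[str, Dict[str, Dict[str, str]]]


class ExceptionStore:
    """Hypothetical stand-in for the exception handling described above."""

    def __init__(self, morph_rules: Optional[MorphRules] = None) -> None:
        # Internal format: flat dict keyed by (tag, form) tuples.
        self._exc: Dict[Tuple[str, str], Dict[str, str]] = {}
        if morph_rules:
            self.load_morph_exceptions(morph_rules)

    def load_morph_exceptions(self, morph_rules: MorphRules) -> None:
        # Flatten the nested rules into tuple keys for fast lookup.
        for tag, by_form in morph_rules.items():
            for form, attrs in by_form.items():
                self._exc[(tag, form)] = dict(attrs)

    @property
    def exc(self) -> MorphRules:
        # Rebuild the nested format, e.g. for serialization.
        rules: MorphRules = {}
        for tag, form in sorted(self._exc):
            rules.setdefault(tag, {})[form] = dict(self._exc[(tag, form)])
        return rules
```

Loading a rules dict and reading `exc` back returns an equal nested dict, which is the round trip that serialization relies on in this sketch.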
Ines Montani 648f61d077
Tidy up compiler flags and imports (#5071) 2020-03-02 11:48:10 +01:00
adrianeboyd adc9745718 Modify morphology to support arbitrary features (#4932)
* Restructure tag maps for MorphAnalysis changes

Prepare tag maps for upcoming MorphAnalysis changes that allow
arbitrary features.

* Use default tag map rather than duplicating for ca / uk / vi

* Import tag map into defaults for ga

* Modify tag maps so all morphological fields and features are strings
  * Move features from `"Other"` to the top level
  * Rewrite tuples as strings separated by `","`

* Rewrite morph symbols for fr lemmatizer as strings

* Export MorphAnalysis under spacy.tokens

* Modify morphology to support arbitrary features

Modify `Morphology` and `MorphAnalysis` so that arbitrary features are
supported.

* Modify `MorphAnalysisC` so that it can support arbitrary features and
multiple values per field. `MorphAnalysisC` is redesigned to contain:
  * key: hash of UD FEATS string of morphological features
  * array of `MorphFeatureC` structs that each contain a hash of `Field`
and `Field=Value` for a given morphological feature, which makes it
possible to:
    * find features by field
    * represent multiple values for a given field

* `get_field()` is renamed to `get_by_field()` and is no longer `nogil`.
Instead a new helper function `get_n_by_field()` is `nogil` and returns
`n` features by field.

* `MorphAnalysis.get()` returns all possible values for a field as a
list of individual features such as `["Tense=Pres", "Tense=Past"]`.

* `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string.

* `Morphology.feats_to_dict()` converts a UD FEATS string to a dict
where:
  * Each field has one entry in the dict
  * Multiple values remain separated by a separator in the value string

* `Token.morph_` returns the UD FEATS string and you can set
`Token.morph_` with a UD FEATS string or with a tag map dict.

* Modify get_by_field to use np.ndarray

Modify `get_by_field()` to use np.ndarray. Remove `max_results` from
`get_n_by_field()` and always iterate over all the fields.

* Rewrite without MorphFeatureC

* Add shortcut for existing feats strings as keys

Add shortcut for existing feats strings as keys in `Morphology.add()`.

* Check for '_' as empty analysis when adding morphs

* Extend helper converters in Morphology

Add and extend helper converters that convert and normalize between:

* UD FEATS strings (`"Case=dat,gen|Number=sing"`)
* per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`)
* list of individual features (`["Case=dat", "Case=gen",
"Number=sing"]`)

All converters sort fields and values where applicable.
2020-01-23 22:01:54 +01:00
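The three representations named in the commit above (UD FEATS string, per-field dict, list of individual features) can be converted into one another with a few small functions. The sketch below is standalone and its function names are hypothetical; the `|` and `,` separators and the `_` empty analysis follow the examples in the commit message, while returning `_` for an empty dict is an assumption.

```python
from typing import Dict, Iterable, List, Set

FIELD_SEP = "|"   # separates Field=Value groups in a FEATS string
VALUE_SEP = ","   # separates multiple values within one field
EMPTY = "_"       # empty analysis


def feats_to_dict(feats: str) -> Dict[str, str]:
    """FEATS string -> per-field dict, with fields and values sorted."""
    if not feats or feats == EMPTY:
        return {}
    result: Dict[str, str] = {}
    for feature in sorted(feats.split(FIELD_SEP)):
        field, values = feature.split("=", 1)
        result[field] = VALUE_SEP.join(sorted(values.split(VALUE_SEP)))
    return result


def dict_to_feats(feats_dict: Dict[str, str]) -> str:
    """Per-field dict -> normalized FEATS string."""
    if not feats_dict:
        return EMPTY  # assumption: an empty dict maps to the empty analysis
    parts = []
    for field, values in sorted(feats_dict.items()):
        sorted_values = VALUE_SEP.join(sorted(values.split(VALUE_SEP)))
        parts.append(f"{field}={sorted_values}")
    return FIELD_SEP.join(parts)


def feats_to_list(feats: str) -> List[str]:
    """FEATS string -> list of individual Field=Value features."""
    features: List[str] = []
    for field, values in feats_to_dict(feats).items():
        features.extend(f"{field}={value}" for value in values.split(VALUE_SEP))
    return features


def list_to_feats(features: Iterable[str]) -> str:
    """List of individual features -> normalized FEATS string."""
    by_field: Dict[str, Set[str]] = {}
    for feature in features:
        field, value = feature.split("=", 1)
        by_field.setdefault(field, set()).add(value)
    return dict_to_feats(
        {field: VALUE_SEP.join(sorted(values)) for field, values in by_field.items()}
    )
```

Because every converter sorts fields and values, differently ordered inputs such as `"Number=sing|Case=gen,dat"` and `["Number=sing", "Case=gen", "Case=dat"]` normalize to the same string, `"Case=dat,gen|Number=sing"`.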
Matthew Honnibal 41a3016019 Refactor morphologizer class map 2019-03-09 20:55:33 +01:00
Matthew Honnibal 322b64dca0 Allow lookup of morphology by attribute name 2019-03-08 01:38:15 +01:00
Matthew Honnibal b5f2b7b454 Add list_features() helper, clean up 2019-03-08 00:08:35 +01:00
Matthew Honnibal c1888b05d2 Export helper functions for morphology 2019-03-07 18:33:06 +01:00
Matthew Honnibal fed0371db7 Remove enums from morphology 2019-03-07 17:14:57 +01:00
Matthew Honnibal b9ade7d4e0 Add MorphAnalysisC struct 2019-03-07 14:03:07 +01:00
Matthew Honnibal 022dcda964 Fix morphology enum 2018-09-26 21:03:44 +02:00
Matthew Honnibal 51a297f934 Fix morphology add and update 2018-09-25 21:07:08 +02:00
Matthew Honnibal 34cab8cc49 Update morphology API 2018-09-25 20:53:24 +02:00
Matthew Honnibal 6fe7c72560 Reorder morphology enum, and add begin and end markers 2018-09-25 17:28:13 +02:00
Matthew Honnibal 8308c1525e Fix exception loading 2018-09-25 15:18:21 +02:00
Matthew Honnibal be8cf39e16 Fix morphology 2018-09-25 10:57:33 +02:00
Matthew Honnibal a3d2e616d5 Restore previous morphology stuff 2018-09-25 00:35:59 +02:00
Matthew Honnibal 6ae645c4ef WIP on supporting morphology features 2018-09-24 23:57:41 +02:00
Matthew Honnibal 1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal 6218af0105 Remove cpdef enum, to avoid too much code generation 2017-10-20 13:59:57 +02:00
Matthew Honnibal d528b6e36d Add assign_untagged method in Morphology 2017-10-11 03:22:49 +02:00
Matthew Honnibal b78cc318c3 Fix loading of morphology exceptions 2017-06-04 16:34:32 -05:00
Roman Inflianskas 66e1109b53 Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
Matthew Honnibal 33996e770b Update header for morphology class 2016-12-18 16:50:42 +01:00
Matthew Honnibal 1fb09c3dc1 Fix morphology tagger 2016-11-04 19:19:09 +01:00
Matthew Honnibal 293c79c09a Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly. 2016-11-04 00:29:07 +01:00
Matthew Honnibal 5ca57bd859 * Ensure Morphology can be pickled, to address Issue #125. 2015-10-13 13:44:41 +11:00
Matthew Honnibal 278e12f7e8 * Add morphology symbols to morphology. May need to remove these as an enum. 2015-10-13 13:44:40 +11:00
Matthew Honnibal 534e3dda3c * More work on language independent parsing 2015-08-28 03:44:54 +02:00
Matthew Honnibal c2307fa9ee * More work on language-generic parsing 2015-08-28 02:02:33 +02:00
Matthew Honnibal 0af139e183 * Tagger training now working. Still need to test load/save of model. Morphology still broken. 2015-08-27 09:16:11 +02:00
Matthew Honnibal 008b02b035 * Comment out enums in Morphology for now 2015-08-26 19:17:35 +02:00
Matthew Honnibal 5d5922dbfa * Begin laying out morphological features 2015-08-24 01:04:30 +02:00
Matthew Honnibal ce2edd6312 * Tmp commit. Refactoring to create a Python Lexeme class. 2015-01-12 10:26:22 +11:00
Matthew Honnibal b00bc01d8c * All tests now passing for reorg 2014-12-23 13:18:59 +11:00
Matthew Honnibal 867a4a000c * Export set_morph_from_dict function 2014-12-20 07:28:27 +11:00
Matthew Honnibal 4e30195c6d * Refactor morphology.pyx 2014-12-20 07:27:28 +11:00
Matthew Honnibal 9959a64f7b * Working morphology and lemmatisation. POS tagging quite fast. 2014-12-10 08:09:32 +11:00
Matthew Honnibal 42973c4b37 * Improve efficiency of tagger, and improve morphological processing 2014-12-10 01:02:04 +11:00
Matthew Honnibal 6b34a2f34b * Move morphological analysis into its own module, morphology.pyx 2014-12-09 21:16:17 +11:00