title | teaser | tag | source |
---|---|---|---|
Lemmatizer | Assign the base forms of words | class | spacy/lemmatizer.py |
The Lemmatizer supports simple part-of-speech-sensitive suffix rules and lookup tables.
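As a rough illustration of the two mechanisms, here is a minimal sketch using hypothetical one-entry tables (the real tables ship with the language data):

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

# Hypothetical one-entry tables for illustration
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["ves", "f"]]})  # POS-sensitive suffix rule
lookups.add_table("lemma_lookup", {"wolves": "wolf"})       # direct form-to-lemma mapping
lemmatizer = Lemmatizer(lookups)

assert lemmatizer("leaves", "NOUN") == ["leaf"]  # suffix rule: "ves" -> "f"
assert lemmatizer.lookup("wolves") == "wolf"     # lookup table hit
```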
Lemmatizer.__init__
Initialize a Lemmatizer. Typically, this happens under the hood within spaCy when a Language subclass and its Vocab are initialized.
Example

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups)
```
For examples of the data format, see the spacy-lookups-data repo.
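As a rough sketch of the expected shapes (simplified entries for illustration, not the actual shipped data):

```python
# Simplified illustration of the four table formats
lemma_rules = {"noun": [["s", ""], ["ses", "s"]]}   # per-POS [old_suffix, new_suffix] pairs
lemma_index = {"noun": ["duck", "goose"]}           # known base forms per POS
lemma_exc = {"noun": {"geese": ["goose"]}}          # irregular forms mapped to their lemmas
lemma_lookup = {"going": "go", "was": "be"}         # direct form-to-lemma mapping
```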
Name | Type | Description |
---|---|---|
lookups 2.2 | Lookups | The lookups object containing the (optional) tables "lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup". |
RETURNS | Lemmatizer | The newly created object. |
As of v2.2, the lemmatizer is initialized with a Lookups object containing tables for the different components. This makes it easier for spaCy to share and serialize rules and lookup tables via the Vocab, and allows users to modify lemmatizer data at runtime by updating nlp.vocab.lookups.
```diff
- lemmatizer = Lemmatizer(rules=lemma_rules)
+ lemmatizer = Lemmatizer(lookups)
```
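For example, new entries can be added to a loaded pipeline's lookup table at runtime. A minimal sketch, assuming a pipeline whose vocab lookups include a "lemma_lookup" table (as the English models do when the lookup data is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# The tables live on the vocab, so edits are shared across components
# and included when the vocab is serialized
if nlp.vocab.lookups.has_table("lemma_lookup"):
    table = nlp.vocab.lookups.get_table("lemma_lookup")
    table["koalas"] = "koala"  # takes effect immediately
```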
Lemmatizer.__call__
Lemmatize a string.
Example

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups)
lemmas = lemmatizer("ducks", "NOUN")
assert lemmas == ["duck"]
```
Name | Type | Description |
---|---|---|
string | unicode | The string to lemmatize, e.g. the token text. |
univ_pos | unicode / int | The token's universal part-of-speech tag. |
morphology | dict / None | Morphological features following the Universal Dependencies scheme. |
RETURNS | list | The available lemmas for the string. |
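The morphology dict mainly feeds the base-form check. A short sketch, reusing the lemmatizer built in the example above:

```python
# "Number": "sing" marks a base form, so the string is returned as-is
assert lemmatizer("duck", "NOUN", {"Number": "sing"}) == ["duck"]
# A plural noun is not a base form, so the suffix rules apply
assert lemmatizer("ducks", "NOUN", {"Number": "plur"}) == ["duck"]
```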
Lemmatizer.lookup
Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a lookup table via the Lookups object.
Example

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"going": "go"})
lemmatizer = Lemmatizer(lookups)
assert lemmatizer.lookup("going") == "go"
```
Name | Type | Description |
---|---|---|
string | unicode | The string to look up. |
orth | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to None. |
RETURNS | unicode | The lemma if the string was found, otherwise the original string. |
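A short sketch of the fallback behavior and the orth parameter, assuming spacy.strings.get_string_id to compute the hash and the lemmatizer from the example above:

```python
from spacy.strings import get_string_id

# Unknown strings fall back to the input unchanged
assert lemmatizer.lookup("gone") == "gone"
# A precomputed hash can be passed via orth instead of the string
assert lemmatizer.lookup("going", orth=get_string_id("going")) == "go"
```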
Lemmatizer.is_base_form
Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.
Example

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lemmatizer = Lemmatizer(Lookups())
pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form == True
```
Name | Type | Description |
---|---|---|
univ_pos | unicode / int | The token's universal part-of-speech tag. |
morphology | dict | The token's morphological features. |
RETURNS | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
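Conversely, inflected features fail the check, so lemmatization proceeds; a small sketch reusing the lemmatizer from the example above:

```python
# A finite past-tense verb is inflected, so it is not a base form
assert lemmatizer.is_base_form("verb", {"VerbForm": "fin", "Tense": "past"}) == False
```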
Attributes
Name | Type | Description |
---|---|---|
lookups 2.2 | Lookups | The lookups object containing the rules and data, if available. |