| title | teaser | tag | source |
| --- | --- | --- | --- |
| Lemmatizer | Assign the base forms of words | class | `spacy/lemmatizer.py` |

The Lemmatizer supports simple part-of-speech-sensitive suffix rules and lookup tables.

Lemmatizer.__init__

Create a Lemmatizer.

Example

from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()

| Name | Type | Description |
| --- | --- | --- |
| `index` | dict / `None` | Inventory of lemmas in the language. |
| `exceptions` | dict / `None` | Mapping of string forms to lemmas that bypass the rules. |
| `rules` | dict / `None` | Mapping of part-of-speech tags to lists of suffix rewrite rules. |
| `lookup` | dict / `None` | Lookup table mapping strings to their lemmas. |
| **RETURNS** | `Lemmatizer` | The newly created object. |

Lemmatizer.__call__

Lemmatize a string.

Example

from spacy.lemmatizer import Lemmatizer
rules = {"noun": [["s", ""]]}
lemmatizer = Lemmatizer(index={}, exceptions={}, rules=rules)
lemmas = lemmatizer("ducks", "NOUN")
assert lemmas == ["duck"]

| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to lemmatize, e.g. the token text. |
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
| `morphology` | dict / `None` | Morphological features following the Universal Dependencies scheme. |
| **RETURNS** | list | The available lemmas for the string. |
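
To make the interplay of `index`, `exceptions` and `rules` more concrete, here is a simplified pure-Python sketch of rule-based lemmatization for a single part-of-speech. It is not spaCy's implementation: the helper name `apply_suffix_rules` is hypothetical, and the sketch assumes that exceptions map a string to a list of lemmas, that exceptions take priority over rules, and that candidates found in the index are preferred over unknown ones.

```python
def apply_suffix_rules(string, index, exceptions, rules):
    """Hypothetical helper: return candidate lemmas for `string`.

    index      -- collection of known lemmas (assumed: set of strings)
    exceptions -- dict mapping a form to a list of lemmas (assumed shape)
    rules      -- list of [old_suffix, new_suffix] pairs
    """
    string = string.lower()
    # Assumption: an exception entry bypasses the suffix rules entirely.
    if string in exceptions:
        return list(exceptions[string])
    known, unknown = [], []
    for old, new in rules:
        if not string.endswith(old):
            continue
        form = string[: len(string) - len(old)] + new
        if not form:
            continue  # never produce an empty lemma
        if form in index:
            known.append(form)
        else:
            unknown.append(form)
    # Prefer candidates found in the index, then any rewritten form,
    # and finally fall back to the (lowercased) input string.
    return known or unknown or [string]


print(apply_suffix_rules("ducks", set(), {}, [["s", ""]]))
```

With an empty index and no exceptions, the single rule `["s", ""]` strips the plural suffix, which mirrors the `"ducks"` example above.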

Lemmatizer.lookup

Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a lookup table via the lemma_lookup variable, set on the individual Language class.

Example

from spacy.lemmatizer import Lemmatizer
lookup = {"going": "go"}
lemmatizer = Lemmatizer(lookup=lookup)
assert lemmatizer.lookup("going") == "go"

| Name | Type | Description |
| --- | --- | --- |
| `string` | unicode | The string to look up. |
| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string. |
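
The fallback behavior described above amounts to a dictionary lookup with the input as the default value. A minimal sketch, assuming the table is a plain `dict`; the function name `lookup_lemma` is hypothetical and not part of spaCy:

```python
def lookup_lemma(table, string):
    """Hypothetical helper: return the lemma for `string`, or `string` itself."""
    # dict.get returns the second argument when the key is missing,
    # so unknown strings pass through unchanged.
    return table.get(string, string)


table = {"going": "go"}
print(lookup_lemma(table, "going"))
print(lookup_lemma(table, "gone"))
```

Because unknown strings are returned unchanged, callers cannot tell from the result alone whether the table contained the string.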

Lemmatizer.is_base_form

Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.

Example

from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()
pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form is True

| Name | Type | Description |
| --- | --- | --- |
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
| `morphology` | dict | The token's morphological features. |
| **RETURNS** | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
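
One way to picture such a check is as a table of feature values that mark a form as uninflected for each part-of-speech. The sketch below is illustrative only: the `BASE_FORM_FEATURES` table and the function `looks_like_base_form` are hypothetical, only the verb entry is taken from the example above, the noun and adjective entries are assumptions, and string tags are assumed rather than integer IDs.

```python
# Hypothetical table: features that mark a form as uninflected.
# Only the "verb" entry comes from the example above; the rest are assumptions.
BASE_FORM_FEATURES = {
    "verb": {"VerbForm": "inf"},
    "noun": {"Number": "sing"},
    "adj": {"Degree": "pos"},
}


def looks_like_base_form(univ_pos, morphology=None):
    """Hypothetical check: do the features describe an uninflected form?"""
    morphology = morphology or {}
    required = BASE_FORM_FEATURES.get(univ_pos.lower())
    if required is None:
        return False  # no shortcut known for this part-of-speech
    # Every required feature must be present with the expected value.
    return all(morphology.get(key) == value for key, value in required.items())


print(looks_like_base_form("verb", {"VerbForm": "inf"}))
```

When the check succeeds, a caller could return the lowercased string directly and skip the suffix rules.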

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `index` | dict / `None` | Inventory of lemmas in the language. |
| `exc` | dict / `None` | Mapping of string forms to lemmas that bypass the rules. |
| `rules` | dict / `None` | Mapping of part-of-speech tags to lists of suffix rewrite rules. |
| `lookup_table` (new in v2) | dict / `None` | The lemma lookup table, if available. |