spaCy/lemmatizer.md at f503817623627b0e7a8e7bdacdcda412fc9318c0

5.0 KiB

Raw Blame History

title	teaser	tag	source
Lemmatizer	Assign the base forms of words	class	spacy/lemmatizer.py

The Lemmatizer supports simple part-of-speech-sensitive suffix rules and lookup tables.

Lemmatizer.init

Initialize a Lemmatizer. Typically, this happens under the hood within spaCy when a Language subclass and its Vocab is initialized.

Example

from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups)

For examples of the data format, see the spacy-lookups-data repo.

Name	Type	Description
`lookups` 2.2	`Lookups`	The lookups object containing the (optional) tables `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`.
RETURNS	`Lemmatizer`	The newly created object.

Lemmatizer.call

Lemmatize a string.

Example

from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups)
lemmas = lemmatizer("ducks", "NOUN")
assert lemmas == ["duck"]

Name	Type	Description
`string`	str	The string to lemmatize, e.g. the token text.
`univ_pos`	str / int	The token's universal part-of-speech tag.
`morphology`	dict / `None`	Morphological features following the Universal Dependencies scheme.
RETURNS	list	The available lemmas for the string.

Lemmatizer.lookup

Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a lookup table via the Lookups.

Example

lookups = Lookups()
lookups.add_table("lemma_lookup", {"going": "go"})
assert lemmatizer.lookup("going") == "go"

Name	Type	Description
`string`	str	The string to look up.
`orth`	int	Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`.
RETURNS	str	The lemma if the string was found, otherwise the original string.

Lemmatizer.is_base_form

Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.

Example

pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form == True

Name	Type	Description
`univ_pos`	str / int	The token's universal part-of-speech tag.
`morphology`	dict	The token's morphological features.
RETURNS	bool	Whether the token's part-of-speech tag and morphological features describe a base form.

Attributes

Name	Type	Description
`lookups` 2.2	`Lookups`	The lookups object containing the rules and data, if available.

5.0 KiB Raw Blame History

Lemmatizer.__init__

Example

Lemmatizer.__call__

Example

Lemmatizer.lookup

Example

Lemmatizer.is_base_form

Example

Attributes

5.0 KiB

Raw Blame History

Lemmatizer.init

Lemmatizer.call