spaCy/lemmatizer.md at c922f8e8b058e29536b0c26682be92859dc9e00d

title	teaser	tag	source
Lemmatizer	Assign the base forms of words	class	spacy/lemmatizer.py

The Lemmatizer supports simple part-of-speech-sensitive suffix rules and lookup tables.

Lemmatizer.__init__

Create a Lemmatizer.

Example

from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()

Name	Type	Description
`index`	dict / `None`	Inventory of lemmas in the language.
`exceptions`	dict / `None`	Mapping of string forms to lemmas that bypass the `rules`.
`rules`	dict / `None`	Mapping of part-of-speech names to lists of suffix rewrite rules.
`lookup`	dict / `None`	Lookup table mapping strings to their lemmas.
RETURNS	`Lemmatizer`	The newly created object.

Lemmatizer.__call__

Lemmatize a string.

Example

from spacy.lemmatizer import Lemmatizer
rules = {"noun": [["s", ""]]}
lemmatizer = Lemmatizer(index={}, exceptions={}, rules=rules)
lemmas = lemmatizer("ducks", "NOUN")
assert lemmas == ["duck"]

Name	Type	Description
`string`	unicode	The string to lemmatize, e.g. the token text.
`univ_pos`	unicode / int	The token's universal part-of-speech tag.
`morphology`	dict / `None`	Morphological features following the Universal Dependencies scheme.
RETURNS	list	The available lemmas for the string.
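
To illustrate how suffix rules, exceptions and the index could interact, here is a minimal standalone sketch. It does not import spaCy and is not the library's implementation: the helper name `apply_suffix_rules`, the flat (per part-of-speech) shape of its arguments, and the precedence order (exceptions first, then rules filtered by the index) are all assumptions made for illustration.

```python
def apply_suffix_rules(string, index, exceptions, rules):
    """Hypothetical helper: return candidate lemmas for `string`.

    index:      set of known lemmas (may be empty)
    exceptions: dict mapping a surface form to a list of lemmas
    rules:      list of [old_suffix, new_suffix] pairs
    """
    string = string.lower()
    # Assumption: an exception entry bypasses the rules entirely.
    if string in exceptions:
        return list(exceptions[string])
    candidates = []
    for old, new in rules:
        if old and string.endswith(old):
            form = string[: len(string) - len(old)] + new
            # Keep the rewritten form if the index is empty or contains it.
            if form and (not index or form in index) and form not in candidates:
                candidates.append(form)
    # Fall back to the (lowercased) input when no rule produced a lemma.
    return candidates or [string]


noun_rules = [["s", ""]]
print(apply_suffix_rules("ducks", set(), {}, noun_rules))
print(apply_suffix_rules("geese", set(), {"geese": ["goose"]}, noun_rules))
```

Returning a list rather than a single string matches the documented return type, which allows more than one candidate lemma per input.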

Lemmatizer.lookup

Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a lookup table via their resources, which are set on the individual `Language` class.

Example

from spacy.lemmatizer import Lemmatizer
lookup = {"going": "go"}
lemmatizer = Lemmatizer(lookup=lookup)
assert lemmatizer.lookup("going") == "go"

Name	Type	Description
`string`	unicode	The string to look up.
`orth`	int	Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`.
RETURNS	unicode	The lemma if the string was found, otherwise the original string.
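
The fallback behaviour described above (return the original string when no entry exists) can be sketched with a plain dictionary. This is a standalone illustration under stated assumptions, not spaCy's code: `lookup_lemma` is a hypothetical name, and the optional `orth` hash is ignored.

```python
def lookup_lemma(table, string):
    """Hypothetical helper: table lookup that falls back to the input."""
    # dict.get returns its second argument when the key is missing,
    # so unknown strings come back unchanged.
    return table.get(string, string)


table = {"going": "go"}
print(lookup_lemma(table, "going"))
print(lookup_lemma(table, "walk"))
```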

Lemmatizer.is_base_form

Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.

Example

from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()
pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form

Name	Type	Description
`univ_pos`	unicode / int	The token's universal part-of-speech tag.
`morphology`	dict	The token's morphological features.
RETURNS	bool	Whether the token's part-of-speech tag and morphological features describe a base form.
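
As a rough sketch of the idea, a base-form check can be written as a few tests on the part-of-speech tag and morphological features. The function below is hypothetical and does not reproduce spaCy's actual rules: only the infinitive-verb case is confirmed by the example above, and the singular-noun case is an assumption added for illustration.

```python
def is_base_form_sketch(univ_pos, morphology=None):
    """Hypothetical check: does this tag/feature pair describe a base form?"""
    morphology = morphology or {}
    # Confirmed by the documented example: an infinitive verb is a base form.
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True
    # Assumption for illustration only: a singular noun is uninflected.
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True
    return False


def lemmatize_or_skip(string, univ_pos, morphology=None):
    """Skip rule-based lemmatization when the form is already a base form."""
    if is_base_form_sketch(univ_pos, morphology):
        return [string.lower()]
    return None  # a real lemmatizer would apply its rules here


print(is_base_form_sketch("verb", {"VerbForm": "inf"}))
print(lemmatize_or_skip("Go", "verb", {"VerbForm": "inf"}))
```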

Attributes

Name	Type	Description
`index`	dict / `None`	Inventory of lemmas in the language.
`exc`	dict / `None`	Mapping of string forms to lemmas that bypass the `rules`.
`rules`	dict / `None`	Mapping of part-of-speech names to lists of suffix rewrite rules.
`lookup_table` (new in v2)	dict / `None`	The lemma lookup table, if available.
