spaCy/lemmatizer.md at 8e70a564f11f70b8e1d8acd7b2639562394d7455

title	teaser	tag	source
Lemmatizer	Assign the base forms of words	class	spacy/lemmatizer.py

The Lemmatizer supports simple part-of-speech-sensitive suffix rules and lookup tables.

Lemmatizer.__init__

Create a Lemmatizer.

Example

from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()

Name	Type	Description
`index`	dict / `None`	Inventory of lemmas in the language.
`exceptions`	dict / `None`	Mapping of string forms to lemmas that bypass the `rules`.
`rules`	dict / `None`	Mapping of part-of-speech tags to lists of suffix rewrite rules.
`lookup`	dict / `None`	Lookup table mapping strings to their lemmas.
RETURNS	`Lemmatizer`	The newly created object.
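To make the roles of `index`, `exceptions` and `rules` more concrete, here is a minimal, self-contained sketch of how the three could combine for a single part of speech. The function name `apply_suffix_rules` and the toy tables are hypothetical, and this is not spaCy's actual implementation; it only illustrates the general idea of exceptions bypassing the rules and the index filtering the rule output.

```python
def apply_suffix_rules(string, index, exceptions, rules):
    """Toy rule-based lemmatization for one part of speech (illustrative only)."""
    string = string.lower()
    # Exceptions bypass the suffix rules entirely.
    if string in exceptions:
        return list(exceptions[string])
    forms = []
    for old, new in rules:
        if string.endswith(old):
            # Strip the old suffix and attach the new one.
            candidate = string[: len(string) - len(old)] + new
            # Keep only candidates that appear in the inventory of lemmas.
            if candidate in index and candidate not in forms:
                forms.append(candidate)
    # Fall back to the input string when no rule produced a known lemma.
    return forms or [string]


# Hypothetical toy data for nouns (not taken from spaCy's language data).
noun_index = {"duck", "box"}
noun_exceptions = {"mice": ["mouse"]}
noun_rules = [("es", ""), ("s", "")]

print(apply_suffix_rules("ducks", noun_index, noun_exceptions, noun_rules))  # ['duck']
print(apply_suffix_rules("boxes", noun_index, noun_exceptions, noun_rules))  # ['box']
print(apply_suffix_rules("mice", noun_index, noun_exceptions, noun_rules))   # ['mouse']
```

In this sketch the index acts as a filter: the rule `("s", "")` turns "boxes" into "boxe", which is discarded because it is not a known lemma.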

Lemmatizer.__call__

Lemmatize a string.

Example

from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
lemmas = lemmatizer(u"ducks", u"NOUN")
assert lemmas == [u"duck"]

Name	Type	Description
`string`	unicode	The string to lemmatize, e.g. the token text.
`univ_pos`	unicode / int	The token's universal part-of-speech tag.
`morphology`	dict / `None`	Morphological features following the Universal Dependencies scheme.
RETURNS	list	The available lemmas for the string.

Lemmatizer.lookup

Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a lookup table via the `lemma_lookup` variable, set on the individual `Language` class.

Example

from spacy.lemmatizer import Lemmatizer
lookup = {u"going": u"go"}
lemmatizer = Lemmatizer(lookup=lookup)
assert lemmatizer.lookup(u"going") == u"go"

Name	Type	Description
`string`	unicode	The string to look up.
RETURNS	unicode	The lemma if the string was found, otherwise the original string.
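The fallback behaviour described above (return the lemma if present, otherwise the original string) can be sketched with a plain dictionary. The helper name `lookup_lemma` is hypothetical and stands in for the method only for illustration; it does not reproduce spaCy's code.

```python
def lookup_lemma(table, string):
    """Return the lemma for `string`, or `string` itself if it is not in the table."""
    # dict.get returns its second argument when the key is missing.
    return table.get(string, string)


table = {"going": "go"}
print(lookup_lemma(table, "going"))  # go
print(lookup_lemma(table, "duck"))   # duck
```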

Lemmatizer.is_base_form

Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.

Example

from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()
pos = "verb"
morph = {"VerbForm": "inf"}
is_base_form = lemmatizer.is_base_form(pos, morph)
assert is_base_form

Name	Type	Description
`univ_pos`	unicode / int	The token's universal part-of-speech tag.
`morphology`	dict	The token's morphological features.
RETURNS	bool	Whether the token's part-of-speech tag and morphological features describe a base form.
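As a rough illustration of what such a check might look like, the sketch below treats a few combinations of part-of-speech tag and morphological features as base forms. The function name `looks_like_base_form` and the specific rules (singular nouns, infinitive verbs, positive-degree adjectives) are assumptions chosen for the example; the conditions spaCy actually applies may differ.

```python
def looks_like_base_form(univ_pos, morphology=None):
    """Illustrative check for an uninflected form (rules are assumptions)."""
    morphology = morphology or {}
    if univ_pos == "noun" and morphology.get("Number") == "sing":
        return True
    if univ_pos == "verb" and morphology.get("VerbForm") == "inf":
        return True
    if univ_pos == "adj" and morphology.get("Degree") == "pos":
        return True
    # Anything else is assumed to need lemmatization.
    return False


print(looks_like_base_form("verb", {"VerbForm": "inf"}))  # True
print(looks_like_base_form("noun", {"Number": "plur"}))   # False
```

A caller could use a check like this to return the lowercased input directly and skip the suffix rules when the form is already uninflected.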

Attributes

Name	Type	Description
`index`	dict / `None`	Inventory of lemmas in the language.
`exc`	dict / `None`	Mapping of string forms to lemmas that bypass the `rules`.
`rules`	dict / `None`	Mapping of part-of-speech tags to lists of suffix rewrite rules.
`lookup_table`	dict / `None`	The lemma lookup table, if available.
