| title      | teaser                         | tag   | source              |
| ---------- | ------------------------------ | ----- | ------------------- |
| Lemmatizer | Assign the base forms of words | class | spacy/lemmatizer.py |
The Lemmatizer supports simple part-of-speech-sensitive suffix rules and lookup tables.
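For illustration, here is a minimal sketch of what such rules might look like. The toy index, exceptions and rule data below are invented for this example (assuming the [old_suffix, new_suffix] pair format, keyed by lowercase part-of-speech name, used by spaCy's English data) and are not part of the library:

```python
from spacy.lemmatizer import Lemmatizer

# Invented toy data: one suffix rule per part of speech, keyed by the
# lowercase POS name, plus a small inventory of valid lemma forms.
index = {"noun": {"pony"}, "verb": {"run"}}
exceptions = {"noun": {}, "verb": {}}
rules = {"noun": [["ies", "y"]], "verb": [["ning", ""]]}

lemmatizer = Lemmatizer(index, exceptions, rules)
assert lemmatizer(u"ponies", u"NOUN") == [u"pony"]   # noun rule: -ies -> -y
assert lemmatizer(u"running", u"VERB") == [u"run"]   # verb rule: strip -ning
```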
Lemmatizer.__init__
Create a Lemmatizer.
Example
```python
from spacy.lemmatizer import Lemmatizer

lemmatizer = Lemmatizer()
```
| Name       | Type        | Description                                              |
| ---------- | ----------- | -------------------------------------------------------- |
| index      | dict / None | Inventory of lemmas in the language.                      |
| exceptions | dict / None | Mapping of string forms to lemmas that bypass the rules.  |
| rules      | dict / None | List of suffix rewrite rules.                             |
| lookup     | dict / None | Lookup table mapping strings to their lemmas.             |
| RETURNS    | Lemmatizer  | The newly created object.                                 |
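As a sketch of how these arguments interact (toy data invented for this example), entries in exceptions take precedence over the suffix rules, which is how irregular forms such as "children" can be handled:

```python
from spacy.lemmatizer import Lemmatizer

# Invented toy data: "children" is irregular, so it is listed as an
# exception that bypasses the "-s" suffix rule.
index = {"noun": {"child", "duck"}}
exceptions = {"noun": {"children": ["child"]}}
rules = {"noun": [["s", ""]]}

lemmatizer = Lemmatizer(index, exceptions, rules)
assert lemmatizer(u"children", u"NOUN") == [u"child"]
assert lemmatizer(u"ducks", u"NOUN") == [u"duck"]
```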
Lemmatizer.__call__
Lemmatize a string.
Example
```python
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
lemmas = lemmatizer(u"ducks", u"NOUN")
assert lemmas == [u"duck"]
```
| Name       | Type          | Description                                                          |
| ---------- | ------------- | --------------------------------------------------------------------- |
| string     | unicode       | The string to lemmatize, e.g. the token text.                          |
| univ_pos   | unicode / int | The token's universal part-of-speech tag.                              |
| morphology | dict / None   | Morphological features following the Universal Dependencies scheme.   |
| RETURNS    | list          | The available lemmas for the string.                                   |
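The morphology argument can short-circuit lemmatization when the features already describe a base form (see is_base_form below). A small sketch, reusing the English data from the example above; the exact feature values are assumptions for illustration:

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
# An infinitive is already a base form, so the string is returned as-is.
lemmas = lemmatizer(u"run", u"VERB", morphology={"VerbForm": "inf"})
assert lemmas == [u"run"]
```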
Lemmatizer.lookup
Look up a lemma in the lookup table, if available. If no lemma is found, the original string is returned. Languages can provide a lookup table via the lemma_lookup variable, set on the individual Language class.
Example
```python
lookup = {u"going": u"go"}
lemmatizer = Lemmatizer(lookup=lookup)
assert lemmatizer.lookup(u"going") == u"go"
```
| Name    | Type    | Description                                                        |
| ------- | ------- | ------------------------------------------------------------------ |
| string  | unicode | The string to look up.                                              |
| RETURNS | unicode | The lemma if the string was found, otherwise the original string.   |
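Because the original string is returned when no entry is found, lookup is safe to call on out-of-vocabulary strings. For example:

```python
from spacy.lemmatizer import Lemmatizer

lookup = {u"going": u"go"}
lemmatizer = Lemmatizer(lookup=lookup)
# "gone" has no entry in the table, so the input string comes back unchanged.
assert lemmatizer.lookup(u"gone") == u"gone"
```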
Lemmatizer.is_base_form
Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.
Example
pos = "verb" morph = {"VerbForm": "inf"} is_base_form = lemmatizer.is_base_form(pos, morph) assert is_base_form == True
| Name       | Type          | Description                                                                              |
| ---------- | ------------- | ----------------------------------------------------------------------------------------- |
| univ_pos   | unicode / int | The token's universal part-of-speech tag.                                                  |
| morphology | dict          | The token's morphological features.                                                        |
| RETURNS    | bool          | Whether the token's part-of-speech tag and morphological features describe a base form.   |
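Conversely, features that describe an inflected form should not be flagged as a base form. A small sketch; the feature values below are assumptions chosen for illustration:

```python
from spacy.lemmatizer import Lemmatizer

lemmatizer = Lemmatizer()
# A past participle is an inflected form, not an uninflected paradigm.
pos = "verb"
morph = {"VerbForm": "part", "Tense": "past"}
assert lemmatizer.is_base_form(pos, morph) == False
```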
Attributes
| Name         | Type        | Description                                              |
| ------------ | ----------- | -------------------------------------------------------- |
| index        | dict / None | Inventory of lemmas in the language.                      |
| exc          | dict / None | Mapping of string forms to lemmas that bypass the rules.  |
| rules        | dict / None | List of suffix rewrite rules.                             |
| lookup_table | dict / None | The lemma lookup table, if available.                     |
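As a sketch of how these attributes relate to the constructor arguments (assuming the constructor simply stores the data it is given):

```python
from spacy.lemmatizer import Lemmatizer

lookup = {u"going": u"go"}
lemmatizer = Lemmatizer(lookup=lookup)
# The lookup table passed to the constructor is exposed on the instance.
assert lemmatizer.lookup_table == lookup
```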