---
title: Attributes
teaser: Token attributes
source: spacy/attrs.pyx
---

Token attributes are specified using internal IDs in many places, including methods such as Doc.has_annotation:

import spacy
from spacy.attrs import DEP

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")

# DEP always has the same internal value
assert DEP == 76

# "DEP" is automatically converted to DEP
assert DEP == nlp.vocab.strings["DEP"]
assert doc.has_annotation(DEP) == doc.has_annotation("DEP")

# look up IDs in spacy.attrs.IDS
from spacy.attrs import IDS
assert IDS["DEP"] == DEP

All methods automatically convert between the string version of an ID ("DEP") and the internal integer symbols (DEP). The internal IDs can be imported from spacy.attrs or retrieved from the StringStore. A map from string attribute names to internal attribute IDs is stored in spacy.attrs.IDS.
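As a rough illustration of this conversion, the sketch below requests the same columns from Doc.to_array once by integer ID and once by string name. It assumes spaCy v3 and a blank English pipeline; the variable names (by_id, by_name, first_orth) are only illustrative.

```python
import spacy
from spacy.attrs import LENGTH, ORTH

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")

# Request the same two columns by integer ID and by string name
by_id = doc.to_array([ORTH, LENGTH])
by_name = doc.to_array(["ORTH", "LENGTH"])
assert (by_id == by_name).all()

# One row per token, one column per requested attribute
assert by_id.shape == (len(doc), 2)

# String-valued attributes are returned as hashes, which the
# StringStore resolves back to text (cast the numpy value to int first)
first_orth = nlp.vocab.strings[int(by_id[0, 0])]
assert first_orth == "There"
```

Passing string names tends to be more readable in application code, while the imported integer constants avoid a lookup and make typos fail at import time.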

The corresponding Token object attributes can be accessed using the same names in lowercase, e.g. token.orth or token.length. For attributes that represent string values, the internal integer ID is accessed as Token.attr, e.g. token.dep, while the string value can be retrieved by appending _ as in token.dep_.
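A minimal sketch of the integer and string forms side by side, assuming spaCy v3 and a blank English pipeline (so no parser has set dependency labels):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")
token = doc[0]

# The plain attribute is an integer ID; the underscore form is the string
assert isinstance(token.orth, int)
assert token.orth_ == "There"

# The StringStore maps the integer ID back to the same string
assert nlp.vocab.strings[token.orth] == token.orth_
assert nlp.vocab.strings[token.lower] == token.lower_ == "there"

# A blank pipeline has no parser, so the dependency label is unset:
# the ID is 0 and the string value is empty
assert token.dep == 0
assert token.dep_ == ""
```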

| Attribute | Description | Type |
| --- | --- | --- |
| DEP | The token's dependency label. | str |
| ENT_ID | The token's entity ID (ent_id). | str |
| ENT_IOB | The IOB part of the token's entity tag. Uses custom integer values rather than the string store: unset is 0, I is 1, O is 2, and B is 3. | str |
| ENT_KB_ID | The token's entity knowledge base ID. | str |
| ENT_TYPE | The token's entity label. | str |
| IS_ALPHA | Token text consists of alphabetic characters. | bool |
| IS_ASCII | Token text consists of ASCII characters. | bool |
| IS_DIGIT | Token text consists of digits. | bool |
| IS_LOWER | Token text is in lowercase. | bool |
| IS_PUNCT | Token is punctuation. | bool |
| IS_SPACE | Token is whitespace. | bool |
| IS_STOP | Token is a stop word. | bool |
| IS_TITLE | Token text is in titlecase. | bool |
| IS_UPPER | Token text is in uppercase. | bool |
| LEMMA | The token's lemma. | str |
| LENGTH | The length of the token text. | int |
| LIKE_EMAIL | Token text resembles an email address. | bool |
| LIKE_NUM | Token text resembles a number. | bool |
| LIKE_URL | Token text resembles a URL. | bool |
| LOWER | The lowercase form of the token text. | str |
| MORPH | The token's morphological analysis. | MorphAnalysis |
| NORM | The normalized form of the token text. | str |
| ORTH | The exact verbatim text of a token. | str |
| POS | The token's universal part of speech (UPOS). | str |
| SENT_START | Token is start of sentence. | bool |
| SHAPE | The token's shape. | str |
| SPACY | Token has a trailing space. | bool |
| TAG | The token's fine-grained part of speech. | str |
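The following sketch exercises a few of the attributes listed above: some boolean lexical flags, and the custom integer values used by ENT_IOB. It assumes spaCy v3, where assigning to doc.ents marks all remaining tokens as outside an entity; the example sentence and the PERSON label are made up for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Ada Lovelace wrote 42 notes.")

# Lexical flags are computed from the token text alone,
# so they work without any trained components
assert doc[0].is_alpha and doc[0].is_title
assert doc[3].is_digit and doc[3].like_num
assert doc[5].is_punct

# Before any entities are set, ENT_IOB is 0 (unset) for every token
assert all(token.ent_iob == 0 for token in doc)

# Mark the first two tokens as one entity (hypothetical label)
doc.ents = [Span(doc, 0, 2, label="PERSON")]

# B is 3, I is 1, and tokens outside the entity are O, which is 2
iob_ids = [token.ent_iob for token in doc]
iob_tags = [token.ent_iob_ for token in doc]
assert iob_ids == [3, 1, 2, 2, 2, 2]
assert iob_tags == ["B", "I", "O", "O", "O", "O"]
assert doc[0].ent_type_ == "PERSON"
```

Because ENT_IOB uses these fixed integers rather than string store hashes, the values can be compared directly against 0 to 3 without a lookup.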