mirror of https://github.com/explosion/spaCy.git
---
title: Attributes
teaser: Token attributes
source: spacy/attrs.pyx
---

[Token](/api/token) attributes are specified using internal IDs in many places,
including:

- [`Matcher` patterns](/api/matcher#patterns)
- [`Doc.to_array`](/api/doc#to_array) and
  [`Doc.from_array`](/api/doc#from_array)
- [`Doc.has_annotation`](/api/doc#has_annotation)
- the `attrs` argument of the
  [`MultiHashEmbed`](/api/architectures#MultiHashEmbed) Tok2Vec architecture

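
As a rough illustration of two of the places listed above, the sketch below uses an attribute name as a key in a `Matcher` pattern and passes an attribute to `Doc.has_annotation`. It assumes spaCy v3 and a blank English pipeline; the example text and the pattern key `"HELLO"` are made up for illustration.

```python
import spacy
from spacy.attrs import DEP
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Hello world, hello again")

# Matcher patterns refer to token attributes by name, here "LOWER".
matcher = Matcher(nlp.vocab)
matcher.add("HELLO", [[{"LOWER": "hello"}]])
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]

# Doc.has_annotation takes either the string name or the integer ID.
dep_by_name = doc.has_annotation("DEP")
dep_by_id = doc.has_annotation(DEP)
```
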
> ```python
> import spacy
> from spacy.attrs import DEP
>
> nlp = spacy.blank("en")
> doc = nlp("There are many attributes.")
>
> # DEP always has the same internal value
> assert DEP == 76
>
> # "DEP" is automatically converted to DEP
> assert DEP == nlp.vocab.strings["DEP"]
> assert doc.has_annotation(DEP) == doc.has_annotation("DEP")
>
> # look up IDs in spacy.attrs.IDS
> from spacy.attrs import IDS
> assert IDS["DEP"] == DEP
> ```

All methods automatically convert between the string version of an ID (`"DEP"`)
and the internal integer symbols (`DEP`). The internal IDs can be imported from
`spacy.attrs` or retrieved from the [`StringStore`](/api/stringstore). A map
from string attribute names to internal attribute IDs is stored in
`spacy.attrs.IDS`.

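
The conversion described above can be sketched with `Doc.to_array`, which should return the same array whether attributes are given by name or by ID. This is a minimal sketch assuming spaCy v3; the reverse map `id_to_name` is a hypothetical helper built by hand from `IDS`, not part of the documented API.

```python
import spacy
from spacy.attrs import IDS, LENGTH, ORTH

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")

# String names and integer IDs select the same columns.
by_name = doc.to_array(["ORTH", "LENGTH"])
by_id = doc.to_array([ORTH, LENGTH])
same_arrays = bool((by_name == by_id).all())

# IDS maps names to IDs; inverting it recovers a name for an ID.
# (id_to_name is built here for illustration only.)
id_to_name = {attr_id: name for name, attr_id in IDS.items()}
length_name = id_to_name[LENGTH]
```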
The corresponding [`Token` object attributes](/api/token#attributes) can be
accessed using the same names in lowercase, e.g. `token.orth` or `token.length`.
For attributes that represent string values, the internal integer ID is accessed
as `Token.attr`, e.g. `token.dep`, while the string value can be retrieved by
appending `_` as in `token.dep_`.

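
To make the `attr` / `attr_` naming concrete, here is a small sketch using `ORTH` and `LOWER`, which are available even without trained components. It assumes spaCy v3 and a blank English pipeline.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("There are many attributes.")
token = doc[0]

# token.orth is the integer ID, token.orth_ is the string it stands for.
orth_id = token.orth
orth_text = token.orth_

# The StringStore converts in both directions.
id_to_text = nlp.vocab.strings[orth_id]
text_to_id = nlp.vocab.strings[orth_text]

# The same pattern applies to other string-valued attributes.
lower_text = token.lower_
lower_id = token.lower
```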
| Attribute    | Description |
| ------------ | ----------- |
| `DEP`        | The token's dependency label. ~~str~~ |
| `ENT_ID`     | The token's entity ID (`ent_id`). ~~str~~ |
| `ENT_IOB`    | The IOB part of the token's entity tag. Uses custom integer values rather than the string store: unset is `0`, `I` is `1`, `O` is `2`, and `B` is `3`. ~~str~~ |
| `ENT_KB_ID`  | The token's entity knowledge base ID. ~~str~~ |
| `ENT_TYPE`   | The token's entity label. ~~str~~ |
| `IS_ALPHA`   | Token text consists of alphabetic characters. ~~bool~~ |
| `IS_ASCII`   | Token text consists of ASCII characters. ~~bool~~ |
| `IS_DIGIT`   | Token text consists of digits. ~~bool~~ |
| `IS_LOWER`   | Token text is in lowercase. ~~bool~~ |
| `IS_PUNCT`   | Token is punctuation. ~~bool~~ |
| `IS_SPACE`   | Token is whitespace. ~~bool~~ |
| `IS_STOP`    | Token is a stop word. ~~bool~~ |
| `IS_TITLE`   | Token text is in titlecase. ~~bool~~ |
| `IS_UPPER`   | Token text is in uppercase. ~~bool~~ |
| `LEMMA`      | The token's lemma. ~~str~~ |
| `LENGTH`     | The length of the token text. ~~int~~ |
| `LIKE_EMAIL` | Token text resembles an email address. ~~bool~~ |
| `LIKE_NUM`   | Token text resembles a number. ~~bool~~ |
| `LIKE_URL`   | Token text resembles a URL. ~~bool~~ |
| `LOWER`      | The lowercase form of the token text. ~~str~~ |
| `MORPH`      | The token's morphological analysis. ~~MorphAnalysis~~ |
| `NORM`       | The normalized form of the token text. ~~str~~ |
| `ORTH`       | The exact verbatim text of a token. ~~str~~ |
| `POS`        | The token's universal part of speech (UPOS). ~~str~~ |
| `SENT_START` | Token is start of sentence. ~~bool~~ |
| `SHAPE`      | The token's shape. ~~str~~ |
| `SPACY`      | Token has a trailing space. ~~bool~~ |
| `TAG`        | The token's fine-grained part of speech. ~~str~~ |

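
The integer codes listed for `ENT_IOB` can be checked with a hand-annotated document. The sketch below assumes spaCy v3, where assigning to `doc.ents` marks all remaining tokens as outside an entity; the sentence and the `GPE` label are chosen only for illustration, and no trained NER component is involved.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("New York is big")

# Mark "New York" as an entity by hand.
doc.ents = [Span(doc, 0, 2, label="GPE")]

# ENT_IOB uses its own integer codes: 3 = B, 1 = I, 2 = O (0 = unset).
iob_codes = [token.ent_iob for token in doc]
iob_tags = [token.ent_iob_ for token in doc]

# Boolean attributes are exposed under the same lowercase names.
alpha_flags = [token.is_alpha for token in doc]
```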