Cython Classes

Doc

The Doc object holds an array of TokenC structs.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Doc.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the Doc object is garbage collected. |
| vocab | Vocab | A reference to the shared Vocab object. |
| c | TokenC* | A pointer to a TokenC struct. |
| length | int | The number of tokens in the document. |
| max_length | int | The underlying size of the Doc.c array. |

Doc.push_back

Append a token to the Doc. The token can be provided as a LexemeC or TokenC pointer, using Cython's fused types.

Example

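from spacy.tokens cimport Doc
from spacy.vocab cimport Vocab

doc = Doc(Vocab())
# Vocab.get takes the memory pool first, then the string to look up
lexeme = doc.vocab.get(doc.vocab.mem, u'hello')
# Append the lexeme and mark it as followed by whitespace
doc.push_back(lexeme, True)
assert doc.text == u'hello '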

| Name | Type | Description |
| --- | --- | --- |
| lex_or_tok | LexemeOrToken | The word to append to the Doc. |
| has_space | bint | Whether the word has trailing whitespace. |
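
The relationship between length, max_length and push_back can be sketched in plain Python. This is only a toy model and not spaCy's implementation: the class name ToyDoc, the initial capacity of 2 and the doubling growth strategy are all assumptions made for illustration.

```python
class ToyDoc:
    """Pure-Python stand-in for the layout described above (not spaCy code)."""

    def __init__(self, initial_capacity=2):
        # max_length models the allocated size of the token array.
        self.max_length = initial_capacity
        # length models the number of tokens actually stored.
        self.length = 0
        self._tokens = [None] * self.max_length

    def push_back(self, word, has_space):
        # Grow the backing array when it is full (doubling is an assumption).
        if self.length == self.max_length:
            self._tokens.extend([None] * self.max_length)
            self.max_length *= 2
        self._tokens[self.length] = (word, has_space)
        self.length += 1

    @property
    def text(self):
        # Rebuild the text from each word plus its optional trailing space.
        return "".join(
            word + (" " if has_space else "")
            for word, has_space in self._tokens[: self.length]
        )


doc = ToyDoc()
doc.push_back("hello", True)
doc.push_back("world", False)
doc.push_back("!", False)
print(doc.text, doc.length, doc.max_length)
```

In this model the third push_back triggers a resize, so length and max_length end up different: the first counts stored tokens, the second counts allocated slots.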

Token

A Cython class providing access and methods for a TokenC struct. Note that the Token object does not own the struct. It only receives a pointer to it.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Token.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| vocab | Vocab | A reference to the shared Vocab object. |
| c | TokenC* | A pointer to a TokenC struct. |
| i | int | The offset of the token within the document. |
| doc | Doc | The parent document. |

Token.cinit

Create a Token object from a TokenC* pointer.

Example

token = Token.cinit(doc.vocab, &doc.c[3], 3, doc)

| Name | Type | Description |
| --- | --- | --- |
| vocab | Vocab | A reference to the shared Vocab. |
| c | TokenC* | A pointer to a TokenC struct. |
| offset | int | The offset of the token within the document. |
| doc | Doc | The parent document. |
| RETURNS | Token | The newly constructed object. |
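
The point that a Token does not own its struct and only holds a pointer to it can be illustrated with a small Python analogy. The ToyToken class and the list-of-dicts storage below are hypothetical stand-ins, not spaCy's data structures.

```python
class ToyToken:
    """A view onto one slot of shared storage; it holds no token data itself."""

    def __init__(self, tokens, offset):
        self._tokens = tokens  # shared storage, owned by the document
        self.i = offset        # position of this token within the document

    @property
    def text(self):
        # Always read through to the shared storage.
        return self._tokens[self.i]["text"]


tokens = [{"text": "hello"}, {"text": "world"}]
view = ToyToken(tokens, 1)
before = view.text
tokens[1]["text"] = "there"  # mutate the underlying storage
after = view.text
print(before, after)
```

Because the view reads through to the shared storage, a change made to the underlying data is visible through every view that points at it.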

Span

A Cython class providing access and methods for a slice of a Doc object.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Span.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The parent document. |
| start | int | The index of the first token of the span. |
| end | int | The index of the first token after the span. |
| start_char | int | The index of the first character of the span. |
| end_char | int | The character offset just past the last character of the span. |
| label | attr_t | A label to attach to the span, e.g. for named entities. |
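
How the token indices relate to the character offsets can be sketched as follows. This is a plain-Python illustration with a hypothetical helper, span_offsets, and it assumes that end_char is an exclusive offset, so that slicing the text with start_char and end_char yields the span's text.

```python
def span_offsets(words, spaces, start, end):
    """Return (start_char, end_char) for the tokens words[start:end]."""
    # Character offset at which each token begins.
    idx = []
    pos = 0
    for word, space in zip(words, spaces):
        idx.append(pos)
        pos += len(word) + (1 if space else 0)
    start_char = idx[start]
    # The end offset excludes the trailing whitespace of the last token.
    end_char = idx[end - 1] + len(words[end - 1])
    return start_char, end_char


words = ["I", "like", "New", "York", "."]
spaces = [True, True, True, False, False]
text = "".join(w + (" " if s else "") for w, s in zip(words, spaces))
start_char, end_char = span_offsets(words, spaces, 2, 4)
print(text[start_char:end_char])
```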

Lexeme

A Cython class providing access and methods for an entry in the vocabulary.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Lexeme.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| c | LexemeC* | A pointer to a LexemeC struct. |
| vocab | Vocab | A reference to the shared Vocab object. |
| orth | attr_t | ID of the verbatim text content. |

Vocab

A Cython class providing access and methods for a vocabulary and other data shared across a language.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see Vocab.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the Vocab object is garbage collected. |
| strings | StringStore | A StringStore that maps strings to hash values and vice versa. |
| length | int | The number of entries in the vocabulary. |

Vocab.get

Retrieve a LexemeC* pointer from the vocabulary, looked up by the string of the word.

Example

lexeme = vocab.get(vocab.mem, u'hello')

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the Vocab object is garbage collected. |
| string | unicode | The string of the word to look up. |
| RETURNS | const LexemeC* | The lexeme in the vocabulary. |

Vocab.get_by_orth

Retrieve a LexemeC* pointer from the vocabulary, looked up by the ID of its verbatim text.

Example

lexeme = vocab.get_by_orth(vocab.mem, doc.c[0].lex.norm)

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the Vocab object is garbage collected. |
| orth | attr_t | ID of the verbatim text content. |
| RETURNS | const LexemeC* | The lexeme in the vocabulary. |
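
The two lookups, by string and by ID, can be modelled in plain Python. This is a rough sketch and not spaCy's code: ToyVocab and toy_hash are hypothetical names, the hash is a stand-in built on hashlib rather than spaCy's own hash function, the mem argument is dropped because Python manages memory itself, and returning None for an unknown ID is a simplification.

```python
import hashlib


def toy_hash(string):
    # Stand-in 64-bit hash; spaCy's real hash function is not reproduced here.
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyVocab:
    def __init__(self):
        self._by_orth = {}  # orth ID -> lexeme record
        self.length = 0     # number of entries in the vocabulary

    def get(self, string):
        # Look up by string, creating the entry the first time it is seen.
        orth = toy_hash(string)
        lexeme = self._by_orth.get(orth)
        if lexeme is None:
            lexeme = {"orth": orth, "text": string}
            self._by_orth[orth] = lexeme
            self.length += 1
        return lexeme

    def get_by_orth(self, orth):
        # Look up by ID; None for unknown IDs is a simplification.
        return self._by_orth.get(orth)


vocab = ToyVocab()
first = vocab.get("hello")
second = vocab.get("hello")
same = vocab.get_by_orth(first["orth"])
print(first is second, same is first, vocab.length)
```

Both lookups hand back the same shared record, which mirrors the idea of returning a pointer to a single entry rather than a copy.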

StringStore

A lookup table to retrieve strings by 64-bit hashes.

This section documents the extra C-level attributes and methods that can't be accessed from Python. For the Python documentation, see StringStore.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the StringStore object is garbage collected. |
| keys | vector[hash_t] | A list of hash values in the StringStore. |
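
A lookup table keyed by 64-bit hashes, with a separate list of keys, might look like this in plain Python. The sketch is illustrative only: ToyStringStore and hash64 are hypothetical names, and the hash is a hashlib-based stand-in, not the function spaCy uses.

```python
import hashlib


def hash64(string):
    # Stand-in 64-bit hash (an assumption, not spaCy's hash function).
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    def __init__(self):
        self.keys = []   # hash values in insertion order
        self._map = {}   # hash value -> original string

    def add(self, string):
        key = hash64(string)
        if key not in self._map:
            self._map[key] = string
            self.keys.append(key)
        return key

    def __getitem__(self, key_or_string):
        # Strings map to their hash; hashes map back to the stored string.
        if isinstance(key_or_string, str):
            return hash64(key_or_string)
        return self._map[key_or_string]


store = ToyStringStore()
key = store.add("apple")
store.add("apple")  # adding the same string again does not duplicate the key
print(store[key], store["apple"] == key, len(store.keys))
```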