StringStore & Vocab
Reference: spacy/strings.pyx
Reference: spacy/vocab.pyx
Overview
spaCy represents most strings internally using a uint64 in Cython which
corresponds to a hash. The magic required to make this largely transparent is
handled by the StringStore, and is integrated into the pipelines using the
Vocab, which also connects it to some other information.
These are mostly internal details that average library users should never have
to think about. On the other hand, when developing a component it's normal to
interact with the Vocab for lexeme data or word vectors, and it's not unusual
to add labels to the StringStore.
StringStore
Overview
The StringStore is a cdef class that looks a bit like a two-way dictionary,
though it is not a subclass of anything in particular.
The main functionality of the StringStore is that __getitem__ converts
hashes into strings or strings into hashes.
The full details of the conversion are complicated. Normally you shouldn't have to worry about them, but the first applicable case here is used to get the return value:
- 0 and the empty string are special cased to each other
- internal symbols use a lookup table (SYMBOLS_BY_STR)
- normal strings or bytes are hashed
- internal symbol IDs in SYMBOLS_BY_INT are handled
- anything not yet handled is used as a hash to look up a string
For the symbol enums, see symbols.pxd.
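The lookup order above can be sketched in pure Python. This is only a toy model, not the code in strings.pyx: ToyStringStore, toy_hash and the two TOY_SYMBOLS tables are hypothetical names, the symbol IDs are made up for illustration, and blake2b stands in for murmurhash so that the sketch needs nothing beyond the standard library.

```python
import hashlib

# Hypothetical stand-ins for SYMBOLS_BY_STR / SYMBOLS_BY_INT (IDs are illustrative).
TOY_SYMBOLS_BY_STR = {"IS_ALPHA": 1, "IS_ASCII": 2}
TOY_SYMBOLS_BY_INT = {value: key for key, value in TOY_SYMBOLS_BY_STR.items()}


def toy_hash(text):
    """Return a 64-bit hash of the UTF-8 bytes (blake2b here, not murmurhash)."""
    data = text.encode("utf8") if isinstance(text, str) else text
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


class ToyStringStore:
    def __init__(self):
        self._by_hash = {}  # hash -> interned string

    def add(self, text):
        """Intern a string so that the hash can be converted back later."""
        key = self[text]
        if text != "" and text not in TOY_SYMBOLS_BY_STR:
            self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        # 1. 0 and the empty string are special cased to each other
        if isinstance(key, str) and key == "":
            return 0
        if isinstance(key, int) and key == 0:
            return ""
        # 2. internal symbols use a lookup table
        if isinstance(key, str) and key in TOY_SYMBOLS_BY_STR:
            return TOY_SYMBOLS_BY_STR[key]
        # 3. normal strings or bytes are hashed (nothing is stored here)
        if isinstance(key, (str, bytes)):
            return toy_hash(key)
        # 4. internal symbol IDs are handled
        if key in TOY_SYMBOLS_BY_INT:
            return TOY_SYMBOLS_BY_INT[key]
        # 5. anything else is treated as a hash; raises KeyError if never added
        return self._by_hash[key]
```

In this sketch, store["spacy"] returns a hash even on a fresh store, while store[that_hash] raises KeyError until store.add("spacy") has been called.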
Almost all strings in spaCy are stored in the StringStore. This naturally
includes tokens, but also includes things like labels (not just NER/POS/dep,
but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One
of the main results of this is that tokens can be represented by a compact C
struct (LexemeC/TokenC) that mostly consists of string hashes. This also means
that converting input for the models is straightforward, and there's not a
token mapping step like in many machine learning frameworks. Additionally,
because the token IDs in spaCy are based on hashes, they are consistent across
environments or models.
One pattern you'll see a lot in spaCy APIs is that something.value returns an
int and something.value_ returns a string. That's implemented using the
StringStore. Typically the int is stored in a C struct and the string is
generated via a property that calls into the StringStore with the int.
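A minimal sketch of the value / value_ pattern, written in plain Python rather than Cython. MiniStore and ToyToken are hypothetical names used only for illustration; in spaCy the int lives in a C struct and the store is the real StringStore.

```python
import hashlib


class MiniStore:
    """Hypothetical two-way store standing in for the StringStore."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        key = int.from_bytes(digest, "little")
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        return self._by_hash[key]


class ToyToken:
    def __init__(self, strings, text):
        self._strings = strings
        # Only ints are stored on the object itself.
        self.orth = strings.add(text)
        self.lower = strings.add(text.lower())

    @property
    def orth_(self):
        # The string is regenerated on demand from the stored int.
        return self._strings[self.orth]

    @property
    def lower_(self):
        return self._strings[self.lower]
```

With this sketch, ToyToken(MiniStore(), "Apple").lower is an int and .lower_ is "apple".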
Besides __getitem__, the StringStore has functions to return specifically a
string or specifically a hash, regardless of whether the input was a string or
hash to begin with, though these are only used occasionally.
Implementation Details: Hashes and Allocations
Hashes are 64-bit and are computed using murmurhash on UTF-8 bytes. There is no mechanism for detecting and avoiding collisions. To date there has never been a reproducible collision or user report about any related issues.
The empty string is not hashed; it is simply converted to and from 0.
A small number of strings use indices into a lookup table (so low integers)
rather than hashes. These are mostly Universal Dependencies labels or other
strings considered "core" in spaCy. This was critical in v1, which hadn't
introduced hashing yet. Since v2 it's important for items in spacy.attrs,
especially lexeme flags, but is otherwise only maintained for backwards
compatibility.
You can call strings["mystring"] with a string the StringStore has never seen
before and it will return a hash. But in order to do the reverse operation, you
need to call strings.add("mystring") first. Without a call to add the
string will not be interned.
Example:
from spacy.strings import StringStore

ss = StringStore()
hashval = ss["spacy"]  # 10639093010105930009
try:
    # this won't work
    ss[hashval]
except KeyError:
    print(f"key {hashval} unknown in the StringStore.")

ss.add("spacy")
assert ss[hashval] == "spacy"  # it works now

# There is no `.keys` property, but you can iterate over keys
# The empty string will never be in the list of keys
for key in ss:
    print(key)
In normal use nothing is ever removed from the StringStore. In theory this
means that if you do something like iterate through all hex values of a certain
length you can have explosive memory usage. In practice this has never been an
issue. (Note that this is also different from using sys.intern to intern
Python strings, which does not guarantee they won't be garbage collected later.)
Strings are stored in the StringStore in a peculiar way: each string uses a
union that is either an eight-byte char[] or a char*. Short strings are
stored directly in the char[], while longer strings are stored in allocated
memory and prefixed with their length. This is a strategy to reduce indirection
and memory fragmentation. See decode_Utf8Str and _allocate in strings.pyx
for the implementation.
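The inline-versus-allocated idea can be illustrated with a simplified Python model. This is not the layout used in strings.pyx: toy_encode and toy_decode are hypothetical helpers, and the one-byte length inside the inline buffer and the four-byte length prefix for long strings are assumptions chosen to keep the sketch short.

```python
INLINE_SIZE = 8  # size of the inline char[] described above


def toy_encode(text):
    """Return a (kind, buffer) pair modelling the two arms of the union."""
    data = text.encode("utf8")
    if len(data) < INLINE_SIZE:
        # Short string: one length byte plus payload, zero-padded to 8 bytes.
        # (Assumption: the real inline encoding may differ.)
        buf = bytes([len(data)]) + data
        return ("inline", buf.ljust(INLINE_SIZE, b"\0"))
    # Long string: separately allocated buffer prefixed with its length.
    # (Assumption: a 4-byte little-endian prefix, chosen for simplicity.)
    return ("heap", len(data).to_bytes(4, "little") + data)


def toy_decode(kind, buf):
    """Invert toy_encode for either representation."""
    if kind == "inline":
        length = buf[0]
        return buf[1:1 + length].decode("utf8")
    length = int.from_bytes(buf[:4], "little")
    return buf[4:4 + length].decode("utf8")
```

In this model "cat" fits in the eight-byte buffer with no extra allocation, while "tokenizer" (nine UTF-8 bytes) goes to the length-prefixed buffer.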
When to Use the StringStore?
While you can ignore the StringStore in many cases, there are situations where
you should make use of it to avoid errors.
Any time you introduce a string that may be set on a Doc field that has a hash,
you should add the string to the StringStore. This mainly happens when adding
labels in components, but there are some other cases:
- syntax iterators, mainly get_noun_chunks
- external data used in components, like the KnowledgeBase in the entity_linker
- labels used in tests
Vocab
The Vocab is a core component of a Language pipeline. Its main function is
to manage Lexemes, which are structs that contain information about a token
that depends only on its surface form, without context. Lexemes store much of
the data associated with Tokens. As a side effect of this the Vocab also
manages the StringStore for a pipeline and a grab-bag of other data.
These are things stored in the vocab:
- Lexemes
- StringStore
- Morphology: manages info used in MorphAnalysis objects
- vectors: basically a dict for word vectors
- lookups: language specific data like lemmas
- writing_system: language specific metadata
- get_noun_chunks: a syntax iterator
- lex attribute getters: functions like is_punct, set in language defaults
- cfg: not the pipeline config, this is mostly unused
- _unused_object: Formerly an unused object, kept around until v4 for compatibility
Some of these, like the Morphology and Vectors, are complex enough that they need their own explanations. Here we'll just look at Vocab-specific items.
Lexemes
A Lexeme is a type that mainly wraps a LexemeC, a struct consisting of ints
that identify various context-free token attributes. Lexemes are the core data
of the Vocab, and can be accessed using __getitem__ on the Vocab. The memory
for storing LexemeC objects is managed by a pool that belongs to the Vocab.
Note that __getitem__ on the Vocab works much like the StringStore, in
that it accepts a hash or a string, with one important difference: if you do a
lookup using a string, that value is added to the StringStore automatically.
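The difference can be sketched with a toy model. ToyStrings and ToyVocab are hypothetical classes, lexemes are plain dicts instead of LexemeC structs, and blake2b stands in for murmurhash; the point is only that a string lookup on the vocab interns the string as a side effect.

```python
import hashlib


def toy_hash(text):
    """64-bit stand-in hash (not murmurhash)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStrings:
    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        key = toy_hash(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        # Looking up a string only hashes it; nothing is stored.
        if isinstance(key, str):
            return toy_hash(key)
        return self._by_hash[key]

    def __contains__(self, text):
        return toy_hash(text) in self._by_hash


class ToyVocab:
    def __init__(self):
        self.strings = ToyStrings()
        self._lexemes = {}  # orth hash -> lexeme data

    def __getitem__(self, key):
        # Unlike ToyStrings.__getitem__, a string lookup here interns the string.
        orth = self.strings.add(key) if isinstance(key, str) else key
        if orth not in self._lexemes:
            self._lexemes[orth] = {"orth": orth}
        return self._lexemes[orth]
```

After vocab["hello"], the reverse lookup vocab.strings[lexeme["orth"]] works without an explicit add call.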
The attributes stored in a LexemeC
are:
- orth (the raw text)
- lower
- norm
- shape
- prefix
- suffix
Most of these are straightforward. All of them can be customized, and (except
orth) probably should be since the defaults are based on English, but in
practice this is rarely done at present.
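To make the attributes concrete, here is a rough sketch of how they could be derived from the raw text. toy_shape and toy_lex_attrs are hypothetical helpers, and the exact rules (first character as prefix, last three characters as suffix, runs in the shape capped at four characters, norm defaulting to the lowercase form) are assumptions modelled on English-style defaults rather than a copy of spaCy's getters. The values are shown as strings for readability; in a LexemeC they would be stored as hashes.

```python
def toy_shape(text):
    """Map letters to X/x and digits to d, capping repeated runs at four."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= 4:
            shape.append(mapped)
    return "".join(shape)


def toy_lex_attrs(orth):
    """Compute context-free attributes that depend only on the surface form."""
    lower = orth.lower()
    return {
        "orth": orth,
        "lower": lower,
        "norm": lower,  # assumption: no language-specific norm exceptions
        "shape": toy_shape(orth),
        "prefix": orth[:1],
        "suffix": orth[-3:],
    }
```

None of these attributes depend on context, which is why they can be computed once per surface form and shared by every token with that text.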
Lookups
This is basically a dict of dicts, implemented using a Table for each
sub-dict, that stores lemmas and other language-specific lookup data.
A Table is a subclass of OrderedDict used for string-to-string data. It uses
Bloom filters to speed up misses and has some extra serialization features.
Tables are not used outside of the lookups.
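The Bloom filter idea can be sketched as follows. ToyTable is a hypothetical class, not spaCy's Table: the filter is a plain Python int used as a bitmask, the sizes are arbitrary, and hashing uses blake2b from the standard library. It only shows why a miss can often be answered without touching the dict.

```python
from collections import OrderedDict
import hashlib


class ToyTable(OrderedDict):
    NUM_BITS = 1024  # arbitrary filter size for this sketch
    NUM_HASHES = 3

    def __init__(self, data=None):
        super().__init__()
        self._bits = 0  # Bloom filter stored as an integer bitmask
        for key, value in (data or {}).items():
            self[key] = value

    def _positions(self, key):
        # Derive NUM_HASHES bit positions from differently salted hashes.
        for seed in range(self.NUM_HASHES):
            digest = hashlib.blake2b(
                key.encode("utf8"), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.NUM_BITS

    def __setitem__(self, key, value):
        for pos in self._positions(key):
            self._bits |= 1 << pos
        super().__setitem__(key, value)

    def __contains__(self, key):
        # If any bit is unset the key was never added: a cheap, definite miss.
        if not all((self._bits >> pos) & 1 for pos in self._positions(key)):
            return False
        # All bits set may be a false positive, so confirm with the dict.
        return super().__contains__(key)

    def get(self, key, default=None):
        if key not in self:
            return default
        return super().__getitem__(key)
```

A Bloom filter never reports a stored key as missing, so the only cost of a false positive is one ordinary dict lookup.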
Lex Attribute Getters
Lexical Attribute Getters like is_punct are defined on a per-language basis,
much like lookups, but take the form of functions rather than string-to-string
dicts, so they're stored separately.
Writing System
This is a dict with three attributes:
- direction: ltr or rtl (default ltr)
- has_case: bool (default True)
- has_letters: bool (default True, False only for CJK for now)
Currently these are not used much - the main use is that direction is used in
visualizers, though rtl doesn't quite work (see #4854). In the future they
could be used when choosing hyperparameters for subwords, controlling word
shape generation, and similar tasks.
Other Vocab Members
The Vocab is kind of the default place to store things from Language.defaults
that don't belong to the Tokenizer. The following properties are in the Vocab
just because they don't have anywhere else to go.
- get_noun_chunks
- cfg: This is a dict that just stores oov_prob (hardcoded to -20)
- _unused_object: Leftover C member, should be removed in next major version