From 9ceb8f413c42f7a130a27f4fb8b277e917e63bc4 Mon Sep 17 00:00:00 2001 From: Paul O'Leary McCann Date: Thu, 16 Sep 2021 12:50:22 +0900 Subject: [PATCH] StringStore/Vocab dev docs (#9142) * First take at StringStore/Vocab docs Things to check: 1. The mysterious vocab members 2. How to make table of contents? Is it autogenerated? 3. Anything I missed / needs more detail? * Update docs * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem * Updates based on review feedback * Minor fix * Move example code down Co-authored-by: Sofie Van Landeghem --- extra/DEVELOPER_DOCS/StringStore-Vocab.md | 216 ++++++++++++++++++++++ 1 file changed, 216 insertions(+) create mode 100644 extra/DEVELOPER_DOCS/StringStore-Vocab.md diff --git a/extra/DEVELOPER_DOCS/StringStore-Vocab.md b/extra/DEVELOPER_DOCS/StringStore-Vocab.md new file mode 100644 index 000000000..866ba2aae --- /dev/null +++ b/extra/DEVELOPER_DOCS/StringStore-Vocab.md @@ -0,0 +1,216 @@ +# StringStore & Vocab + +> Reference: `spacy/strings.pyx` +> Reference: `spacy/vocab.pyx` + +## Overview + +spaCy represents mosts strings internally using a `uint64` in Cython which +corresponds to a hash. The magic required to make this largely transparent is +handled by the `StringStore`, and is integrated into the pipelines using the +`Vocab`, which also connects it to some other information. + +These are mostly internal details that average library users should never have +to think about. On the other hand, when developing a component it's normal to +interact with the Vocab for lexeme data or word vectors, and it's not unusual +to add labels to the `StringStore`. + +## StringStore + +### Overview + +The `StringStore` is a `cdef class` that looks a bit like a two-way dictionary, +though it is not a subclass of anything in particular. + +The main functionality of the `StringStore` is that `__getitem__` converts +hashes into strings or strings into hashes. + +The full details of the conversion are complicated. Normally you shouldn't have +to worry about them, but the first applicable case here is used to get the +return value: + +1. 0 and the empty string are special cased to each other +2. internal symbols use a lookup table (`SYMBOLS_BY_STR`) +3. normal strings or bytes are hashed +4. internal symbol IDs in `SYMBOLS_BY_INT` are handled +5. anything not yet handled is used as a hash to lookup a string + +For the symbol enums, see [`symbols.pxd`](https://github.com/explosion/spaCy/blob/master/spacy/symbols.pxd). + +Almost all strings in spaCy are stored in the `StringStore`. This naturally +includes tokens, but also includes things like labels (not just NER/POS/dep, +but also categories etc.), lemmas, lowercase forms, word shapes, and so on. One +of the main results of this is that tokens can be represented by a compact C +struct ([`LexemeC`](https://spacy.io/api/cython-structs#lexemec)/[`TokenC`](https://github.com/explosion/spaCy/issues/4854)) that mostly consists of string hashes. This also means that converting +input for the models is straightforward, and there's not a token mapping step +like in many machine learning frameworks. Additionally, because the token IDs +in spaCy are based on hashes, they are consistent across environments or +models. + +One pattern you'll see a lot in spaCy APIs is that `something.value` returns an +`int` and `something.value_` returns a string. That's implemented using the +`StringStore`. Typically the `int` is stored in a C struct and the string is +generated via a property that calls into the `StringStore` with the `int`. + +Besides `__getitem__`, the `StringStore` has functions to return specifically a +string or specifically a hash, regardless of whether the input was a string or +hash to begin with, though these are only used occasionally. + +### Implementation Details: Hashes and Allocations + +Hashes are 64-bit and are computed using [murmurhash][] on UTF-8 bytes. There is no +mechanism for detecting and avoiding collisions. To date there has never been a +reproducible collision or user report about any related issues. + +[murmurhash]: https://github.com/explosion/murmurhash + +The empty string is not hashed, it's just converted to/from 0. + +A small number of strings use indices into a lookup table (so low integers) +rather than hashes. This is mostly Universal Dependencies labels or other +strings considered "core" in spaCy. This was critical in v1, which hadn't +introduced hashing yet. Since v2 it's important for items in `spacy.attrs`, +especially lexeme flags, but is otherwise only maintained for backwards +compatibility. + +You can call `strings["mystring"]` with a string the `StringStore` has never seen +before and it will return a hash. But in order to do the reverse operation, you +need to call `strings.add("mystring")` first. Without a call to `add` the +string will not be interned. + +Example: + +``` +from spacy.strings import StringStore + +ss = StringStore() +hashval = ss["spacy"] # 10639093010105930009 +try: + # this won't work + ss[hashval] +except KeyError: + print(f"key {hashval} unknown in the StringStore.") + +ss.add("spacy") +assert ss[hashval] == "spacy" # it works now + +# There is no `.keys` property, but you can iterate over keys +# The empty string will never be in the list of keys +for key in ss: + print(key) +``` + +In normal use nothing is ever removed from the `StringStore`. In theory this +means that if you do something like iterate through all hex values of a certain +length you can have explosive memory usage. In practice this has never been an +issue. (Note that this is also different from using `sys.intern` to intern +Python strings, which does not guarantee they won't be garbage collected later.) + +Strings are stored in the `StringStore` in a peculiar way: each string uses a +union that is either an eight-byte `char[]` or a `char*`. Short strings are +stored directly in the `char[]`, while longer strings are stored in allocated +memory and prefixed with their length. This is a strategy to reduce indirection +and memory fragmentation. See `decode_Utf8Str` and `_allocate` in +`strings.pyx` for the implementation. + +### When to Use the StringStore? + +While you can ignore the `StringStore` in many cases, there are situations where +you should make use of it to avoid errors. + +Any time you introduce a string that may be set on a `Doc` field that has a hash, +you should add the string to the `StringStore`. This mainly happens when adding +labels in components, but there are some other cases: + +- syntax iterators, mainly `get_noun_chunks` +- external data used in components, like the `KnowledgeBase` in the `entity_linker` +- labels used in tests + +## Vocab + +The `Vocab` is a core component of a `Language` pipeline. Its main function is +to manage `Lexeme`s, which are structs that contain information about a token +that depends only on its surface form, without context. `Lexeme`s store much of +the data associated with `Token`s. As a side effect of this the `Vocab` also +manages the `StringStore` for a pipeline and a grab-bag of other data. + +These are things stored in the vocab: + +- `Lexeme`s +- `StringStore` +- `Morphology`: manages info used in `MorphAnalysis` objects +- `vectors`: basically a dict for word vectors +- `lookups`: language specific data like lemmas +- `writing_system`: language specific metadata +- `get_noun_chunks`: a syntax iterator +- lex attribute getters: functions like `is_punct`, set in language defaults +- `cfg`: **not** the pipeline config, this is mostly unused +- `_unused_object`: Formerly an unused object, kept around until v4 for compatability + +Some of these, like the Morphology and Vectors, are complex enough that they +need their own explanations. Here we'll just look at Vocab-specific items. + +### Lexemes + +A `Lexeme` is a type that mainly wraps a `LexemeC`, a struct consisting of ints +that identify various context-free token attributes. Lexemes are the core data +of the `Vocab`, and can be accessed using `__getitem__` on the `Vocab`. The memory +for storing `LexemeC` objects is managed by a pool that belongs to the `Vocab`. + +Note that `__getitem__` on the `Vocab` works much like the `StringStore`, in +that it accepts a hash or id, with one important difference: if you do a lookup +using a string, that value is added to the `StringStore` automatically. + +The attributes stored in a `LexemeC` are: + +- orth (the raw text) +- lower +- norm +- shape +- prefix +- suffix + +Most of these are straightforward. All of them can be customized, and (except +`orth`) probably should be since the defaults are based on English, but in +practice this is rarely done at present. + +### Lookups + +This is basically a dict of dicts, implemented using a `Table` for each +sub-dict, that stores lemmas and other language-specific lookup data. + +A `Table` is a subclass of `OrderedDict` used for string-to-string data. It uses +Bloom filters to speed up misses and has some extra serialization features. +Tables are not used outside of the lookups. + +### Lex Attribute Getters + +Lexical Attribute Getters like `is_punct` are defined on a per-language basis, +much like lookups, but take the form of functions rather than string-to-string +dicts, so they're stored separately. + +### Writing System + +This is a dict with three attributes: + +- `direction`: ltr or rtl (default ltr) +- `has_case`: bool (default `True`) +- `has_letters`: bool (default `True`, `False` only for CJK for now) + +Currently these are not used much - the main use is that `direction` is used in +visualizers, though `rtl` doesn't quite work (see +[#4854](https://github.com/explosion/spaCy/issues/4854)). In the future they +could be used when choosing hyperparameters for subwords, controlling word +shape generation, and similar tasks. + +### Other Vocab Members + +The Vocab is kind of the default place to store things from `Language.defaults` +that don't belong to the Tokenizer. The following properties are in the Vocab +just because they don't have anywhere else to go. + +- `get_noun_chunks` +- `cfg`: This is a dict that just stores `oov_prob` (hardcoded to `-20`) +- `_unused_object`: Leftover C member, should be removed in next major version + +