2016-10-24 11:49:03 +00:00
|
|
|
# cython: infer_types=True
|
2023-09-12 06:49:41 +00:00
|
|
|
# cython: profile=False
|
2016-03-24 14:09:55 +00:00
|
|
|
cimport cython
|
Support 'memory zones' for user memory management (#13621)
Add a context manager nlp.memory_zone(), which will begin
memory_zone() blocks on the vocab, string store, and potentially
other components.
Example usage:
```
with nlp.memory_zone():
    for doc in nlp.pipe(texts):
        do_something(doc)
# do_something(doc) <-- Invalid
```
Once the memory_zone() block expires, spaCy will free any shared
resources that were allocated for the text-processing that occurred
within the memory_zone. If you create Doc objects within a memory
zone, it's invalid to access them once the memory zone is expired.
The purpose of this is that spaCy creates and stores Lexeme objects
in the Vocab that can be shared between multiple Doc objects. It also
interns strings. Normally, spaCy can't know when all Doc objects using
a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab,
causing memory pressure.
Memory zones solve this problem by telling spaCy "okay, none of the
documents allocated within this block will be accessed again". This
lets spaCy free all new Lexeme objects and other data that were
created during the block.
The mechanism is general, so memory_zone() context managers can be
added to other components that could benefit from them, e.g. pipeline
components.
I experimented with adding memory zone support to the tokenizer as well,
for its cache. However, this seems unnecessarily complicated. It makes
more sense to just stick a limit on the cache size. This lets spaCy
benefit from the efficiency advantage of the cache better, because
we can maintain a (bounded) cache even if only small batches of
documents are being processed.
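As a rough illustration of the idea described above, the following pure-Python sketch mimics a store with a permanent map and a transient map that is discarded when the zone ends. `ToyStore` and its attributes are hypothetical names, not spaCy's implementation, and Python's built-in `hash` stands in for the real 64-bit hash.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


class ToyStore:
    """Hypothetical stand-in for an interning store; not spaCy's StringStore."""

    def __init__(self) -> None:
        self._permanent: Dict[int, str] = {}
        self._transient: Optional[Dict[int, str]] = None

    @contextmanager
    def memory_zone(self) -> Iterator[None]:
        # Strings added with allow_transient=True land in this map.
        self._transient = {}
        try:
            yield
        finally:
            # Dropping the map releases everything interned inside the zone.
            self._transient = None

    def add(self, string: str, allow_transient: bool = False) -> int:
        key = hash(string)  # assumption: built-in hash replaces the real hash
        if key in self._permanent:
            return key
        if allow_transient and self._transient is not None:
            self._transient[key] = string
        else:
            self._permanent[key] = string
        return key

    def __len__(self) -> int:
        return len(self._permanent) + len(self._transient or {})


store = ToyStore()
store.add("label")
with store.memory_zone():
    store.add("some rare token", allow_transient=True)
    size_inside = len(store)
size_after = len(store)
```

In this sketch the permanent entry survives the zone while the transient one does not, which is the behaviour the commit message describes for strings seen during text processing versus labels.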
2024-09-09 09:19:39 +00:00
|
|
|
|
|
|
|
from contextlib import contextmanager
|
|
|
|
from typing import Iterator, List, Optional
|
|
|
|
|
2023-06-14 15:48:41 +00:00
|
|
|
from libc.stdint cimport uint32_t
|
2014-12-19 19:42:01 +00:00
|
|
|
from libc.string cimport memcpy
|
2023-06-14 15:48:41 +00:00
|
|
|
from murmurhash.mrmr cimport hash32, hash64
|
2020-03-02 10:48:10 +00:00
|
|
|
|
💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003)
Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉
See here: https://github.com/explosion/srsly
Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place.
At the same time, we noticed that having a lot of small dependencies was making maintenance harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel.
srsly currently includes forks of the following packages:
ujson
msgpack
msgpack-numpy
cloudpickle
* WIP: replace json/ujson with srsly
* Replace ujson in examples
Use regular json instead of srsly to make code easier to read and follow
* Update requirements
* Fix imports
* Fix typos
* Replace msgpack with srsly
* Fix warning
2018-12-03 00:28:22 +00:00
|
|
|
import srsly
|
2015-11-05 11:28:26 +00:00
|
|
|
|
2020-03-02 10:48:10 +00:00
|
|
|
from .typedefs cimport hash_t
|
|
|
|
|
2023-06-14 15:48:41 +00:00
|
|
|
from . import util
|
|
|
|
from .errors import Errors
|
2017-05-28 11:03:16 +00:00
|
|
|
from .symbols import IDS as SYMBOLS_BY_STR
|
2019-03-07 11:52:15 +00:00
|
|
|
from .symbols import NAMES as SYMBOLS_BY_INT
|
2023-06-14 15:48:41 +00:00
|
|
|
|
2014-12-19 19:42:01 +00:00
|
|
|
|
2022-07-04 13:04:03 +00:00
|
|
|
# Not particularly elegant, but this is faster than `isinstance(key, numbers.Integral)`
|
|
|
|
cdef inline bint _try_coerce_to_hash(object key, hash_t* out_hash):
|
|
|
|
try:
|
|
|
|
out_hash[0] = key
|
|
|
|
return True
|
2023-07-19 10:03:31 +00:00
|
|
|
except: # no-cython-lint
|
2022-07-04 13:04:03 +00:00
|
|
|
return False
|
2014-12-19 19:42:01 +00:00
|
|
|
|
2023-07-19 10:03:31 +00:00
|
|
|
|
2018-12-10 15:09:26 +00:00
|
|
|
def get_string_id(key):
|
|
|
|
"""Get a string ID, handling the reserved symbols correctly. If the key is
|
|
|
|
already an ID, return it.
|
2019-03-08 10:42:26 +00:00
|
|
|
|
2018-12-10 15:09:26 +00:00
|
|
|
This function optimises for convenience over performance, so shouldn't be
|
|
|
|
used in tight loops.
|
|
|
|
"""
|
2024-09-09 09:19:39 +00:00
|
|
|
cdef hash_t str_hash
|
2022-07-04 13:04:03 +00:00
|
|
|
if isinstance(key, str):
|
|
|
|
if len(key) == 0:
|
|
|
|
return 0
|
|
|
|
|
|
|
|
symbol = SYMBOLS_BY_STR.get(key, None)
|
|
|
|
if symbol is not None:
|
|
|
|
return symbol
|
|
|
|
else:
|
|
|
|
chars = key.encode("utf8")
|
|
|
|
return hash_utf8(chars, len(chars))
|
|
|
|
elif _try_coerce_to_hash(key, &str_hash):
|
|
|
|
# Coerce the integral key to the expected primitive hash type.
|
|
|
|
# This ensures that custom/overloaded "primitive" data types
|
2024-09-09 09:19:39 +00:00
|
|
|
# such as those implemented by numpy are not inadvertently used
|
|
|
|
# downstream (as these are internally implemented as custom PyObjects
|
2022-07-04 13:04:03 +00:00
|
|
|
# whose comparison operators can incur a significant overhead).
|
|
|
|
return str_hash
|
2018-12-10 15:09:26 +00:00
|
|
|
else:
|
2022-07-04 13:04:03 +00:00
|
|
|
# TODO: Raise an error instead
|
|
|
|
return key
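The coercion in `_try_coerce_to_hash` relies on Cython converting a Python object to a C integer on assignment. A rough pure-Python analogue is sketched below, assuming `operator.index` as the stand-in for that conversion. The helper name `try_coerce_to_hash` and the `FakeInt64` class are hypothetical and only serve the illustration.

```python
import operator
from typing import Optional

UINT64_MAX = 2**64 - 1


def try_coerce_to_hash(key: object) -> Optional[int]:
    """Return key as a plain int in the uint64 range, or None if it does not fit."""
    try:
        value = operator.index(key)  # accepts int and objects defining __index__
    except TypeError:
        return None
    if 0 <= value <= UINT64_MAX:
        return value
    return None


class FakeInt64:
    """Hypothetical integer-like wrapper, standing in for e.g. a numpy scalar."""

    def __init__(self, value: int) -> None:
        self._value = value

    def __index__(self) -> int:
        return self._value


coerced = try_coerce_to_hash(FakeInt64(42))
rejected = try_coerce_to_hash("apple")
```

The point of the coercion is that the result is a primitive integer, so later comparisons and lookups no longer go through the wrapper object's own operators.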
|
2018-12-10 15:09:26 +00:00
|
|
|
|
|
|
|
|
2021-09-13 15:02:17 +00:00
|
|
|
cpdef hash_t hash_string(str string) except 0:
|
2019-03-08 10:42:26 +00:00
|
|
|
chars = string.encode("utf8")
|
2017-03-07 16:15:18 +00:00
|
|
|
return hash_utf8(chars, len(chars))
|
2016-03-24 14:09:55 +00:00
|
|
|
|
|
|
|
|
2017-03-07 16:15:18 +00:00
|
|
|
cdef hash_t hash_utf8(char* utf8_string, int length) nogil:
|
2016-09-30 18:20:22 +00:00
|
|
|
return hash64(utf8_string, length, 1)
|
2015-01-11 23:26:22 +00:00
|
|
|
|
|
|
|
|
2017-03-07 16:15:18 +00:00
|
|
|
cdef uint32_t hash32_utf8(char* utf8_string, int length) nogil:
|
2016-11-01 12:27:13 +00:00
|
|
|
return hash32(utf8_string, length, 1)
|
|
|
|
|
|
|
|
|
2021-09-13 15:02:17 +00:00
|
|
|
cdef str decode_Utf8Str(const Utf8Str* string):
|
2015-07-20 10:05:23 +00:00
|
|
|
cdef int i, length
|
2015-07-20 09:26:46 +00:00
|
|
|
if string.s[0] < sizeof(string.s) and string.s[0] != 0:
|
2019-03-08 10:42:26 +00:00
|
|
|
return string.s[1:string.s[0]+1].decode("utf8")
|
2015-07-20 10:05:23 +00:00
|
|
|
elif string.p[0] < 255:
|
2019-03-08 10:42:26 +00:00
|
|
|
return string.p[1:string.p[0]+1].decode("utf8")
|
2015-07-20 09:26:46 +00:00
|
|
|
else:
|
2015-07-20 10:05:23 +00:00
|
|
|
i = 0
|
|
|
|
length = 0
|
|
|
|
while string.p[i] == 255:
|
|
|
|
i += 1
|
|
|
|
length += 255
|
|
|
|
length += string.p[i]
|
2015-07-20 09:26:46 +00:00
|
|
|
i += 1
|
2019-03-08 10:42:26 +00:00
|
|
|
return string.p[i:length + i].decode("utf8")
|
2015-07-20 09:26:46 +00:00
|
|
|
|
|
|
|
|
2017-05-28 10:36:27 +00:00
|
|
|
cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) except *:
|
2015-07-20 10:05:23 +00:00
|
|
|
cdef int n_length_bytes
|
|
|
|
cdef int i
|
2017-05-28 10:36:27 +00:00
|
|
|
cdef Utf8Str* string = <Utf8Str*>mem.alloc(1, sizeof(Utf8Str))
|
2015-07-20 09:26:46 +00:00
|
|
|
if length < sizeof(string.s):
|
|
|
|
string.s[0] = <unsigned char>length
|
|
|
|
memcpy(&string.s[1], chars, length)
|
|
|
|
return string
|
2015-07-20 10:05:23 +00:00
|
|
|
elif length < 255:
|
2015-07-20 09:26:46 +00:00
|
|
|
string.p = <unsigned char*>mem.alloc(length + 1, sizeof(unsigned char))
|
|
|
|
string.p[0] = length
|
|
|
|
memcpy(&string.p[1], chars, length)
|
|
|
|
return string
|
|
|
|
else:
|
2015-07-20 10:05:23 +00:00
|
|
|
i = 0
|
|
|
|
n_length_bytes = (length // 255) + 1
|
|
|
|
string.p = <unsigned char*>mem.alloc(length + n_length_bytes, sizeof(unsigned char))
|
|
|
|
for i in range(n_length_bytes-1):
|
|
|
|
string.p[i] = 255
|
|
|
|
string.p[n_length_bytes-1] = length % 255
|
|
|
|
memcpy(&string.p[n_length_bytes], chars, length)
|
|
|
|
return string
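The length prefix written by `_allocate` and read back by `decode_Utf8Str` can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: it covers only the heap-allocated layout (`string.p`), leaves out the inline small-string case (`string.s`) and the `Pool` allocation, and the function names are hypothetical.

```python
def encode_length_prefixed(data: bytes) -> bytes:
    """Prefix data with its length, using runs of 255 for long inputs."""
    n_length_bytes = len(data) // 255 + 1
    # All but the last prefix byte are 255; the last holds the remainder.
    prefix = bytes([255] * (n_length_bytes - 1) + [len(data) % 255])
    return prefix + data


def decode_length_prefixed(buffer: bytes) -> bytes:
    """Invert encode_length_prefixed by summing prefix bytes up to the first non-255."""
    i = 0
    length = 0
    while buffer[i] == 255:
        length += 255
        i += 1
    length += buffer[i]
    i += 1
    return buffer[i:i + length]


short_data = "hello".encode("utf8")
long_data = ("x" * 600).encode("utf8")
short_encoded = encode_length_prefixed(short_data)
long_encoded = encode_length_prefixed(long_data)
```

For inputs shorter than 255 bytes the prefix is a single byte, which matches the `elif length < 255` branch. Longer inputs spend one extra byte per 255 bytes of payload.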
|
2014-12-21 20:25:43 +00:00
|
|
|
|
2017-05-28 16:19:11 +00:00
|
|
|
|
2014-12-19 19:42:01 +00:00
|
|
|
cdef class StringStore:
|
2019-03-08 10:42:26 +00:00
|
|
|
"""Look up strings by 64-bit hashes.
|
|
|
|
|
2021-01-30 09:09:38 +00:00
|
|
|
DOCS: https://spacy.io/api/stringstore
|
2019-03-08 10:42:26 +00:00
|
|
|
"""
|
2016-10-24 11:49:03 +00:00
|
|
|
def __init__(self, strings=None, freeze=False):
|
2017-05-21 12:18:58 +00:00
|
|
|
"""Create the StringStore.
|
2016-11-01 11:25:36 +00:00
|
|
|
|
2017-05-21 12:18:58 +00:00
|
|
|
strings (iterable): A sequence of unicode strings to add to the store.
|
2017-04-15 09:59:21 +00:00
|
|
|
"""
|
2014-12-19 19:42:01 +00:00
|
|
|
self.mem = Pool()
|
2024-09-09 09:19:39 +00:00
|
|
|
self._non_temp_mem = self.mem
|
2014-12-19 19:42:01 +00:00
|
|
|
self._map = PreshMap()
|
2024-09-09 09:19:39 +00:00
|
|
|
self._transient_map = None
|
2015-10-12 04:12:32 +00:00
|
|
|
if strings is not None:
|
|
|
|
for string in strings:
|
2017-05-28 10:36:27 +00:00
|
|
|
self.add(string)
|
2015-06-22 22:02:50 +00:00
|
|
|
|
2014-12-19 19:42:01 +00:00
|
|
|
def __getitem__(self, object string_or_id):
|
2017-05-28 16:19:11 +00:00
|
|
|
"""Retrieve a string from a given hash, or vice versa.
|
2017-04-15 09:59:21 +00:00
|
|
|
|
2021-09-13 15:02:17 +00:00
|
|
|
string_or_id (bytes, str or uint64): The value to encode.
|
2020-05-24 16:51:10 +00:00
|
|
|
RETURNS (str / uint64): The value to be retrieved.
|
2016-11-01 11:25:36 +00:00
|
|
|
"""
|
2022-07-04 13:04:03 +00:00
|
|
|
cdef hash_t str_hash
|
|
|
|
cdef Utf8Str* utf8str = NULL
|
|
|
|
|
2021-09-13 15:02:17 +00:00
|
|
|
if isinstance(string_or_id, str):
|
2022-07-04 13:04:03 +00:00
|
|
|
if len(string_or_id) == 0:
|
|
|
|
return 0
|
|
|
|
|
|
|
|
# Return early if the string is found in the symbols LUT.
|
|
|
|
symbol = SYMBOLS_BY_STR.get(string_or_id, None)
|
|
|
|
if symbol is not None:
|
|
|
|
return symbol
|
|
|
|
else:
|
|
|
|
return hash_string(string_or_id)
|
2017-05-28 10:36:27 +00:00
|
|
|
elif isinstance(string_or_id, bytes):
|
2022-07-04 13:04:03 +00:00
|
|
|
return hash_utf8(string_or_id, len(string_or_id))
|
|
|
|
elif _try_coerce_to_hash(string_or_id, &str_hash):
|
|
|
|
if str_hash == 0:
|
|
|
|
return ""
|
|
|
|
elif str_hash < len(SYMBOLS_BY_INT):
|
|
|
|
return SYMBOLS_BY_INT[str_hash]
|
2016-10-24 11:49:03 +00:00
|
|
|
else:
|
2022-07-04 13:04:03 +00:00
|
|
|
utf8str = <Utf8Str*>self._map.get(str_hash)
|
2024-09-09 09:19:39 +00:00
|
|
|
if utf8str is NULL and self._transient_map is not None:
|
|
|
|
utf8str = <Utf8Str*>self._transient_map.get(str_hash)
|
2022-07-04 13:04:03 +00:00
|
|
|
else:
|
|
|
|
# TODO: Raise an error instead
|
|
|
|
utf8str = <Utf8Str*>self._map.get(string_or_id)
|
2024-09-09 09:19:39 +00:00
|
|
|
if utf8str is NULL and self._transient_map is not None:
|
|
|
|
utf8str = <Utf8Str*>self._transient_map.get(string_or_id)
|
2022-07-04 13:04:03 +00:00
|
|
|
if utf8str is NULL:
|
|
|
|
raise KeyError(Errors.E018.format(hash_value=string_or_id))
|
|
|
|
else:
|
|
|
|
return decode_Utf8Str(utf8str)
|
2017-05-28 10:36:27 +00:00
|
|
|
|
2018-09-24 13:25:20 +00:00
|
|
|
def as_int(self, key):
|
|
|
|
"""If key is an int, return it; otherwise, get the int value."""
|
2021-09-13 15:02:17 +00:00
|
|
|
if not isinstance(key, str):
|
2018-09-24 13:25:20 +00:00
|
|
|
return key
|
|
|
|
else:
|
|
|
|
return self[key]
|
|
|
|
|
|
|
|
def as_string(self, key):
|
|
|
|
"""If key is a string, return it; otherwise, get the string value."""
|
2021-09-13 15:02:17 +00:00
|
|
|
if isinstance(key, str):
|
2018-09-24 13:25:20 +00:00
|
|
|
return key
|
|
|
|
else:
|
|
|
|
return self[key]
|
2019-12-22 00:53:56 +00:00
|
|
|
|
2024-09-09 09:19:39 +00:00
|
|
|
def __reduce__(self):
|
|
|
|
strings = list(self.non_transient_keys())
|
|
|
|
return (StringStore, (strings,), None, None, None)
|
|
|
|
|
|
|
|
def __len__(self) -> int:
|
|
|
|
"""The number of strings in the store.
|
|
|
|
|
|
|
|
RETURNS (int): The number of strings in the store.
|
|
|
|
"""
|
|
|
|
return self._keys.size() + self._transient_keys.size()
|
|
|
|
|
|
|
|
@contextmanager
|
|
|
|
def memory_zone(self, mem: Optional[Pool] = None) -> Iterator[Pool]:
|
|
|
|
"""Begin a block where all resources allocated during the block will
|
|
|
|
be freed at the end of it. If a resource was created within the
|
|
|
|
memory zone block, accessing it outside the block is invalid.
|
|
|
|
Behaviour of this invalid access is undefined. Memory zones should
|
|
|
|
not be nested.
|
|
|
|
|
|
|
|
The memory zone is helpful for services that need to process large
|
|
|
|
volumes of text with a defined memory budget.
|
|
|
|
"""
|
|
|
|
if mem is None:
|
|
|
|
mem = Pool()
|
|
|
|
self.mem = mem
|
|
|
|
self._transient_map = PreshMap()
|
|
|
|
yield mem
|
|
|
|
self.mem = self._non_temp_mem
|
|
|
|
self._transient_map = None
|
|
|
|
self._transient_keys.clear()
|
|
|
|
|
|
|
|
def add(self, string: str, allow_transient: bool = False) -> int:
|
2017-05-28 16:19:11 +00:00
|
|
|
"""Add a string to the StringStore.
|
|
|
|
|
2020-05-24 15:20:58 +00:00
|
|
|
string (str): The string to add.
|
2024-09-09 09:19:39 +00:00
|
|
|
allow_transient (bool): Allow the string to be stored in the 'transient'
|
|
|
|
map, which will be flushed at the end of the memory zone. Strings
|
|
|
|
encountered during arbitrary text processing should be added
|
|
|
|
with allow_transient=True, while labels and other strings used
|
|
|
|
internally should not.
|
2017-05-28 16:19:11 +00:00
|
|
|
RETURNS (uint64): The string's hash value.
|
|
|
|
"""
|
2022-07-04 13:04:03 +00:00
|
|
|
cdef hash_t str_hash
|
2021-09-13 15:02:17 +00:00
|
|
|
if isinstance(string, str):
|
2017-05-28 11:03:16 +00:00
|
|
|
if string in SYMBOLS_BY_STR:
|
|
|
|
return SYMBOLS_BY_STR[string]
|
2022-07-04 13:04:03 +00:00
|
|
|
|
|
|
|
string = string.encode("utf8")
|
|
|
|
str_hash = hash_utf8(string, len(string))
|
2024-09-09 09:19:39 +00:00
|
|
|
self._intern_utf8(string, len(string), &str_hash, allow_transient)
|
2017-05-28 10:36:27 +00:00
|
|
|
elif isinstance(string, bytes):
|
2017-05-28 11:03:16 +00:00
|
|
|
if string in SYMBOLS_BY_STR:
|
|
|
|
return SYMBOLS_BY_STR[string]
|
2022-07-04 13:04:03 +00:00
|
|
|
str_hash = hash_utf8(string, len(string))
|
2024-09-09 09:19:39 +00:00
|
|
|
self._intern_utf8(string, len(string), &str_hash, allow_transient)
|
2017-05-28 10:36:27 +00:00
|
|
|
else:
|
2018-04-03 13:50:31 +00:00
|
|
|
raise TypeError(Errors.E017.format(value_type=type(string)))
|
2022-07-04 13:04:03 +00:00
|
|
|
return str_hash
|
2017-05-28 10:36:27 +00:00
|
|
|
|
|
|
|
def __len__(self):
|
|
|
|
"""The number of strings in the store.
|
2024-09-09 09:19:39 +00:00
|
|
|
if string in SYMBOLS_BY_STR:
|
|
|
|
return SYMBOLS_BY_STR[string]
|
|
|
|
else:
|
|
|
|
return self._intern_str(string, allow_transient)
|
2017-05-28 10:36:27 +00:00
|
|
|
|
|
|
|
RETURNS (int): The number of strings in the store.
|
|
|
|
"""
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
return self.keys.size() + self._transient_keys.size()
|
2014-12-19 19:42:01 +00:00
|
|
|
|
2022-07-04 13:04:03 +00:00
|
|
|
def __contains__(self, string_or_id not None):
|
|
|
|
"""Check whether a string or ID is in the store.
|
2016-11-01 11:25:36 +00:00
|
|
|
|
2022-07-04 13:04:03 +00:00
|
|
|
string_or_id (str or int): The string or hash ID to check.
|
2017-05-21 12:18:58 +00:00
|
|
|
RETURNS (bool): Whether the store contains the string or ID.
|
2016-11-01 11:25:36 +00:00
|
|
|
"""
|
2022-07-04 13:04:03 +00:00
|
|
|
cdef hash_t str_hash
|
|
|
|
if isinstance(string_or_id, str):
|
|
|
|
if len(string_or_id) == 0:
|
2017-05-28 16:09:27 +00:00
|
|
|
return True
|
2022-07-04 13:04:03 +00:00
|
|
|
elif string_or_id in SYMBOLS_BY_STR:
|
|
|
|
return True
|
|
|
|
str_hash = hash_string(string_or_id)
|
|
|
|
elif _try_coerce_to_hash(string_or_id, &str_hash):
|
|
|
|
pass
|
2017-05-28 16:09:27 +00:00
|
|
|
else:
|
2022-07-04 13:04:03 +00:00
|
|
|
# TODO: Raise an error instead
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
if self._map.get(string_or_id) is not NULL:
|
|
|
|
return True
|
|
|
|
elif self._transient_map is not None and self._transient_map.get(string_or_id) is not NULL:
|
|
|
|
return True
|
|
|
|
else:
|
|
|
|
return False
|
2022-07-04 13:04:03 +00:00
|
|
|
if str_hash < len(SYMBOLS_BY_INT):
|
2019-03-07 11:52:15 +00:00
|
|
|
return True
|
|
|
|
else:
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
if self._map.get(str_hash) is not NULL:
|
|
|
|
return True
|
|
|
|
elif self._transient_map is not None and self._transient_map.get(str_hash) is not NULL:
|
|
|
|
return True
|
|
|
|
else:
|
|
|
|
return False
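The lookup order in `__contains__` can be sketched in plain Python. This is a simplified model, not the library's implementation: `ToyStore`, `toy_hash` and the three-entry symbol table are hypothetical stand-ins, and ordinary dicts replace the hash maps used above.

```python
import hashlib

# Hypothetical stand-ins: the real symbol table and the real 64-bit hash
# function are defined elsewhere in the library.
SYMBOLS_BY_STR = {"": 0, "IS_ALPHA": 1, "IS_ASCII": 2}
SYMBOLS_BY_INT = {value: key for key, value in SYMBOLS_BY_STR.items()}


def toy_hash(text: str) -> int:
    """Return a stable 64-bit integer for a string (illustrative only)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStore:
    def __init__(self):
        self.permanent = {}    # hash -> string, kept for the store's lifetime
        self.transient = None  # hash -> string, only set inside a memory zone

    def __contains__(self, string_or_id) -> bool:
        if isinstance(string_or_id, str):
            # The empty string and built-in symbols always count as present.
            if len(string_or_id) == 0 or string_or_id in SYMBOLS_BY_STR:
                return True
            key = toy_hash(string_or_id)
        elif isinstance(string_or_id, int):
            key = string_or_id
        else:
            return False
        if key < len(SYMBOLS_BY_INT):
            return True
        if key in self.permanent:
            return True
        # Look up the computed hash in the transient table as well.
        return self.transient is not None and key in self.transient
```

Both strings and integer hashes resolve to the same key before either table is consulted, so `"apple" in store` and `toy_hash("apple") in store` give the same answer in this model.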
|
2016-03-08 15:49:10 +00:00
|
|
|
|
2015-08-22 20:04:34 +00:00
|
|
|
def __iter__(self):
|
2017-05-21 12:18:58 +00:00
|
|
|
"""Iterate over the strings in the store, in order.
|
2016-11-01 11:25:36 +00:00
|
|
|
|
2020-05-24 15:20:58 +00:00
|
|
|
YIELDS (str): A string in the store.
|
2016-11-01 11:25:36 +00:00
|
|
|
"""
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
yield from self.non_transient_keys()
|
|
|
|
yield from self.transient_keys()
|
|
|
|
|
|
|
|
def non_transient_keys(self) -> Iterator[str]:
|
|
|
|
"""Iterate over the permanently stored (non-transient) strings in insertion order.
|
|
|
|
|
|
|
|
YIELDS (str): A string in the store.
|
|
|
|
"""
|
2015-08-22 20:04:34 +00:00
|
|
|
cdef int i
|
2017-05-28 10:36:27 +00:00
|
|
|
cdef hash_t key
|
|
|
|
for i in range(self.keys.size()):
|
|
|
|
key = self.keys[i]
|
|
|
|
utf8str = <Utf8Str*>self._map.get(key)
|
|
|
|
yield decode_Utf8Str(utf8str)
|
2015-08-22 20:04:34 +00:00
|
|
|
|
2015-10-12 04:12:32 +00:00
|
|
|
def __reduce__(self):
|
2017-05-28 10:36:27 +00:00
|
|
|
strings = list(self)
|
2015-10-12 04:12:32 +00:00
|
|
|
return (StringStore, (strings,), None, None, None)
|
|
|
|
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
def transient_keys(self) -> Iterator[str]:
|
|
|
|
if self._transient_map is None:
|
|
|
|
return
|
|
|
|
for i in range(self._transient_keys.size()):
|
|
|
|
utf8str = <Utf8Str*>self._transient_map.get(self._transient_keys[i])
|
|
|
|
yield decode_Utf8Str(utf8str)
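A minimal pure-Python sketch of how permanent and transient keys interact with a memory zone. It is an illustration under stated assumptions: `ToyStore` is hypothetical, Python's built-in `hash` stands in for the store's hash function, and dicts replace the underlying maps and key vectors.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


class ToyStore:
    def __init__(self) -> None:
        self._permanent: Dict[int, str] = {}
        self._transient: Optional[Dict[int, str]] = None

    @contextmanager
    def memory_zone(self) -> Iterator[None]:
        """Collect new strings in a transient table that is dropped on exit."""
        self._transient = {}
        try:
            yield
        finally:
            self._transient = None

    def add(self, string: str, allow_transient: bool = False) -> int:
        key = hash(string)  # placeholder for the store's real hash function
        if key in self._permanent:
            return key
        if allow_transient and self._transient is not None:
            self._transient.setdefault(key, string)
        else:
            self._permanent[key] = string
        return key

    def non_transient_keys(self) -> Iterator[str]:
        yield from self._permanent.values()

    def transient_keys(self) -> Iterator[str]:
        if self._transient is None:
            return  # a bare return ends the generator without yielding
        yield from self._transient.values()

    def __iter__(self) -> Iterator[str]:
        # Permanent strings come first, then any transient ones.
        yield from self.non_transient_keys()
        yield from self.transient_keys()


store = ToyStore()
store.add("apple")
with store.memory_zone():
    store.add("pear", allow_transient=True)
    inside = list(store)  # ['apple', 'pear']
after = list(store)       # ['apple']
```

Strings added with `allow_transient=True` outside a zone fall back to the permanent table in this sketch, since there is no transient table to receive them.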
|
|
|
|
|
|
|
|
def values(self) -> List[int]:
|
|
|
|
"""Return the stored string hashes in insertion order.
|
|
|
|
|
|
|
|
RETURNS (List[int]): A list of string hashes.
|
|
|
|
"""
|
|
|
|
cdef int i
|
|
|
|
hashes = [None] * self.keys.size()
|
|
|
|
for i in range(self.keys.size()):
|
|
|
|
hashes[i] = self.keys[i]
|
|
|
|
if self._transient_map is not None:
|
|
|
|
transient_hashes = [None] * self._transient_keys.size()
|
|
|
|
for i in range(self._transient_keys.size()):
|
|
|
|
transient_hashes[i] = self._transient_keys[i]
|
|
|
|
else:
|
|
|
|
transient_hashes = []
|
|
|
|
return hashes + transient_hashes
|
|
|
|
|
2017-05-21 12:18:58 +00:00
|
|
|
def to_disk(self, path):
|
|
|
|
"""Save the current state to a directory.
|
|
|
|
|
2020-05-24 16:51:10 +00:00
|
|
|
path (str / Path): A path to a directory, which will be created if
|
2017-10-27 19:07:59 +00:00
|
|
|
it doesn't exist. Paths may be either strings or Path-like objects.
|
2017-05-21 12:18:58 +00:00
|
|
|
"""
|
2017-05-22 10:38:00 +00:00
|
|
|
path = util.ensure_path(path)
|
2021-04-09 09:53:13 +00:00
|
|
|
strings = sorted(self)
|
💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003)
Remove hacks and wrappers, keep code in sync across our libraries and move spaCy a few steps closer to only depending on packages with binary wheels 🎉
See here: https://github.com/explosion/srsly
Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy have steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place.
At the same time, we noticed that having a lot of small dependencies was making maintenance harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel.
srsly currently includes forks of the following packages:
ujson
msgpack
msgpack-numpy
cloudpickle
* WIP: replace json/ujson with srsly
* Replace ujson in examples
Use regular json instead of srsly to make code easier to read and follow
* Update requirements
* Fix imports
* Fix typos
* Replace msgpack with srsly
* Fix warning
2018-12-03 00:28:22 +00:00
|
|
|
srsly.write_json(path, strings)
|
2017-05-21 12:18:58 +00:00
|
|
|
|
|
|
|
def from_disk(self, path):
|
|
|
|
"""Load state from a directory. Modifies the object in place and
|
|
|
|
returns it.
|
|
|
|
|
2020-05-24 16:51:10 +00:00
|
|
|
path (str / Path): A path to a directory. Paths may be either
|
2017-05-21 12:18:58 +00:00
|
|
|
strings or `Path`-like objects.
|
|
|
|
RETURNS (StringStore): The modified `StringStore` object.
|
|
|
|
"""
|
2017-05-22 10:38:00 +00:00
|
|
|
path = util.ensure_path(path)
|
💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003)
2018-12-03 00:28:22 +00:00
|
|
|
strings = srsly.read_json(path)
|
2017-08-19 20:42:17 +00:00
|
|
|
prev = list(self)
|
2017-05-22 10:38:00 +00:00
|
|
|
self._reset_and_load(strings)
|
2017-08-19 20:42:17 +00:00
|
|
|
for word in prev:
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
self.add(word, allow_transient=False)
|
2017-05-22 10:38:00 +00:00
|
|
|
return self
|
2017-05-21 12:18:58 +00:00
|
|
|
|
2019-03-10 18:16:45 +00:00
|
|
|
def to_bytes(self, **kwargs):
|
2017-05-21 12:18:58 +00:00
|
|
|
"""Serialize the current state to a binary string.
|
|
|
|
|
|
|
|
RETURNS (bytes): The serialized form of the `StringStore` object.
|
|
|
|
"""
|
2021-04-09 09:53:13 +00:00
|
|
|
return srsly.json_dumps(sorted(self))
|
2017-05-21 12:18:58 +00:00
|
|
|
|
2019-03-10 18:16:45 +00:00
|
|
|
def from_bytes(self, bytes_data, **kwargs):
|
2017-05-21 12:18:58 +00:00
|
|
|
"""Load state from a binary string.
|
|
|
|
|
|
|
|
bytes_data (bytes): The data to load from.
|
|
|
|
RETURNS (StringStore): The `StringStore` object.
|
|
|
|
"""
|
💫 Replace ujson, msgpack and dill/pickle/cloudpickle with srsly (#3003)
2018-12-03 00:28:22 +00:00
|
|
|
strings = srsly.json_loads(bytes_data)
|
2017-08-19 20:42:17 +00:00
|
|
|
prev = list(self)
|
2017-05-22 10:38:00 +00:00
|
|
|
self._reset_and_load(strings)
|
2017-08-19 20:42:17 +00:00
|
|
|
for word in prev:
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
self.add(word, allow_transient=False)
|
2017-05-22 10:38:00 +00:00
|
|
|
return self
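The serialization round trip above (dump a sorted list, then on load reset the store and re-add whatever it held before) can be sketched as follows. This is a simplified model: `ToyStore` is hypothetical, a plain list replaces the hash map, and the standard-library `json` module stands in for `srsly`.

```python
import json
from typing import Iterable, Iterator, List


class ToyStore:
    def __init__(self, strings: Iterable[str] = ()) -> None:
        self._strings: List[str] = []
        for string in strings:
            self.add(string)

    def add(self, string: str) -> None:
        if string not in self._strings:
            self._strings.append(string)

    def __iter__(self) -> Iterator[str]:
        return iter(self._strings)

    def to_bytes(self) -> bytes:
        # Sorting makes the output independent of insertion order.
        return json.dumps(sorted(self)).encode("utf8")

    def from_bytes(self, bytes_data: bytes) -> "ToyStore":
        loaded = json.loads(bytes_data)
        previous = list(self)
        self._strings = []       # reset, then load the serialized strings
        for string in loaded:
            self.add(string)
        for string in previous:  # re-add what was present before loading
            self.add(string)
        return self


a = ToyStore(["pear", "apple"])
b = ToyStore(["kiwi"])
b.from_bytes(a.to_bytes())
merged = list(b)  # ['apple', 'pear', 'kiwi']
```

Loading therefore merges rather than replaces: the serialized strings come first, and strings the store already held are appended afterwards.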
|
2017-05-21 12:18:58 +00:00
|
|
|
|
2017-10-16 17:23:10 +00:00
|
|
|
def _reset_and_load(self, strings):
|
2017-05-22 10:38:00 +00:00
|
|
|
self.mem = Pool()
|
|
|
|
self._map = PreshMap()
|
2017-05-28 10:36:27 +00:00
|
|
|
self.keys.clear()
|
2017-05-22 10:38:00 +00:00
|
|
|
for string in strings:
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
self.add(string, allow_transient=False)
|
2017-05-22 10:38:00 +00:00
|
|
|
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
cdef const Utf8Str* intern_unicode(self, str py_string, bint allow_transient):
|
2016-09-30 18:20:22 +00:00
|
|
|
# 0 means missing, but we don't bother offsetting the index.
|
2019-03-08 10:42:26 +00:00
|
|
|
cdef bytes byte_string = py_string.encode("utf8")
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
return self._intern_utf8(byte_string, len(byte_string), NULL, allow_transient)
|
2015-07-20 09:26:46 +00:00
|
|
|
|
2016-09-30 08:14:47 +00:00
|
|
|
@cython.final
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
cdef const Utf8Str* _intern_utf8(self, char* utf8_string, int length, hash_t* precalculated_hash, bint allow_transient):
|
2016-10-24 11:49:03 +00:00
|
|
|
# TODO: This function's API/behaviour is an unholy mess...
|
2016-09-30 18:20:22 +00:00
|
|
|
# 0 means missing, but we don't bother offsetting the index.
|
2022-07-04 13:04:03 +00:00
|
|
|
cdef hash_t key = precalculated_hash[0] if precalculated_hash is not NULL else hash_utf8(utf8_string, length)
|
2016-10-24 11:49:03 +00:00
|
|
|
cdef Utf8Str* value = <Utf8Str*>self._map.get(key)
|
|
|
|
if value is not NULL:
|
|
|
|
return value
|
Support 'memory zones' for user memory management (#13621)
2024-09-09 09:19:39 +00:00
|
|
|
if allow_transient and self._transient_map is not None:
|
|
|
|
# If we've already allocated a transient string, and now we
|
|
|
|
# want to intern it permanently, we'll end up with the string
|
|
|
|
# in both places. That seems fine -- I don't see why we need
|
|
|
|
# to remove it from the transient map.
|
|
|
|
value = <Utf8Str*>self._transient_map.get(key)
|
|
|
|
if value is not NULL:
|
|
|
|
return value
|
2017-05-28 10:36:27 +00:00
|
|
|
value = _allocate(self.mem, <unsigned char*>utf8_string, length)
|
2024-09-09 09:19:39 +00:00
|
|
|
if allow_transient and self._transient_map is not None:
|
|
|
|
self._transient_map.set(key, value)
|
|
|
|
self._transient_keys.push_back(key)
|
|
|
|
else:
|
|
|
|
self._map.set(key, value)
|
|
|
|
self.keys.push_back(key)
|
2017-05-28 10:36:27 +00:00
|
|
|
return value
|
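The lookup order in `_intern_utf8` above (permanent map first, then the transient map, then allocate and record in whichever map applies) can be sketched in plain Python. This is only an illustration under stated assumptions: `InternTable` and its attributes are hypothetical names, the built-in `hash` stands in for `hash_utf8`, and returning the string itself stands in for `_allocate`. It is not spaCy's implementation.

```python
from typing import Dict, List, Optional


class InternTable:
    """Hypothetical toy model of the two-map lookup; not spaCy's StringStore."""

    def __init__(self) -> None:
        self.permanent: Dict[int, str] = {}
        self.keys: List[int] = []
        # None means that no memory zone is currently active.
        self.transient: Optional[Dict[int, str]] = None
        self.transient_keys: List[int] = []

    def intern(self, text: str, allow_transient: bool) -> str:
        key = hash(text)  # assumption: stands in for hash_utf8
        # 1. A permanent entry always wins.
        value = self.permanent.get(key)
        if value is not None:
            return value
        # 2. Consult the transient map only if the caller allows it
        #    and a zone is active.
        use_transient = allow_transient and self.transient is not None
        if use_transient:
            value = self.transient.get(key)
            if value is not None:
                return value
        # 3. Otherwise "allocate" and record the key in the chosen map.
        value = text  # assumption: stands in for _allocate
        if use_transient:
            self.transient[key] = value
            self.transient_keys.append(key)
        else:
            self.permanent[key] = value
            self.keys.append(key)
        return value
```

As in the original comment, a string that was first stored transiently and is later interned with `allow_transient=False` ends up in both maps; the sketch does not remove it from the transient map either.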