spaCy/spacy/strings.pxd

from cymem.cymem cimport Pool
from libc.stdint cimport int64_t
from libcpp.set cimport set
from libcpp.vector cimport vector
from murmurhash.mrmr cimport hash64
from preshed.maps cimport PreshMap

from .typedefs cimport attr_t, hash_t


cpdef hash_t hash_string(str string) except 0
cdef hash_t hash_utf8(char* utf8_string, int length) nogil

cdef str decode_Utf8Str(const Utf8Str* string)


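# Storage cell for an interned UTF-8 string: either an 8-byte inline buffer
# (`s`) or a pointer (`p`) to a separately allocated buffer.
ctypedef union Utf8Str:
    unsigned char[8] s
    unsigned char* p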


cdef class StringStore:
    cdef Pool mem

    cdef vector[hash_t] keys
    cdef public PreshMap _map

    cdef const Utf8Str* intern_unicode(self, str py_string, bint allow_transient)
    cdef const Utf8Str* _intern_utf8(self, char* utf8_string, int length, hash_t* precalculated_hash, bint allow_transient)
    cdef vector[hash_t] _transient_keys
    cdef Pool _non_temp_mem
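
The declarations above suggest a store that maps 64-bit hashes to strings and tracks permanent keys (`keys`) separately from transient ones (`_transient_keys`). A minimal pure-Python sketch of that idea follows. It is an illustration under stated assumptions, not the implementation behind this header: `ToyStringStore`, its `add` and `memory_zone` methods, and the BLAKE2b-based stand-in hash are all hypothetical, and the real module uses MurmurHash, `PreshMap`, and `Pool` allocations.

```python
import hashlib
from contextlib import contextmanager


def hash_string(string: str) -> int:
    """Stand-in 64-bit hash. Assumption: the real module uses MurmurHash."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Hypothetical pure-Python analogue of the declared StringStore fields."""

    def __init__(self):
        self._map = {}             # hash -> string, standing in for `_map`
        self.keys = []             # hashes of permanently interned strings
        self._transient_keys = []  # hashes interned while a zone is active
        self._in_zone = False

    def add(self, string: str, allow_transient: bool = False) -> int:
        """Intern `string` and return its hash; reuse the entry if present."""
        key = hash_string(string)
        if key in self._map:
            return key
        self._map[key] = string
        if allow_transient and self._in_zone:
            self._transient_keys.append(key)
        else:
            self.keys.append(key)
        return key

    def __getitem__(self, key: int) -> str:
        return self._map[key]

    @contextmanager
    def memory_zone(self):
        """Drop every transient string when the block exits."""
        self._in_zone = True
        try:
            yield self
        finally:
            for key in self._transient_keys:
                del self._map[key]
            self._transient_keys.clear()
            self._in_zone = False
```

In this sketch, a string added with `allow_transient=True` inside `memory_zone()` disappears when the block exits, while strings interned beforehand, or without the flag, stay in `keys`.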