spaCy/spacy/tokens/doc.pxd

from cymem.cymem cimport Pool
cimport numpy as np

from ..vocab cimport Vocab
from ..structs cimport TokenC, LexemeC
from ..typedefs cimport attr_t
from ..attrs cimport attr_id_t


cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil
# Variant for the Matcher: normalizes TokenC.sent_start values to booleans.
cdef attr_t get_token_attr_for_matcher(const TokenC* token, attr_id_t feat_name) nogil


ctypedef const LexemeC* const_Lexeme_ptr
ctypedef const TokenC* const_TokenC_ptr

ctypedef fused LexemeOrToken:
    const_Lexeme_ptr
    const_TokenC_ptr


# Adjusts only the tokens within the specified range, not the whole document.
cdef int set_children_from_heads(TokenC* tokens, int start, int end) except -1


cdef int _set_lr_kids_and_edges(TokenC* tokens, int start, int end, int loop_count) except -1


cdef int token_by_start(const TokenC* tokens, int length, int start_char) except -2


cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2


# Helper shared by Doc.get_lca_matrix and Span.get_lca_matrix; returns a memoryview.
cdef int [:,:] _get_lca_matrix(Doc, int start, int end)

cdef class Doc:
    cdef readonly Pool mem
    cdef readonly Vocab vocab

    cdef public object _vector
    cdef public object _vector_norm

    cdef public object tensor
    cdef public object cats
    cdef public object user_data

    # Underlying TokenC array (formerly named Doc.data).
    cdef TokenC* c

    cdef public float sentiment

    cdef public dict user_hooks
    cdef public dict user_token_hooks
    cdef public dict user_span_hooks

    # Records whether the Doc was built without known spacing information.
    cdef public bint has_unknown_spaces

    cdef public list _py_tokens

    cdef int length
    cdef int max_length

    cdef public object noun_chunks_iterator

    cdef object __weakref__

    cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1

    cpdef np.ndarray to_array(self, object features)