spaCy/spacy/spacy.pxd

from libcpp.vector cimport vector
from libc.stdint cimport uint32_t
from libc.stdint cimport uint64_t

# These typedefs are declared before the cimports below to avoid
# circular import problems.
ctypedef size_t Lexeme_addr
ctypedef uint32_t StringHash
ctypedef char Bits8
ctypedef uint64_t Bits64
ctypedef int ClusterID

from spacy.lexeme cimport Lexeme
from spacy.tokens cimport Tokens


cdef class Language:
    cdef object name
    # Lookup tables, stored as Python dicts.
    cdef dict chunks
    cdef dict vocab
    cdef dict bacov

    cpdef Tokens tokenize(self, unicode text)

    # For the pointer-returning methods, `except NULL` means that a NULL
    # return value signals that a Python exception was raised.
    cdef Lexeme* lookup(self, unicode string) except NULL
    cdef Lexeme** lookup_chunk(self, unicode chunk) except NULL

    cdef Lexeme** new_chunk(self, unicode string, list substrings) except NULL
    cdef Lexeme* new_lexeme(self, unicode lex) except NULL

    cpdef unicode unhash(self, StringHash hashed)

    cpdef list find_substrings(self, unicode chunk)
    cdef int find_split(self, unicode word)
    cdef int set_orth(self, unicode string, Lexeme* word)
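
The declarations above give only signatures, so the following is a minimal pure-Python sketch of how the pieces might fit together. It is not the implementation in spacy.pyx. Everything here is an assumption made for illustration: `LanguageSketch` and `hash_string` are hypothetical names, CRC32 stands in for the 32-bit `StringHash`, lexemes are plain dicts rather than `Lexeme` structs, `chunks` is treated as a cache of the lexemes for each whitespace-delimited chunk, `bacov` is treated as the map from a hash back to its string, and `find_split` uses an invented rule (split at the first non-alphanumeric character).

```python
import zlib


def hash_string(string):
    """Hypothetical 32-bit string hash; CRC32 is a stand-in, not spaCy's hash."""
    return zlib.crc32(string.encode("utf8")) & 0xFFFFFFFF


class LanguageSketch:
    """Illustrative pure-Python stand-in for the declared Language class."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # assumed: chunk hash -> list of lexemes in the chunk
        self.vocab = {}   # assumed: string hash -> lexeme
        self.bacov = {}   # assumed: string hash -> original string

    def tokenize(self, text):
        # Split on whitespace, then expand each chunk into its lexemes.
        tokens = []
        for chunk in text.split():
            tokens.extend(self.lookup_chunk(chunk))
        return tokens

    def lookup(self, string):
        # Return the single shared lexeme for this string, creating it once.
        key = hash_string(string)
        if key not in self.vocab:
            self.vocab[key] = self.new_lexeme(string)
        return self.vocab[key]

    def lookup_chunk(self, chunk):
        # Chunks are split only the first time they are seen.
        key = hash_string(chunk)
        if key not in self.chunks:
            self.chunks[key] = self.new_chunk(chunk, self.find_substrings(chunk))
        return self.chunks[key]

    def new_chunk(self, string, substrings):
        return [self.lookup(substring) for substring in substrings]

    def new_lexeme(self, lex):
        key = hash_string(lex)
        self.bacov[key] = lex  # remember the string so unhash can recover it
        return {"hash": key, "length": len(lex)}

    def unhash(self, hashed):
        return self.bacov[hashed]

    def find_substrings(self, chunk):
        # Repeatedly cut the chunk at the index reported by find_split.
        substrings = []
        while chunk:
            split = self.find_split(chunk)
            substrings.append(chunk[:split])
            chunk = chunk[split:]
        return substrings

    def find_split(self, word):
        # Invented rule: a leading non-alphanumeric character is its own
        # token; otherwise cut just before the first non-alphanumeric one.
        if not word[0].isalnum():
            return 1
        for i, char in enumerate(word):
            if not char.isalnum():
                return i
        return len(word)


lang = LanguageSketch("en")
tokens = lang.tokenize("Hello, world!")
print([lang.unhash(token["hash"]) for token in tokens])
# prints ['Hello', ',', 'world', '!']
```

Caching at the chunk level means a repeated chunk such as "world!" costs one dict lookup, while `vocab` keeps exactly one lexeme per distinct word regardless of which chunk it came from.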

* Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc. 2014-07-05 18:51:42 +00:00
* Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals 2014-07-07 02:21:06 +00:00
* Reorganized code to accommodate Tokens class. Need string views before group_by and count_by can be done well. 2014-07-07 10:47:21 +00:00
* Refactor for string view features. Working on setting up flags and enums. 2014-07-07 14:58:48 +00:00
* Switch to 32bit hash for strings 2014-08-02 20:51:52 +00:00
* Refactoring tokenizer 2014-08-16 01:22:03 +00:00
* Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word. 2014-08-18 17:14:00 +00:00
* Roll back to using unicode, and never Py_UNICODE. No dependence on murmurhash either. 2014-08-18 18:48:48 +00:00
* Broken version being refactored for docs 2014-08-20 11:39:39 +00:00
* Replace the use of dense_hash_map with Python dict 2014-08-22 15:13:09 +00:00