Matthew Honnibal
|
e5e951ae67
|
* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.
|
2014-10-23 01:57:59 +11:00 |
Matthew Honnibal
|
0a0e41f6c8
|
* Add prefix and suffix features
|
2014-10-22 12:56:09 +11:00 |
Matthew Honnibal
|
65d3ead4fd
|
* Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id
|
2014-10-14 15:19:07 +11:00 |
Matthew Honnibal
|
71ee921055
|
* Slight cleaning of tokenizer code
|
2014-10-10 19:17:22 +11:00 |
Matthew Honnibal
|
59b41a9fd3
|
* Switch to new data model, tests passing
|
2014-10-10 08:11:31 +11:00 |
Matthew Honnibal
|
1b0e01d3d8
|
* Revising data model of lexeme. Compiles.
|
2014-10-09 19:53:30 +11:00 |
Matthew Honnibal
|
e40caae51f
|
* Update Lexicon class to expect a list of lexeme dict descriptions
|
2014-10-09 14:51:35 +11:00 |
Matthew Honnibal
|
51d75b244b
|
* Add serialize/deserialize functions for lexeme, transport to/from python dict.
|
2014-10-09 14:10:46 +11:00 |
Matthew Honnibal
|
d73d89a2de
|
* Add i attribute to lexeme, giving lexemes sequential IDs.
|
2014-10-09 13:50:05 +11:00 |
Matthew Honnibal
|
ac522e2553
|
* Switch from own memory class to cymem, in pip
|
2014-09-17 23:09:24 +02:00 |
Matthew Honnibal
|
6266cac593
|
* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks
|
2014-09-17 20:02:26 +02:00 |
Matthew Honnibal
|
f77b7098c0
|
* Update Tokens to use vector, with bounds checking.
|
2014-09-15 03:22:40 +02:00 |
Matthew Honnibal
|
b488224c09
|
* Restoring Lexeme-as-struct
|
2014-09-10 20:41:37 +02:00 |
Matthew Honnibal
|
88095666dc
|
* Remove Lexeme struct, preparing to rename Word to Lexeme.
|
2014-08-24 19:24:42 +02:00 |
Matthew Honnibal
|
e289896603
|
* Fix ptb3 module
|
2014-08-22 16:36:17 +02:00 |
Matthew Honnibal
|
811b7a6b91
|
* Struggling with arbitrary attr access...
|
2014-08-21 23:49:14 +02:00 |
Matthew Honnibal
|
d10993f41a
|
* More docs work
|
2014-08-21 16:37:13 +02:00 |
Matthew Honnibal
|
a78ad4152d
|
* Broken version being refactored for docs
|
2014-08-20 13:39:39 +02:00 |
Matthew Honnibal
|
5fddb8d165
|
* Working refactor, with updated data model for Lexemes
|
2014-08-19 04:21:20 +02:00 |
Matthew Honnibal
|
3379d7a571
|
* Reforming data model for lexemes
|
2014-08-19 02:40:37 +02:00 |
Matthew Honnibal
|
01469b0888
|
* Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word.
|
2014-08-18 19:14:00 +02:00 |
Matthew Honnibal
|
515d41d325
|
* Restore string saving to spacy
|
2014-08-16 16:09:24 +02:00 |
Matthew Honnibal
|
a225ca5b0d
|
* Refactoring tokenizer
|
2014-08-16 03:22:03 +02:00 |
Matthew Honnibal
|
d6e07aa922
|
* Switch to 32bit hash for strings
|
2014-08-02 21:51:52 +01:00 |
Matthew Honnibal
|
6319ff0f22
|
* Add length property
|
2014-08-02 21:26:44 +01:00 |
Matthew Honnibal
|
571808a274
|
* Group-by seems to be working
|
2014-07-07 20:27:02 +02:00 |
Matthew Honnibal
|
80b36f9f27
|
* 710k words per second for counts
|
2014-07-07 19:12:19 +02:00 |
Matthew Honnibal
|
057c21969b
|
* Refactor for string view features. Working on setting up flags and enums.
|
2014-07-07 16:58:48 +02:00 |
Matthew Honnibal
|
f1bcbd4c4e
|
* Reorganized code to accommodate Tokens class. Need string views before group_by and count_by can be done well.
|
2014-07-07 12:47:21 +02:00 |
Matthew Honnibal
|
ff1869ff07
|
* Fixed major efficiency problem, caused by not quite grokking pass-by-reference in Cython C++
|
2014-07-07 07:36:43 +02:00 |
Matthew Honnibal
|
d5bef02c72
|
* Reorganized, moving language-independent stuff to spacy. The functions in spacy take the dictionaries and split function as input, but the language-specific modules are curried versions that use the globals.
|
2014-07-07 04:21:06 +02:00 |
Matthew Honnibal
|
556f6a18ca
|
* Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc.
|
2014-07-05 20:51:42 +02:00 |