spaCy

Commit Graph

Author	SHA1	Message	Date
Matthew Honnibal	59b41a9fd3	* Switch to new data model, tests passing	2014-10-10 08:11:31 +11:00
Matthew Honnibal	b15619e170	* Use PointerHash instead of locally provided _hashing module	2014-09-25 18:23:35 +02:00
Matthew Honnibal	6266cac593	* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks	2014-09-17 20:02:26 +02:00
Matthew Honnibal	0152831c89	* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token.	2014-09-16 18:01:46 +02:00
Matthew Honnibal	143e51ec73	* Refactor tokenization, splitting it into a clearer life-cycle.	2014-09-16 13:16:02 +02:00
Matthew Honnibal	7959141d36	* Add a few abbreviations, to get tests to pass	2014-09-15 06:32:18 +02:00
Matthew Honnibal	df24e3708c	* Move EnglishTokens stuff to Tokens	2014-09-15 01:31:44 +02:00
Matthew Honnibal	0447279c57	* PointerHash working, efficiency is good. 6-7 mins	2014-09-13 16:43:59 +02:00
Matthew Honnibal	85d68e8e95	* Replaced cache with own hash table. Similar timing	2014-09-13 03:14:43 +02:00
Matthew Honnibal	126a8453a5	* Fix performance issues by implementing a better cache. Add own String struct to help	2014-09-12 23:50:37 +02:00
Matthew Honnibal	9298e36b36	* Move special tokenization into its own lookup table, away from the cache.	2014-09-12 19:43:14 +02:00
Matthew Honnibal	985bc68327	* Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation.	2014-09-12 18:26:26 +02:00
Matthew Honnibal	5aa591106b	* Fiddle with token features	2014-09-12 15:49:36 +02:00
Matthew Honnibal	1533041885	* Update the split_one method, so that it doesn't need to cast back to a Python object	2014-09-12 05:10:59 +02:00
Matthew Honnibal	4817277d66	* Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery.	2014-09-12 04:29:09 +02:00
Matthew Honnibal	073ee0de63	* Restore dense_hash_map for cache dictionary. Seems to double efficiency	2014-09-12 02:23:51 +02:00
Matthew Honnibal	1a3222af4b	* Moving tokens to use an array internally, instead of a list of Lexeme objects.	2014-09-11 16:57:08 +02:00
Matthew Honnibal	7c09c73a14	* Refactor to use tokens class.	2014-09-10 18:27:44 +02:00
Matthew Honnibal	cf412adba8	* Refactoring to use Tokens object	2014-09-10 18:11:13 +02:00
Matthew Honnibal	dcab14ede2	* Begin testing more functionality	2014-08-30 19:01:15 +02:00
Matthew Honnibal	45a22d6b2c	* Docs coming together	2014-08-29 01:59:23 +02:00
Matthew Honnibal	c282e6d5fb	* Redesign proceeding	2014-08-28 19:45:09 +02:00
Matthew Honnibal	fdaf24604a	* Basic punct tests updated and passing	2014-08-27 19:38:57 +02:00
Matthew Honnibal	8d20617dfd	* Whitespace	2014-08-27 17:16:16 +02:00
Matthew Honnibal	e9a62b6eba	* Refactoring with Lexeme as a class now compiles. Basic design seems to work	2014-08-27 17:15:39 +02:00
Matthew Honnibal	68bae2fec6	* More refactoring	2014-08-25 16:42:22 +02:00
Matthew Honnibal	88095666dc	* Remove Lexeme struct, preparing to rename Word to Lexeme.	2014-08-24 19:24:42 +02:00
Matthew Honnibal	3b793cf4f7	* Tests passing for new Word object version	2014-08-24 18:13:53 +02:00
Matthew Honnibal	782806df08	* Moving to Word objects in place of the Lexeme struct.	2014-08-22 17:28:23 +02:00
Matthew Honnibal	e289896603	* Fix ptb3 module	2014-08-22 16:36:17 +02:00
Matthew Honnibal	07ecf5d2f4	* Fixed group_by, removed idea of general attr_of function.	2014-08-22 00:02:37 +02:00
Matthew Honnibal	811b7a6b91	* Struggling with arbitrary attr access...	2014-08-21 23:49:14 +02:00
Matthew Honnibal	314658b31c	* Improve module docstring	2014-08-21 18:42:47 +02:00
Matthew Honnibal	248cbb6d07	* Update doc strings	2014-08-21 03:29:15 +02:00
Matthew Honnibal	a78ad4152d	* Broken version being refactored for docs	2014-08-20 13:39:39 +02:00
Matthew Honnibal	5fddb8d165	* Working refactor, with updated data model for Lexemes	2014-08-19 04:21:20 +02:00
Matthew Honnibal	3379d7a571	* Reforming data model for lexemes	2014-08-19 02:40:37 +02:00
Matthew Honnibal	01469b0888	* Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word.	2014-08-18 19:14:00 +02:00
Matthew Honnibal	a225ca5b0d	* Refactoring tokenizer	2014-08-16 03:22:03 +02:00
Matthew Honnibal	a895fe5ddb	* Upd from spacy	2014-07-23 17:35:18 +01:00
Matthew Honnibal	87bf205b82	* Fix open apostrophe bug	2014-07-07 23:26:01 +02:00
Matthew Honnibal	057c21969b	* Refactor for string view features. Working on setting up flags and enums.	2014-07-07 16:58:48 +02:00
Matthew Honnibal	f1bcbd4c4e	* Reorganized code to accommodate Tokens class. Need string views before group_by and count_by can be done well.	2014-07-07 12:47:21 +02:00
Matthew Honnibal	ff1869ff07	* Fixed major efficiency problem, from not quite grokking pass by reference in cython c++	2014-07-07 07:36:43 +02:00
Matthew Honnibal	d5bef02c72	* Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals	2014-07-07 04:21:06 +02:00
Matthew Honnibal	a62c38e1ef	* Working tokenization. en doesn't match PTB perfectly. Need to reorganize before adding more schemes.	2014-07-07 01:15:59 +02:00
Matthew Honnibal	4e79446dc2	* Reading in tokenization rules correctly. Passing tests.	2014-07-07 00:02:55 +02:00
Matthew Honnibal	72159e7011	* Fixes to tokenization. Now segment sequences of the same punctuation.	2014-07-06 19:28:42 +02:00
Matthew Honnibal	e98e97d483	* Possessive test passing	2014-07-06 18:35:55 +02:00
Matthew Honnibal	556f6a18ca	* Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc.	2014-07-05 20:51:42 +02:00

50 Commits