spaCy

Commit Graph

Author	SHA1	Message	Date
Wolfgang Seeker	eae35e9b27	add tokenizer files for German, add/change code to train German pos tagger - add files to specify rules for German tokenization - change generate_specials.py to generate from an external file (abbrev.de.tab) - copy gazetteer.json from lang_data/en/ - init_model.py - change doc freq threshold to 0 - add train_german_tagger.py - expects conll09-formatted input	2016-02-18 13:24:20 +01:00
Matthew Honnibal	2348a08481	* Load/dump strings with a json file, instead of the hacky strings file we were using.	2015-10-22 21:13:03 +11:00
Matthew Honnibal	92f750cf8b	* Use a gzipped frequencies file in init_model	2015-10-11 06:59:44 +02:00
Matthew Honnibal	064bd69ad0	* Refactor symbols, so that frequency rank can be derived from the orth id of a word.	2015-10-10 16:03:48 +11:00
Matthew Honnibal	83dccf0fd7	* Use io module instead of deprecated codecs module	2015-10-10 14:13:01 +11:00
alvations	8caedba42a	caught more codecs.open -> io.open	2015-09-30 20:20:09 +02:00
Matthew Honnibal	1ae55cb63a	* Copy tag_map.json in init_model	2015-09-12 05:54:02 +02:00
Matthew Honnibal	5ad4527c42	* Rename Deutsch to German	2015-09-06 20:18:58 +02:00
Matthew Honnibal	950ce36660	* Update init model	2015-09-06 17:51:30 +02:00
Matthew Honnibal	b6b1e1aa12	* Add link for Finnish model	2015-08-27 10:26:02 +02:00
Matthew Honnibal	dc13edd7cb	* Refactor init_model to accommodate other languages	2015-08-26 19:14:05 +02:00
Matthew Honnibal	bbf07ac253	* Cut down init_model to work on more languages	2015-08-24 01:05:20 +02:00
Matthew Honnibal	3ecacb9635	* Copy gazetteer file in init_model	2015-08-06 16:07:23 +02:00
Matthew Honnibal	174ed1ad20	* Tighten the frequency filter in init_model	2015-07-27 21:44:51 +02:00
Matthew Honnibal	6047f2aa35	* Fix path to freqs.txt	2015-07-27 02:22:35 +02:00
Matthew Honnibal	0368889d6c	* Support gzipped frequencies in init_model	2015-07-26 22:39:22 +02:00
Matthew Honnibal	c4f20847da	* Fix init_model for travis tests	2015-07-26 14:03:30 +02:00
Matthew Honnibal	09312b9353	* Fix init_model for travis tests	2015-07-26 13:55:47 +02:00
Matthew Honnibal	90ad717dc4	* Update default freq thresholds in init_model	2015-07-26 01:41:17 +02:00
Matthew Honnibal	6a5e035a48	* Ensure data files are copied for tokenizer in init_model	2015-07-26 01:36:19 +02:00
Matthew Honnibal	ab93898ac6	* Make heuristics more explicit in init_model	2015-07-26 00:22:19 +02:00
Matthew Honnibal	5c04dcd7c1	* Fix init_model	2015-07-25 23:33:02 +02:00
Matthew Honnibal	fd525f0675	* Pass OOV probability around	2015-07-25 23:29:51 +02:00
Matthew Honnibal	5b6bf4d4a6	* Remove probability cap on lexicon	2015-07-25 23:05:51 +02:00
Matthew Honnibal	c62eb110c0	* Fix merge conflict in init_model	2015-07-25 23:04:30 +02:00
Matthew Honnibal	0301472d15	* Fix init_model	2015-07-25 22:56:35 +02:00
Matthew Honnibal	8e800adfbc	* Fix init_model	2015-07-25 22:54:08 +02:00
Matthew Honnibal	6076213c16	* Fix init_model script	2015-07-25 22:35:52 +02:00
Matthew Honnibal	ef448649b3	* Add read_freqs function in init_model	2015-07-25 22:16:36 +02:00
Matthew Honnibal	6be3ee311c	Py3 compatibility tweak	2015-07-23 13:13:15 +02:00
Matthew Honnibal	d4407d8e2f	Py3 compatibility tweak	2015-07-23 09:45:15 +02:00
Matthew Honnibal	da4821fc14	* Add cluster words to probs in init_model	2015-07-23 09:27:07 +02:00
Matthew Honnibal	4af2595d99	* Fix structure of wordnet directory for init_model	2015-07-23 06:35:38 +02:00
Matthew Honnibal	83c0f0da22	* Remove lemmatizer from init_model	2015-07-23 02:32:34 +02:00
Matthew Honnibal	386246db5b	* Update init_model, making language resources optional	2015-07-22 00:25:14 +02:00
Matthew Honnibal	af54d05d60	* Remove sense stuff from init_model	2015-07-14 10:56:17 +02:00
Matthew Honnibal	62cfcd76fe	* Add supersense sets to lexemes, from WordNet. Look-up via lemmatization.	2015-07-01 18:48:59 +02:00
Matthew Honnibal	c8a553fe91	* Fix cluster initialization	2015-05-31 15:21:28 +02:00
Matthew Honnibal	c037f80638	* Add case expansion to Brown clusters	2015-05-31 05:50:50 +02:00
Matthew Honnibal	5ab0f233a1	* Ensure words in Brown clusters make it into the vocab, even if they're not in our probs list	2015-05-31 05:46:16 +02:00
Matthew Honnibal	4489d87550	* Add cluster=0 by default in init_model	2015-04-29 14:23:13 +02:00
Matthew Honnibal	693c5a1558	* Exclude clusterings for words only seen 1 or 2 times, as their clusters are unreliable	2015-04-17 04:44:52 +02:00
Matthew Honnibal	1629b33082	* Fix copying of tokenizer data in init_model	2015-04-12 04:45:31 +02:00
Matthew Honnibal	baff0f8ad8	* Add docstring explaining script a bit, and add handling of word vectors	2015-04-08 08:20:15 +02:00
Matthew Honnibal	156b70ed82	* Add new script to replace make_lexicon, that does full setup of data	2015-04-08 07:46:53 +02:00

45 Commits