49 Commits

Author SHA1 Message Date
Matthew Honnibal 8569dbc2d0 * Add initial stuff for Chinese parsing 2016-04-24 18:44:24 +02:00
Wolfgang Seeker f9150ccf2a rename vectors.tgz to vectors.bz2 because it's not compressed with gzip but bzip 2016-04-08 13:38:07 +02:00
Wolfgang Seeker a8f4e49900 update init_model.py to previous (better) state 2016-03-29 16:12:13 +02:00
Yaser Martinez Palenzuela 3c210f45fa make use of log_smooth_count 2016-03-17 12:19:52 +01:00
Wolfgang Seeker eae35e9b27 add tokenizer files for German, add/change code to train German pos tagger 2016-02-18 13:24:20 +01:00
- add files to specify rules for German tokenization
- change generate_specials.py to generate from an external file (abbrev.de.tab)
- copy gazetteer.json from lang_data/en/
- init_model.py
	- change doc freq threshold to 0
- add train_german_tagger.py
	- expects conll09-formatted input
Matthew Honnibal 2348a08481 * Load/dump strings with a json file, instead of the hacky strings file we were using. 2015-10-22 21:13:03 +11:00
Matthew Honnibal 92f750cf8b * Use a gzipped frequencies file in init_model 2015-10-11 06:59:44 +02:00
Matthew Honnibal 064bd69ad0 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-10 16:03:48 +11:00
Matthew Honnibal 83dccf0fd7 * Use io module insteads of deprecated codecs module 2015-10-10 14:13:01 +11:00
alvations 8caedba42a caught more codecs.open -> io.open 2015-09-30 20:20:09 +02:00
Matthew Honnibal 1ae55cb63a * Copy tag_map.json in init_model 2015-09-12 05:54:02 +02:00
Matthew Honnibal 5ad4527c42 * Rename Deutsch to German 2015-09-06 20:18:58 +02:00
Matthew Honnibal 950ce36660 * Update init model 2015-09-06 17:51:30 +02:00
Matthew Honnibal b6b1e1aa12 * Add link for Finnish model 2015-08-27 10:26:02 +02:00
Matthew Honnibal dc13edd7cb * Refactor init_model to accomodate other languages 2015-08-26 19:14:05 +02:00
Matthew Honnibal bbf07ac253 * Cut down init_model to work on more languages 2015-08-24 01:05:20 +02:00
Matthew Honnibal 3ecacb9635 * Copy gazetteer file in init_model 2015-08-06 16:07:23 +02:00
Matthew Honnibal 174ed1ad20 * Tighten the frequency filter in init_model 2015-07-27 21:44:51 +02:00
Matthew Honnibal 6047f2aa35 * Fix path to freqs.txt 2015-07-27 02:22:35 +02:00
Matthew Honnibal 0368889d6c * Support gzipped frequencies in init_model 2015-07-26 22:39:22 +02:00
Matthew Honnibal c4f20847da * Fix init_model for travis tests 2015-07-26 14:03:30 +02:00
Matthew Honnibal 09312b9353 * Fix init_model for travis tests 2015-07-26 13:55:47 +02:00
Matthew Honnibal 90ad717dc4 * Update default freq thresholds in init_model 2015-07-26 01:41:17 +02:00
Matthew Honnibal 6a5e035a48 * Ensure data files are copied for tokenizer in init_model 2015-07-26 01:36:19 +02:00
Matthew Honnibal ab93898ac6 * Make heuristics more explicit in init_model 2015-07-26 00:22:19 +02:00
Matthew Honnibal 5c04dcd7c1 * Fix init_model 2015-07-25 23:33:02 +02:00
Matthew Honnibal fd525f0675 * Pass OOV probability around 2015-07-25 23:29:51 +02:00
Matthew Honnibal 5b6bf4d4a6 * Remove probability cap on lexicon 2015-07-25 23:05:51 +02:00
Matthew Honnibal c62eb110c0 * Fix merge conflict in init_model 2015-07-25 23:04:30 +02:00
Matthew Honnibal 0301472d15 * Fix init_model 2015-07-25 22:56:35 +02:00
Matthew Honnibal 8e800adfbc * Fix init_model 2015-07-25 22:54:08 +02:00
Matthew Honnibal 6076213c16 * Fix init_model script 2015-07-25 22:35:52 +02:00
Matthew Honnibal ef448649b3 * Add read_freqs function in init_model 2015-07-25 22:16:36 +02:00
Matthew Honnibal 6be3ee311c Py3 compatibility tweak 2015-07-23 13:13:15 +02:00
Matthew Honnibal d4407d8e2f Py3 compatibility tweak 2015-07-23 09:45:15 +02:00
Matthew Honnibal da4821fc14 * Add cluster words to probs in init_model 2015-07-23 09:27:07 +02:00
Matthew Honnibal 4af2595d99 * Fix structure of wordnet directory for init_model 2015-07-23 06:35:38 +02:00
Matthew Honnibal 83c0f0da22 * Remove lemmatizer from init_model 2015-07-23 02:32:34 +02:00
Matthew Honnibal 386246db5b * Update init_model, making language resources optional 2015-07-22 00:25:14 +02:00
Matthew Honnibal af54d05d60 * Remove sense stuff from init_model 2015-07-14 10:56:17 +02:00
Matthew Honnibal 62cfcd76fe * Add supersense sets to lexemes, from WordNet. Look-up via lemmatization. 2015-07-01 18:48:59 +02:00
Matthew Honnibal c8a553fe91 * Fix cluster initialization 2015-05-31 15:21:28 +02:00
Matthew Honnibal c037f80638 * Add case expansion to Brown clusters 2015-05-31 05:50:50 +02:00
Matthew Honnibal 5ab0f233a1 * Ensure words in Brown clusters make it into the vocab, even if they're not in our probs list 2015-05-31 05:46:16 +02:00
Matthew Honnibal 4489d87550 * Add cluster=0 by default in init_model 2015-04-29 14:23:13 +02:00
Matthew Honnibal 693c5a1558 * Exclude clusterings for words only seen 1 or 2 times, as their clusters are unreliable 2015-04-17 04:44:52 +02:00
Matthew Honnibal 1629b33082 * Fix copying of tokenizer data in init_model 2015-04-12 04:45:31 +02:00
Matthew Honnibal baff0f8ad8 * Add docstring explaining script a bit, and add handling of word vectors 2015-04-08 08:20:15 +02:00
Matthew Honnibal 156b70ed82 * Add new script to replace make_lexicon, that does full setup of data 2015-04-08 07:46:53 +02:00