Wolfgang Seeker
eae35e9b27
add tokenizer files for German, add/change code to train German pos tagger
...
- add files to specify rules for German tokenization
- change generate_specials.py to generate from an external file (abbrev.de.tab)
- copy gazetteer.json from lang_data/en/
- init_model.py
  - change doc freq threshold to 0
- add train_german_tagger.py
  - expects conll09-formatted input
2016-02-18 13:24:20 +01:00
Matthew Honnibal
2348a08481
* Load/dump strings with a json file, instead of the hacky strings file we were using.
2015-10-22 21:13:03 +11:00
Matthew Honnibal
92f750cf8b
* Use a gzipped frequencies file in init_model
2015-10-11 06:59:44 +02:00
Matthew Honnibal
064bd69ad0
* Refactor symbols, so that frequency rank can be derived from the orth id of a word.
2015-10-10 16:03:48 +11:00
Matthew Honnibal
83dccf0fd7
* Use io module instead of deprecated codecs module
2015-10-10 14:13:01 +11:00
alvations
8caedba42a
caught more codecs.open -> io.open
2015-09-30 20:20:09 +02:00
Matthew Honnibal
1ae55cb63a
* Copy tag_map.json in init_model
2015-09-12 05:54:02 +02:00
Matthew Honnibal
5ad4527c42
* Rename Deutsch to German
2015-09-06 20:18:58 +02:00
Matthew Honnibal
950ce36660
* Update init model
2015-09-06 17:51:30 +02:00
Matthew Honnibal
b6b1e1aa12
* Add link for Finnish model
2015-08-27 10:26:02 +02:00
Matthew Honnibal
dc13edd7cb
* Refactor init_model to accommodate other languages
2015-08-26 19:14:05 +02:00
Matthew Honnibal
bbf07ac253
* Cut down init_model to work on more languages
2015-08-24 01:05:20 +02:00
Matthew Honnibal
3ecacb9635
* Copy gazetteer file in init_model
2015-08-06 16:07:23 +02:00
Matthew Honnibal
174ed1ad20
* Tighten the frequency filter in init_model
2015-07-27 21:44:51 +02:00
Matthew Honnibal
6047f2aa35
* Fix path to freqs.txt
2015-07-27 02:22:35 +02:00
Matthew Honnibal
0368889d6c
* Support gzipped frequencies in init_model
2015-07-26 22:39:22 +02:00
Matthew Honnibal
c4f20847da
* Fix init_model for travis tests
2015-07-26 14:03:30 +02:00
Matthew Honnibal
09312b9353
* Fix init_model for travis tests
2015-07-26 13:55:47 +02:00
Matthew Honnibal
90ad717dc4
* Update default freq thresholds in init_model
2015-07-26 01:41:17 +02:00
Matthew Honnibal
6a5e035a48
* Ensure data files are copied for tokenizer in init_model
2015-07-26 01:36:19 +02:00
Matthew Honnibal
ab93898ac6
* Make heuristics more explicit in init_model
2015-07-26 00:22:19 +02:00
Matthew Honnibal
5c04dcd7c1
* Fix init_model
2015-07-25 23:33:02 +02:00
Matthew Honnibal
fd525f0675
* Pass OOV probability around
2015-07-25 23:29:51 +02:00
Matthew Honnibal
5b6bf4d4a6
* Remove probability cap on lexicon
2015-07-25 23:05:51 +02:00
Matthew Honnibal
c62eb110c0
* Fix merge conflict in init_model
2015-07-25 23:04:30 +02:00
Matthew Honnibal
0301472d15
* Fix init_model
2015-07-25 22:56:35 +02:00
Matthew Honnibal
8e800adfbc
* Fix init_model
2015-07-25 22:54:08 +02:00
Matthew Honnibal
6076213c16
* Fix init_model script
2015-07-25 22:35:52 +02:00
Matthew Honnibal
ef448649b3
* Add read_freqs function in init_model
2015-07-25 22:16:36 +02:00
Matthew Honnibal
6be3ee311c
Py3 compatibility tweak
2015-07-23 13:13:15 +02:00
Matthew Honnibal
d4407d8e2f
Py3 compatibility tweak
2015-07-23 09:45:15 +02:00
Matthew Honnibal
da4821fc14
* Add cluster words to probs in init_model
2015-07-23 09:27:07 +02:00
Matthew Honnibal
4af2595d99
* Fix structure of wordnet directory for init_model
2015-07-23 06:35:38 +02:00
Matthew Honnibal
83c0f0da22
* Remove lemmatizer from init_model
2015-07-23 02:32:34 +02:00
Matthew Honnibal
386246db5b
* Update init_model, making language resources optional
2015-07-22 00:25:14 +02:00
Matthew Honnibal
af54d05d60
* Remove sense stuff from init_model
2015-07-14 10:56:17 +02:00
Matthew Honnibal
62cfcd76fe
* Add supersense sets to lexemes, from WordNet. Look-up via lemmatization.
2015-07-01 18:48:59 +02:00
Matthew Honnibal
c8a553fe91
* Fix cluster initialization
2015-05-31 15:21:28 +02:00
Matthew Honnibal
c037f80638
* Add case expansion to Brown clusters
2015-05-31 05:50:50 +02:00
Matthew Honnibal
5ab0f233a1
* Ensure words in Brown clusters make it into the vocab, even if they're not in our probs list
2015-05-31 05:46:16 +02:00
Matthew Honnibal
4489d87550
* Add cluster=0 by default in init_model
2015-04-29 14:23:13 +02:00
Matthew Honnibal
693c5a1558
* Exclude clusterings for words only seen 1 or 2 times, as their clusters are unreliable
2015-04-17 04:44:52 +02:00
Matthew Honnibal
1629b33082
* Fix copying of tokenizer data in init_model
2015-04-12 04:45:31 +02:00
Matthew Honnibal
baff0f8ad8
* Add docstring explaining script a bit, and add handling of word vectors
2015-04-08 08:20:15 +02:00
Matthew Honnibal
156b70ed82
* Add new script to replace make_lexicon, that does full setup of data
2015-04-08 07:46:53 +02:00