Commit Graph

207 Commits

Author SHA1 Message Date
Matthew Honnibal 4ff92184f1 Improve train_ud script 2017-01-09 09:53:46 -06:00
Matthew Honnibal c1ef07788c Update train_ud.py
Create deps folder if it doesn't exist.
2017-01-09 10:55:44 +11:00
Matthew Honnibal 46e98ec029 Move init_model.py script from repo. These meta-tools should live elsewhere 2016-12-18 14:03:40 +01:00
dafnevk cdf5dcc40a fixed bug in init_model so that it runs for dutch 2016-12-13 14:33:44 +01:00
Matthew Honnibal c7889492f9 Fix model saving error for Python 3 2016-11-25 18:04:30 -06:00
Matthew Honnibal 22189e60db Use unicode literals in train_ud 2016-11-25 17:45:45 -06:00
Matthew Honnibal da5f0cce36 Fix train_ud script, which trains models from the Universal Dependencies format. 2016-11-25 11:19:33 -06:00
Matthew Honnibal 314bc8d34f Fix train script for 1.0 2016-11-25 08:57:37 -06:00
Matthew Honnibal bd1bfcca61 Update train.py 2016-10-13 03:23:48 +02:00
Matthew Honnibal ea23b64cc8 Refactor training, with new spacy.train module. Defaults still a little awkward. 2016-10-09 12:24:24 +02:00
Matthew Honnibal 53fbd3dd1c Fix train.py for v1.0.0-rc1 2016-10-05 01:11:46 +02:00
Matthew Honnibal ae202e7a60 Fix init_model.py 2016-09-25 15:58:51 +02:00
Matthew Honnibal af847e07fc Fix usage of pathlib for Python3 -- turning paths to strings. 2016-09-24 21:05:27 +02:00
Matthew Honnibal d310dc73ef Fix bin/init_model.py after refactoring 2016-09-24 20:38:18 +02:00
Matthew Honnibal 8036368d96 * Fix model saving 2016-05-23 12:01:46 +00:00
Matthew Honnibal 35214053fd * Work around get_lex_attr bug introduced during German parsing 2016-05-23 10:53:00 +00:00
Wolfgang Seeker dae6bc05eb define German dummy lemmatizer until morphology is done 2016-05-02 16:04:53 +02:00
Matthew Honnibal 8569dbc2d0 * Add initial stuff for Chinese parsing 2016-04-24 18:44:24 +02:00
Wolfgang Seeker f9150ccf2a rename vectors.tgz to vectors.bz2 because it's not compressed with gzip but bzip 2016-04-08 13:38:07 +02:00
Wolfgang Seeker a8f4e49900 update init_model.py to previous (better) state 2016-03-29 16:12:13 +02:00
Matthew Honnibal d249e2f7f3 * Improve error message in bin/parser/train.py 2016-03-29 13:04:33 +11:00
Yaser Martinez Palenzuela 3c210f45fa make use of log_smooth_count 2016-03-17 12:19:52 +01:00
Matthew Honnibal fcaa0ad7ce Merge pull request #280 from wbwseeker/german_parser
German parser
2016-03-04 03:27:42 +11:00
Wolfgang Seeker 690c5acabf adjust train.py to train both english and german models 2016-03-03 15:21:00 +01:00
Matthew Honnibal 9d51e4d13c Delete gather_freqs.py
This script was in a broken state, and should be unnecessary. The functionality is subsumed by `get_freqs.py`
2016-03-02 00:42:55 +11:00
Yaser Martinez Palenzuela 1a93d7f725 replace codecs.open with io.open 2016-03-01 14:10:11 +01:00
Wolfgang Seeker eae35e9b27 add tokenizer files for German, add/change code to train German pos tagger
- add files to specify rules for German tokenization
- change generate_specials.py to generate from an external file (abbrev.de.tab)
- copy gazetteer.json from lang_data/en/

- init_model.py
	- change doc freq threshold to 0
- add train_german_tagger.py
	- expects conll09-formatted input
2016-02-18 13:24:20 +01:00
Henning Peters a89ca6537b fix cythonize 2016-02-05 16:17:23 +01:00
Henning Peters 3a50448bf3 py3 compatibility 2016-02-05 15:43:50 +01:00
Henning Peters 7627969aba refactor, listen on setup.py, *.pxd 2016-02-05 15:37:00 +01:00
Matthew Honnibal 5dc6cffc67 * Fix gather_freqs.py 2016-02-04 20:21:58 +01:00
Matthew Honnibal e2ed6251d7 * Fancy up the CLI for the conll train script 2016-02-02 22:58:06 +01:00
Matthew Honnibal a676d66807 * Update the CoNLL train script, to get working on other languages 2016-02-02 22:29:34 +01:00
Henning Peters 73674a4afb try using system-wide headers 2015-12-13 12:51:23 +01:00
Henning Peters 92fabd0114 wrap virtualenv around cythonize 2015-12-13 12:32:22 +01:00
Henning Peters 9662cf04c9 new approach to dependency headers 2015-12-13 11:53:02 +01:00
Matthew Honnibal 6e68b344c1 * Train after parsing, not before. 2015-11-12 04:43:52 +11:00
Matthew Honnibal 4fb038a9eb * Update conll_train.py script for spaCy v0.97 2015-10-31 00:53:51 +11:00
Matthew Honnibal cfaa4bde5d * Add train and parse scripts that use CoNLL formatted data 2015-10-30 12:54:49 +11:00
Matthew Honnibal 2348a08481 * Load/dump strings with a json file, instead of the hacky strings file we were using. 2015-10-22 21:13:03 +11:00
Matthew Honnibal 0ce12e4548 * Import io in get_freqs 2015-10-19 12:56:18 +11:00
Matthew Honnibal 17fffb4c57 * Update get_freqs.py script 2015-10-16 04:33:49 +11:00
Matthew Honnibal 5ff4454177 * Update get_freqs.py script 2015-10-16 04:31:15 +11:00
Matthew Honnibal a748146dd3 * Update get_freqs.py script 2015-10-16 04:24:50 +11:00
Matthew Honnibal a29fd79fbc * Update get_freqs.py script 2015-10-16 04:24:08 +11:00
Matthew Honnibal e08a4b46a2 * Update get_freqs.py script 2015-10-16 04:20:35 +11:00
Matthew Honnibal 92f750cf8b * Use a gzipped frequencies file in init_model 2015-10-11 06:59:44 +02:00
Matthew Honnibal 064bd69ad0 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-10 16:03:48 +11:00
Matthew Honnibal 83dccf0fd7 * Use io module instead of deprecated codecs module 2015-10-10 14:13:01 +11:00
Matthew Honnibal f35632e2e5 * Remove SBD print statement in train, after SBD evaluation was removed from Scorer 2015-10-09 11:08:58 +02:00