Commit Graph

242 Commits

Author SHA1 Message Date
Matthew Honnibal 35214053fd * Work around get_lex_attr bug introduced during German parsing 2016-05-23 10:53:00 +00:00
Wolfgang Seeker dae6bc05eb define German dummy lemmatizer until morphology is done 2016-05-02 16:04:53 +02:00
Matthew Honnibal 8569dbc2d0 * Add initial stuff for Chinese parsing 2016-04-24 18:44:24 +02:00
Wolfgang Seeker f9150ccf2a rename vectors.tgz to vectors.bz2 because it's not compressed with gzip but bzip 2016-04-08 13:38:07 +02:00
Wolfgang Seeker a8f4e49900 update init_model.py to previous (better) state 2016-03-29 16:12:13 +02:00
Matthew Honnibal d249e2f7f3 * Improve error message in bin/parser/train.py 2016-03-29 13:04:33 +11:00
Yaser Martinez Palenzuela 3c210f45fa make use of log_smooth_count 2016-03-17 12:19:52 +01:00
Matthew Honnibal fcaa0ad7ce Merge pull request #280 from wbwseeker/german_parser
German parser
2016-03-04 03:27:42 +11:00
Wolfgang Seeker 690c5acabf adjust train.py to train both english and german models 2016-03-03 15:21:00 +01:00
Matthew Honnibal 9d51e4d13c Delete gather_freqs.py
This script was in a broken state, and should be unnecessary. The functionality is subsumed by `get_freqs.py`
2016-03-02 00:42:55 +11:00
Yaser Martinez Palenzuela 1a93d7f725 replace codecs.open with io.open 2016-03-01 14:10:11 +01:00
Wolfgang Seeker eae35e9b27 add tokenizer files for German, add/change code to train German pos tagger
- add files to specify rules for German tokenization
- change generate_specials.py to generate from an external file (abbrev.de.tab)
- copy gazetteer.json from lang_data/en/

- init_model.py
	- change doc freq threshold to 0
- add train_german_tagger.py
	- expects conll09-formatted input
2016-02-18 13:24:20 +01:00
Henning Peters a89ca6537b fix cythonize 2016-02-05 16:17:23 +01:00
Henning Peters 3a50448bf3 py3 compatibility 2016-02-05 15:43:50 +01:00
Henning Peters 7627969aba refactor, listen on setup.py, *.pxd 2016-02-05 15:37:00 +01:00
Matthew Honnibal 5dc6cffc67 * Fix gather_freqs.py 2016-02-04 20:21:58 +01:00
Matthew Honnibal e2ed6251d7 * Fancy up the CLI for the conll train script 2016-02-02 22:58:06 +01:00
Matthew Honnibal a676d66807 * Update the CoNLL train script, to get working on other languages 2016-02-02 22:29:34 +01:00
Henning Peters 73674a4afb try using system-wide headers 2015-12-13 12:51:23 +01:00
Henning Peters 92fabd0114 wrap virtualenv around cythonize 2015-12-13 12:32:22 +01:00
Henning Peters 9662cf04c9 new approach to dependency headers 2015-12-13 11:53:02 +01:00
Matthew Honnibal 6e68b344c1 * Train after parsing, not before. 2015-11-12 04:43:52 +11:00
Matthew Honnibal 4fb038a9eb * Update conll_train.py script for spaCy v0.97 2015-10-31 00:53:51 +11:00
Matthew Honnibal cfaa4bde5d * Add train and parse scripts that use CoNLL formatted data 2015-10-30 12:54:49 +11:00
Matthew Honnibal 2348a08481 * Load/dump strings with a json file, instead of the hacky strings file we were using. 2015-10-22 21:13:03 +11:00
Matthew Honnibal 0ce12e4548 * Import io in get_freqs 2015-10-19 12:56:18 +11:00
Matthew Honnibal 17fffb4c57 * Update get_freqs.py script 2015-10-16 04:33:49 +11:00
Matthew Honnibal 5ff4454177 * Update get_freqs.py script 2015-10-16 04:31:15 +11:00
Matthew Honnibal a748146dd3 * Update get_freqs.py script 2015-10-16 04:24:50 +11:00
Matthew Honnibal a29fd79fbc * Update get_freqs.py script 2015-10-16 04:24:08 +11:00
Matthew Honnibal e08a4b46a2 * Update get_freqs.py script 2015-10-16 04:20:35 +11:00
Matthew Honnibal 92f750cf8b * Use a gzipped frequencies file in init_model 2015-10-11 06:59:44 +02:00
Matthew Honnibal 064bd69ad0 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-10 16:03:48 +11:00
Matthew Honnibal 83dccf0fd7 * Use io module insteads of deprecated codecs module 2015-10-10 14:13:01 +11:00
Matthew Honnibal f35632e2e5 * Remove SBD print statement in train, after SBD evaluation was removed from Scorer 2015-10-09 11:08:58 +02:00
Matthew Honnibal 6ea1601e93 * Add script to train models off the UD treebanks. Note that the UD data is restricted to research purposes only, and should only be used to train models for academic experiments. 2015-10-08 12:01:08 +11:00
Matthew Honnibal c503654ec1 * Update bin/parser/train for printing output. 2015-10-06 10:35:22 +11:00
alvations 8caedba42a caught more codecs.open -> io.open 2015-09-30 20:20:09 +02:00
alvations 764bdc62e7 caught another codecs.open 2015-09-30 20:16:52 +02:00
Matthew Honnibal 1ae55cb63a * Copy tag_map.json in init_model 2015-09-12 05:54:02 +02:00
Matthew Honnibal b2e82e55f6 * Create POS model dir in training script 2015-09-08 15:36:23 +02:00
Matthew Honnibal 5ad4527c42 * Rename Deutsch to German 2015-09-06 20:18:58 +02:00
Matthew Honnibal d1eea2d865 * Update train.py for language-generic spaCy 2015-09-06 17:51:48 +02:00
Matthew Honnibal 950ce36660 * Update init model 2015-09-06 17:51:30 +02:00
Matthew Honnibal b6b1e1aa12 * Add link for Finnish model 2015-08-27 10:26:02 +02:00
Matthew Honnibal 320ced276a * Add tagger training script 2015-08-27 09:15:41 +02:00
Matthew Honnibal dc13edd7cb * Refactor init_model to accomodate other languages 2015-08-26 19:14:05 +02:00
Matthew Honnibal bbf07ac253 * Cut down init_model to work on more languages 2015-08-24 01:05:20 +02:00
Matthew Honnibal 3ecacb9635 * Copy gazetteer file in init_model 2015-08-06 16:07:23 +02:00
Matthew Honnibal ddc1a5cfe5 * Fix training under python3 2015-07-28 14:09:30 +02:00
Matthew Honnibal 174ed1ad20 * Tighten the frequency filter in init_model 2015-07-27 21:44:51 +02:00
Matthew Honnibal 6047f2aa35 * Fix path to freqs.txt 2015-07-27 02:22:35 +02:00
Matthew Honnibal 0368889d6c * Support gzipped frequencies in init_model 2015-07-26 22:39:22 +02:00
Matthew Honnibal c4f20847da * Fix init_model for travis tests 2015-07-26 14:03:30 +02:00
Matthew Honnibal 09312b9353 * Fix init_model for travis tests 2015-07-26 13:55:47 +02:00
Matthew Honnibal 90ad717dc4 * Update default freq thresholds in init_model 2015-07-26 01:41:17 +02:00
Matthew Honnibal 6a5e035a48 * Ensure data files are copied for tokenizer in init_model 2015-07-26 01:36:19 +02:00
Matthew Honnibal ab93898ac6 * Make heuristics more explicit in init_model 2015-07-26 00:22:19 +02:00
Matthew Honnibal 5c04dcd7c1 * Fix init_model 2015-07-25 23:33:02 +02:00
Matthew Honnibal fd525f0675 * Pass OOV probability around 2015-07-25 23:29:51 +02:00
Matthew Honnibal 5b6bf4d4a6 * Remove probability cap on lexicon 2015-07-25 23:05:51 +02:00
Matthew Honnibal c62eb110c0 * Fix merge conflict in init_model 2015-07-25 23:04:30 +02:00
Matthew Honnibal 0301472d15 * Fix init_model 2015-07-25 22:56:35 +02:00
Matthew Honnibal 8e800adfbc * Fix init_model 2015-07-25 22:54:08 +02:00
Matthew Honnibal 5f183098e4 Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-07-25 22:37:04 +02:00
Matthew Honnibal 6076213c16 * Fix init_model script 2015-07-25 22:35:52 +02:00
Matthew Honnibal 1a99eb69da Merge branch 'master' of https://github.com/honnibal/spaCy 2015-07-25 22:19:48 +02:00
Matthew Honnibal ef448649b3 * Add read_freqs function in init_model 2015-07-25 22:16:36 +02:00
Matthew Honnibal 2e6a60eaec Merge branch 'master' of https://github.com/honnibal/spaCy 2015-07-25 21:14:07 +02:00
Matthew Honnibal 105305b4aa * Upd get_freqs script 2015-07-25 21:13:41 +02:00
Matthew Honnibal 616445e027 * Add simple script to collate frequencies from sorted file 2015-07-25 21:12:45 +02:00
Matthew Honnibal c52179f5fa * Use print function in train.py, for py 2/3 compatibility 2015-07-24 04:52:35 +02:00
Matthew Honnibal 6be3ee311c Py3 compatibility tweak 2015-07-23 13:13:15 +02:00
Matthew Honnibal d4407d8e2f Py3 compatibility tweak 2015-07-23 09:45:15 +02:00
Matthew Honnibal da4821fc14 * Add cluster words to probs in init_model 2015-07-23 09:27:07 +02:00
Matthew Honnibal 4af2595d99 * Fix structure of wordnet directory for init_model 2015-07-23 06:35:38 +02:00
Matthew Honnibal 83c0f0da22 * Remove lemmatizer from init_model 2015-07-23 02:32:34 +02:00
Matthew Honnibal 4729200dfc * Whitespace 2015-07-23 01:19:26 +02:00
Matthew Honnibal 2b7bd46508 * Update get_freqs script 2015-07-22 15:43:06 +02:00
Matthew Honnibal 386246db5b * Update init_model, making language resources optional 2015-07-22 00:25:14 +02:00
Matthew Honnibal 317cbbc015 * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. 2015-07-19 15:18:17 +02:00
Matthew Honnibal a6ff7e6ca4 * Fix redundant options in train.py 2015-07-17 22:38:05 +02:00
Matthew Honnibal 6cfa83157e Merge branch 'refactor' of ssh://github.com/honnibal/spaCy into refactor 2015-07-17 21:38:04 +02:00
Matthew Honnibal 38ca0c33f5 Merge branch 'neuralnet' into refactor
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.

Conflicts:
	setup.py
	spacy/syntax/parser.pxd
	spacy/syntax/parser.pyx
	spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal af54d05d60 * Remove sense stuff from init_model 2015-07-14 10:56:17 +02:00
Matthew Honnibal 3de1b3ef1d * Change get_freqs to take a list of files 2015-07-14 10:55:56 +02:00
Matthew Honnibal 39c93116eb * Add get_freqs script 2015-07-14 02:31:32 +02:00
Matthew Honnibal 62cfcd76fe * Add supersense sets to lexemes, from WordNet. Look-up via lemmatization. 2015-07-01 18:48:59 +02:00
Matthew Honnibal 31b5e58aeb * Begin reorganizing neuralnet work 2015-06-30 14:26:53 +02:00
Matthew Honnibal 1135cfe50a * Tidy nn_train a bit 2015-06-29 16:45:14 +02:00
Matthew Honnibal df8179ca4f * Add separate Param and AdadeltaParam classes. AdadeltaParam seems broken. 2015-06-29 16:39:16 +02:00
Matthew Honnibal 1dff04acb5 * Apply regularization to the softmax, not the bias 2015-06-29 11:45:38 +02:00
Matthew Honnibal ca30fe1582 * Use He initialization trick 2015-06-29 10:56:02 +02:00
Matthew Honnibal fc34e1b6e4 * Move Theano functions into nn_train.py script 2015-06-29 07:09:16 +02:00
Matthew Honnibal fe7b24ecef * whitespace 2015-06-28 11:37:17 +02:00
Matthew Honnibal 7b8275fcc4 * Wire hyperparameters to script interface 2015-06-28 11:37:17 +02:00
Matthew Honnibal 897dd0dd0b * Merge changes, and adjust Example to use memoryview 2015-06-28 11:36:11 +02:00
Matthew Honnibal ef97b90833 * Fix token scoring 2015-06-28 06:22:18 +02:00
Matthew Honnibal 34c0ef2ee8 * Don't compile the orig_arc_eager and tree_arc_eager modules used for the EMNLP paper 2015-06-23 05:38:17 +02:00
Matthew Honnibal 59e9f9153c * Remove projectivity constraint in train.py, but raise Exception if non-projective sentence is encountered, since we've told GoldParse to projectivize 2015-06-23 05:04:46 +02:00