1588 Commits

Author SHA1 Message Date
Wolfgang Seeker d65ef41d08 make error messages language independent 2016-03-24 11:47:09 +01:00
Henning Peters 963570aa49 Merge branch 'master' of github.com:spacy-io/spaCy 2016-03-24 11:19:47 +01:00
Henning Peters a7d7ea3afa first idea for supporting multiple langs in download script 2016-03-24 11:19:43 +01:00
Wolfgang Seeker 5080077097 revert init_model.py back to pre-german state (because it makes more sense)
simplify token.n_rights and token.n_lefts
2016-03-21 16:10:25 +01:00
Wolfgang Seeker 5e2e8e951a add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks

the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Matthew Honnibal 80134eb12d Merge branch 'master' of https://github.com/spacy-io/spaCy 2016-03-15 19:14:50 +00:00
Wolfgang Seeker 2ae253ef5b changed head.__set__ to make it simpler 2016-03-14 13:43:48 +01:00
Henning Peters c12d3dd200 add __init__.py to empty package dirs 2016-03-14 11:28:03 +01:00
Henning Peters 54f3447b5f cleanup 2016-03-14 01:46:33 +01:00
Wolfgang Seeker 46e3f979f1 add function for setting head and label to token
change PseudoProjectivity.deprojectivize to use these functions
2016-03-11 17:31:06 +01:00
Wolfgang Seeker 03fb498dbe introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker bc9c62e279 replace Language functions with corresponding orth functions
implement punctuation functions in orth
2016-03-09 18:07:37 +01:00
Wolfgang Seeker d9312bc9ea add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators 2016-03-09 16:18:48 +01:00
Matthew Honnibal 1508528c8c * Increment version 2016-03-08 15:58:45 +00:00
Matthew Honnibal 963fe5258e * Add missing __contains__ method to vocab 2016-03-08 15:49:10 +00:00
Matthew Honnibal 478aa21cb0 * Remove broken __reduce__ method on vocab 2016-03-08 15:48:21 +00:00
Matthew Honnibal 20235bde00 Merge pull request #282 from henningpeters/switch_vectors
initial proposal for ability to switch vectors
2016-03-09 01:39:41 +11:00
Henning Peters eb7ae61b1c cleanup api 2016-03-08 12:59:18 +01:00
Henning Peters b740f20191 hash_string() should not depend on python's internal unicode representation, also fixes https://github.com/spacy-io/sense2vec/issues/5 for py2 2016-03-06 09:19:27 +01:00
Henning Peters aa4d964c14 cleanup api 2016-03-05 17:51:32 +01:00
Henning Peters 931c07a609 initial proposal for separate vector package 2016-03-04 11:09:06 +01:00
Wolfgang Seeker 7adbd7a785 replace Counter with normal dict 2016-03-03 21:36:27 +01:00
Wolfgang Seeker 1ae487a4f6 add backwards compatibility with python 2.6 2016-03-03 21:18:12 +01:00
Wolfgang Seeker 9d1e6de4a0 make a proper list from zip iterator 2016-03-03 19:51:01 +01:00
Wolfgang Seeker 49f9d1c085 change test_nonproj.py to not use zip inside numpy.asarray 2016-03-03 19:42:09 +01:00
Wolfgang Seeker 72b8df0684 turned PseudoProjectivity into a normal python class 2016-03-03 19:05:08 +01:00
Matthew Honnibal fcaa0ad7ce Merge pull request #280 from wbwseeker/german_parser
German parser
2016-03-04 03:27:42 +11:00
Wolfgang Seeker 690c5acabf adjust train.py to train both english and german models 2016-03-03 15:21:00 +01:00
Wolfgang Seeker 3448cb40a4 integrated pseudo-projective parsing into parser
- nonproj.pyx holds a class PseudoProjectivity which currently holds
  all functionality to implement Nivre & Nilsson 2005's pseudo-projective
  parsing using the HEAD decoration scheme
- changed lefts/rights in Token to account for possible non-projective
  structures
2016-03-01 10:09:08 +01:00
Wolfgang Seeker 56b7210e82 moved nonproj.py to syntax/nonproj.pyx 2016-02-25 15:08:49 +01:00
Henning Peters f3df736e0a remove unidecode-related test 2016-02-24 18:22:22 +01:00
Wolfgang Seeker 4b2297d5d4 add class PseudoProjective for pseudo-projective parsing
PseudoProjective() implements the algorithm from Nivre & Nilsson 2005
using their HEAD decoration scheme.
2016-02-24 11:26:25 +01:00
Henning Peters 12d58a7099 remove text-unidecode dependency 2016-02-24 08:01:59 +01:00
Wolfgang Seeker 8d531c958b replace tests for non-projectivity
- add functions to find non-projective edges
- add test file for non-projectivity functions
2016-02-22 14:40:40 +01:00
Matthew Honnibal 141639ea3a * Fix bug in tokenizer that caused new tokens to be added for affixes 2016-02-21 23:17:47 +00:00
Wolfgang Seeker eae35e9b27 add tokenizer files for German, add/change code to train German pos tagger
- add files to specify rules for German tokenization
- change generate_specials.py to generate from an external file (abbrev.de.tab)
- copy gazetteer.json from lang_data/en/

- init_model.py
	- change doc freq threshold to 0
- add train_german_tagger.py
	- expects conll09-formatted input
2016-02-18 13:24:20 +01:00
Henning Peters 9cc4f8d5b3 avoid shadowing __name__ 2016-02-15 01:33:39 +01:00
Henning Peters 4c9e3c7911 upgrade sputnik, enforce data api via model version constraints 2016-02-14 16:03:17 +01:00
Henning Peters 9d8966a2c0 Update test_tokenizer.py 2016-02-10 19:24:37 +01:00
Henning Peters 3b5f1e753b py26 compatibility 2016-02-10 14:32:54 +01:00
Henning Peters ee1f1ac300 mark test_sentence_space() as model test 2016-02-10 07:49:11 +01:00
Matthew Honnibal 5d96b3ef4f * Increment version 2016-02-07 13:48:58 +01:00
Matthew Honnibal 1b83cb9dfa * Fix Issue #251: Incorrect right edge calculation on left-clobber low in the tree 2016-02-07 00:00:42 +01:00
Matthew Honnibal c6623889c1 * Add test for Issue #251: Incorrect right edges, caused by bad update to r_edge in del_arc, triggered from non-monotonic left-arc 2016-02-06 23:47:51 +01:00
Matthew Honnibal a95974ad3f * Fix oov probability 2016-02-06 15:13:55 +01:00
Matthew Honnibal af8514cb0c * Refine the way the is_parsed attribute is set by from_array 2016-02-06 14:44:35 +01:00
Matthew Honnibal 161b01d4c0 * Tweak usage example for multi-processing 2016-02-06 14:44:11 +01:00
Matthew Honnibal 7f24229f10 * Don't try to pickle the tokenizer 2016-02-06 14:09:05 +01:00
Matthew Honnibal dcb401f3e1 * Remove broken Vocab pickling 2016-02-06 14:08:47 +01:00
Matthew Honnibal e66d45bf66 * Restore previous patch to Span.root, as it seems it wasn't the cause of the problem. 2016-02-06 13:37:41 +01:00
Matthew Honnibal 4412a70dc5 * Initialize StateC._empty_token to 0, to avoid undefined behaviour. 2016-02-06 13:34:38 +01:00
Matthew Honnibal 1b41f868d2 * Check for errors in parser, and parallelise the left-over batch 2016-02-06 10:06:30 +01:00
Matthew Honnibal 031b00cb91 * Fix Span.root calculation 2016-02-05 20:12:09 +01:00
Matthew Honnibal 165ca28b80 * Set is_parsed flag in Parser.pipe 2016-02-05 19:51:44 +01:00
Matthew Honnibal bdd579db0a * Set is_parsed flag in Parser.pipe 2016-02-05 19:50:11 +01:00
Matthew Honnibal 7119e77fb6 * Fix Matcher.pipe 2016-02-05 19:46:02 +01:00
Matthew Honnibal 1cf0100bf6 * Add test for multithreading 2016-02-05 19:38:22 +01:00
Matthew Honnibal b04c9aad71 * Fix off-by-one in Parser.pipe 2016-02-05 19:37:50 +01:00
Matthew Honnibal e5c447e237 * Questionable fix to problem in Span.root 2016-02-05 19:18:35 +01:00
Matthew Honnibal 1ef84a0557 * Merge master into rethinc2 2016-02-05 12:55:59 +01:00
Matthew Honnibal 4cf34fc170 Merge branch 'rethinc2' of ssh://github.com/honnibal/spaCy into rethinc2 2016-02-05 12:48:28 +01:00
Matthew Honnibal 249dccbe95 * Fix Language.pipe 2016-02-05 12:47:57 +01:00
Matthew Honnibal c0e63feccc * xfail pickle tests 2016-02-05 12:46:58 +01:00
Matthew Honnibal 6aa92b70f1 * Fix merge problem in span 2016-02-05 12:46:11 +01:00
Matthew Honnibal 048dfe35aa * cimport cython.parallel 2016-02-05 12:20:42 +01:00
Matthew Honnibal af58f273b3 * Fix spacy.language.pipe 2016-02-05 12:20:29 +01:00
Matthew Honnibal 8a13cebdcc * Update for modified thinc interface 2016-02-05 11:44:39 +01:00
Matthew Honnibal 48ce09687d * Skip pickling the vocab in the tests 2016-02-04 15:51:19 +01:00
Matthew Honnibal 419edfab50 * Use generic flags for the new attributes until they're added 2016-02-04 15:50:54 +01:00
Matthew Honnibal c4017a06d9 * Add placeholders for the new flags in attrs and symbols 2016-02-04 15:49:45 +01:00
Matthew Honnibal e5c96c969f * Wire up new attributes 2016-02-04 13:04:58 +01:00
Matthew Honnibal 9703ccc3de * Remove unused import 2016-02-04 13:04:33 +01:00
Matthew Honnibal 11810be33e * Add Python hooks for is_bracket/is_quote/is_left_punct/is_right_punct 2016-02-04 13:04:16 +01:00
Matthew Honnibal fe611132f0 * Add stubs for is_bracket/is_quote/is_left_punct/is_right_punct functions 2016-02-04 13:03:04 +01:00
Matthew Honnibal ee975d36d0 * Add stubs to test is_bracket/is_quote/is_left_punct/is_right_punct functions 2016-02-04 13:02:25 +01:00
Matthew Honnibal f9e765cae7 * Add pipe() method to tokenizer 2016-02-03 02:32:37 +01:00
Matthew Honnibal 4cbad510ff * Fix calculation of head for spans with punctuation. 2016-02-03 02:32:21 +01:00
Matthew Honnibal 84b247ef83 * Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading. 2016-02-03 02:10:58 +01:00
Matthew Honnibal fcfc17a164 Merge branch 'master' into rethinc2 2016-02-02 23:05:34 +01:00
Matthew Honnibal f204daf27b * Add error warning that a gold tag is unrecognised 2016-02-02 22:59:59 +01:00
Matthew Honnibal 99b8906100 * Accept punct_labels as an argument to the scorer 2016-02-02 22:59:06 +01:00
Matthew Honnibal 59123443e2 * Check for presence/absence of the different models in Language.end_training 2016-02-02 22:49:55 +01:00
Matthew Honnibal 9e9d4c8706 * Fix stupid error in Language.batch 2016-02-01 09:49:32 +01:00
Matthew Honnibal e3db39dd21 * Fix compiler warning about signed/unsigned comparison 2016-02-01 09:08:07 +01:00
Matthew Honnibal 98fbdf2856 * Add Language.batch() method, to support multi-threaded jobs 2016-02-01 09:01:13 +01:00
Matthew Honnibal b3802562d6 Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2 2016-02-01 08:59:24 +01:00
Matthew Honnibal 4b08a3fafd * Fix merge conflict 2016-02-01 08:58:18 +01:00
Matthew Honnibal 5188f6d9d8 * Fix parseC function 2016-02-01 08:48:48 +01:00
Matthew Honnibal bcf8f7ba40 * Add a parse_batch method to Parser, that releases the GIL around a batch of documents. 2016-02-01 08:34:55 +01:00
Matthew Honnibal d5579cd0d8 Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2 2016-02-01 03:08:49 +01:00
Matthew Honnibal 490ba65398 * Use openmp in parser 2016-02-01 03:08:42 +01:00
Matthew Honnibal cb78d91ec5 * Fix ArcEager.set_valid 2016-02-01 03:07:37 +01:00
Matthew Honnibal 28e5ad62bc * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 03:00:15 +01:00
Matthew Honnibal a47f00901b * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 02:58:14 +01:00
Matthew Honnibal daaad66448 * Now fully proxied 2016-02-01 02:37:08 +01:00
Matthew Honnibal 7a0e3bb9c1 * Continue proxying. Some problem currently 2016-02-01 02:22:21 +01:00
Matthew Honnibal 2169bbb7ea * Shadow StateClass with StateC, to start proxying 2016-02-01 01:16:14 +01:00
Matthew Honnibal 2fa228458e * Add _state file, which StateClass will proxy to 2016-02-01 01:09:21 +01:00
Matthew Honnibal 6bb007d16e * Make set_parse nogil 2016-01-30 20:27:52 +01:00
Matthew Honnibal 9410e74c92 * Switch parser to use nogil functions 2016-01-30 20:27:07 +01:00
Matthew Honnibal 10877a7791 * Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser 2016-01-30 14:31:36 +01:00
Matthew Honnibal ea4ff94cde * Whitespace 2016-01-29 03:59:22 +01:00
Matthew Honnibal b0718b6ee1 * Move to thinc 5.0 2016-01-29 03:58:55 +01:00
Matthew Honnibal 9721502c81 * Update version 2016-01-25 15:52:59 +01:00
Matthew Honnibal 907e8cf07d * Add u prefix to string in web example 2016-01-25 15:51:38 +01:00
Matthew Honnibal eba03695ef * Comment out pickle tests 2016-01-25 15:51:13 +01:00
Matthew Honnibal de94e6c525 * Mark pickle tests as xfail, due to temp files problem 2016-01-25 15:24:17 +01:00
Matthew Honnibal 87172a15c6 * Fix runtime error bug that arose from updated Span.root function. 2016-01-25 15:22:42 +01:00
Matthew Honnibal 2c8dd91785 * Fix first code example on the website 2016-01-23 18:09:19 +01:00
Matthew Honnibal 3af84cfd6e * Increment version 2016-01-21 17:49:27 +01:00
Henning Peters 65aeac24cb remove package version constraint 2016-01-21 17:40:51 +01:00
Matthew Honnibal 792c98a438 * Increment version for OSX-fixed release of v0.100 2016-01-21 00:23:04 +01:00
Matthew Honnibal 82d011ac43 * Fix test for whitespace 2016-01-19 20:38:26 +01:00
Matthew Honnibal e89069dcae * Fix matcher test 2016-01-19 20:24:01 +01:00
Matthew Honnibal 63e3d4e27f * Add comment on Vocab.__reduce__ 2016-01-19 20:11:25 +01:00
Matthew Honnibal e1282b7f2f * Require user-custom NER classes to work without adding the label. 2016-01-19 20:11:03 +01:00
Matthew Honnibal 84c5dfbfc3 * Clean up debugging python list 2016-01-19 20:10:32 +01:00
Matthew Honnibal 04d0686b26 * Make TransitionSystem.add_action idempotent, i.e. ignore duplicate added actions. 2016-01-19 20:10:04 +01:00
Matthew Honnibal c4a89d56bd * Automatically register any entity types pre-set on the tokens, so that the NER works with user-given entity types. 2016-01-19 20:09:26 +01:00
Matthew Honnibal f0f92793f6 * Add test for user NER classes in matcher blocking the NER model. Re Issue #178 and Issue #217 2016-01-19 19:23:16 +01:00
Matthew Honnibal 65c5bc4988 * Add add_label method, to allow users to register new entity types and dependency labels. 2016-01-19 19:11:02 +01:00
Matthew Honnibal 151aa0b0e2 * Allow users to add_label, in order to extend the entity recogniser to new classes. Does not by itself add a class to the model 2016-01-19 19:09:33 +01:00
Matthew Honnibal c8e0011ebc * Add iterators to the NER and parser transition systems, to get the action types 2016-01-19 19:07:43 +01:00
Matthew Honnibal 515493c675 * Add xfail test for Issue #225: tokenization with non-whitespace delimiters 2016-01-19 13:20:14 +01:00
Matthew Honnibal 7abe653223 * Fix imports 2016-01-19 03:36:51 +01:00
Matthew Honnibal 590f38bdb2 * Add hacky solution to Issue #220. Currently specials.json only supports literal patterns, which doesn't allow us to pre-tag whitespace with the correct token, SP, as a rule. The data-driven approach should be easy but for some reason fails here. Adding a hard code in Morphology isn't a good solution, but we do want to fix the behaviour right away, and don't want to wait for an architecturally better solution. 2016-01-19 03:35:20 +01:00
Matthew Honnibal 445164d5b4 * Restore the LOCAL_DATA_DIR global in spacy/en/__init__.py, although this is now deprecated 2016-01-19 02:54:56 +01:00
Matthew Honnibal 04177debd0 * Unwind limit to sentence boundary detection that prevents it from inserting boundaries on whitespace. Replace it with a check for whitespace in StateClass.fast_forward, so that whitespace is LeftArced when it's on the stack. This should prevent the previous problem of whitespace-only sentences. Should fix Issue #184, but may cause further problems. Needs testing. 2016-01-19 02:54:15 +01:00
Matthew Honnibal 7893de3203 * Add test for Issue #184: Whitespace at sentence boundary causes sentence boundary error. 2016-01-18 23:04:38 +01:00
Matthew Honnibal bba0a5e078 * Handle string paths in default_vocab, default_parser, default_entity in Language class 2016-01-18 22:37:24 +01:00
Matthew Honnibal e825fd9554 * Make some of the website tests work without models 2016-01-18 18:14:44 +01:00
Matthew Honnibal 334c4b2b57 * Disprefer punctuation and spaces as heads of spans 2016-01-18 18:14:09 +01:00
Matthew Honnibal bed36ab0ff * Fix import of HEAD attribute 2016-01-18 17:34:43 +01:00
Matthew Honnibal 28c659c1fe * Fix import for numpy 2016-01-18 17:25:04 +01:00
Matthew Honnibal fc36bcf458 * Fix import for English 2016-01-18 17:14:40 +01:00
Matthew Honnibal cc4c335e14 * Set heads for test_merge_tokens, to make the test run without models 2016-01-18 17:00:11 +01:00
Matthew Honnibal c107da9738 * Bug fix to _count_words_to_root 2016-01-18 16:59:38 +01:00
Matthew Honnibal f24833d607 * Fix merge for coordinations 2016-01-18 16:03:19 +01:00
Matthew Honnibal 14534958a9 * Fix bug in Span.root 2016-01-18 15:40:28 +01:00
Matthew Honnibal 714cbc03d5 * Add test for Issue #203: nested noun chunks. 2016-01-16 18:02:30 +01:00
Matthew Honnibal 4e2253170c * Move test for doc.merge to tokens_api file, to avoid name conflicts which upset pytest 2016-01-16 18:01:36 +01:00
Matthew Honnibal 34a157511f * Move test_merge_hang to test_tokens_api 2016-01-16 18:00:26 +01:00
Matthew Honnibal fc8f26584a * Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203, but might be problematic. Also allow root NPs to be considered noun chunks. 2016-01-16 17:52:40 +01:00
Matthew Honnibal 4a16dbfeca * Add test for Issue #203: noun chunks should be flat, but sometimes are nested 2016-01-16 17:41:25 +01:00
Matthew Honnibal 995b2d18fd * Route token.string via token.text_with_ws, to deprecate token.string in future 2016-01-16 17:14:34 +01:00
Matthew Honnibal 54a98eaf19 * Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future 2016-01-16 17:13:50 +01:00
Matthew Honnibal 3e9961d2c4 * If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154 2016-01-16 17:08:59 +01:00
Matthew Honnibal 223d2b3484 * Add test for Issue #154: Additional whitespace introduced when string ends with a whitespace token. 2016-01-16 17:08:07 +01:00
Matthew Honnibal 3dc398b727 * Fix merge conflict in requirements.txt 2016-01-16 16:20:49 +01:00
Matthew Honnibal fc5962a77d * Improve test for root token in Span 2016-01-16 16:19:09 +01:00