1345 Commits

Author SHA1 Message Date
Matthew Honnibal 01ab464383 * Prevent Begin and In moves from applying in NER if we're at the last token of a sentence, as this would mean the entity would span over a sentence boundary. Re Issue #169 2015-11-07 05:30:44 +11:00
Matthew Honnibal b65633f270 * Fix function that returns nth entity in StateClass. Was only returning the first. 2015-11-07 05:29:11 +11:00
Matthew Honnibal 410b6f9ec1 * Remove deprecated _ml.pyx. We now use the nicer APIs provided by thinc 4.0, and subclass the AveragedPerceptron class. 2015-11-07 05:13:10 +11:00
Matthew Honnibal 3c162dcac3 * Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc. 2015-11-07 03:24:30 +11:00
Matthew Honnibal 9d1b2a103a * Fix capitalization in lemmatizer 2015-11-06 05:44:35 +11:00
Matthew Honnibal 6ed3aedf79 * Merge vocab changes 2015-11-06 00:48:08 +11:00
Matthew Honnibal 72abbb43fb * Add type declarations in strings.pyx 2015-11-06 00:47:26 +11:00
Matthew Honnibal 5b2af4864f * When lemmatizing non-noun, non-verb, non-adj words, output lower-case 2015-11-06 00:45:09 +11:00
Matthew Honnibal 754bf04162 * Remove declaration of Model.update 2015-11-06 00:31:15 +11:00
Matthew Honnibal e18bdff23a Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-11-06 00:26:15 +11:00
Matthew Honnibal b9991fbd20 * Update to use thinc 3.0 2015-11-06 00:25:59 +11:00
Matthew Honnibal 864a8f45d8 * Use unicode in StringStore.intern, instead of unreliably casting to bytes. 2015-11-05 11:32:19 +00:00
Matthew Honnibal b18204cd52 * Fix StringStore._realloc, re Issue #155 2015-11-05 11:28:26 +00:00
Matthew Honnibal f8004c5f65 * Begin upgrading to improved thinc API 2015-11-05 03:53:03 +11:00
Matthew Honnibal adc7bbd6cf * Fix name of like_num in default_lex_attrs 2015-11-04 22:02:47 +11:00
Matthew Honnibal e96faf29e7 * Rename like_number to like_num, to fix inconsistency re Issue #166 2015-11-04 22:01:44 +11:00
Matthew Honnibal 65934b7cd4 * Enforce import of ujson in strings.pyx, because otherwise it's too slow 2015-11-04 00:32:02 +11:00
Matthew Honnibal 1ce5d5602d * Rename Doc.data to Doc.c 2015-11-04 00:17:13 +11:00
Matthew Honnibal 68f479e821 * Rename Doc.data to Doc.c 2015-11-04 00:15:14 +11:00
Matthew Honnibal 3ddea19b2b * Rename spans.pyx to span.pyx 2015-11-04 00:14:40 +11:00
Matthew Honnibal 9482d616bc * Rename spans.pyx to span.pyx 2015-11-03 23:51:05 +11:00
Matthew Honnibal 116da5990a * Clean up setting of tag in doc.from_bytes 2015-11-03 23:48:57 +11:00
Matthew Honnibal 9ec7b9c454 * Clean up unused Constituent struct. 2015-11-03 23:48:21 +11:00
Matthew Honnibal 1e99fcd413 * Rename .repvec to .vector in C API 2015-11-03 23:47:59 +11:00
Matthew Honnibal ee3f9ba581 * Fix test of serializer 2015-11-03 19:45:16 +11:00
Matthew Honnibal d06ba26371 * Fix test of serializer 2015-11-03 19:43:27 +11:00
Matthew Honnibal 4083059650 Merge branch 'master' of https://github.com/honnibal/spaCy 2015-11-03 09:07:19 +01:00
Matthew Honnibal 9e37437ba8 * Fix assign_tag in doc.merge 2015-11-03 19:07:02 +11:00
Matthew Honnibal dde9e1357c * Add todo to morphology.lemmatize 2015-11-03 18:54:35 +11:00
Matthew Honnibal ffedff9e6c * Remove the archive after download, to save disk space 2015-11-03 18:54:05 +11:00
Matthew Honnibal 85372468e3 * Fix serialize test 2015-11-03 08:51:33 +01:00
Matthew Honnibal 833eb35c57 * Fix tag assignment in doc.from_array 2015-11-03 18:45:54 +11:00
Matthew Honnibal 09664177d7 * Fix tag handling in doc.merge, and assign sent_start when setting heads. 2015-11-03 18:15:52 +11:00
Matthew Honnibal 389a373807 Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-11-03 18:07:25 +11:00
Matthew Honnibal 3f44b3e43f * Mark serializer test as requiring models 2015-11-03 18:07:08 +11:00
Matthew Honnibal 25ed7be8f8 Merge branch 'master' of https://github.com/honnibal/spaCy 2015-11-03 07:58:17 +01:00
Matthew Honnibal 604ceac4c6 * Fix morphological assignment in doc.merge() 2015-11-03 17:57:51 +11:00
Matthew Honnibal 5e040855a5 * Ensure morphological features and lemmas are loaded in from_array, re Issue #152 2015-11-03 17:56:50 +11:00
Matthew Honnibal 5668feb235 * Fix pickle test for python3 2015-11-03 04:57:02 +01:00
Matthew Honnibal 6161d2529a Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-11-03 13:36:30 +11:00
Matthew Honnibal 5887506f5d * Don't expect lexemes.bin in Vocab 2015-11-03 13:23:39 +11:00
Matthew Honnibal f7dd377575 * Adjust conjuncts iterator in Token 2015-11-03 13:23:22 +11:00
Andreas Grivas d418f00eb1 fixed error when printing unicode 2015-11-02 20:23:18 +02:00
Matthew Honnibal 52fc338001 * Set is_parsed and is_tagged attrs when loading annotations into Doc, re Issue #152 2015-10-28 10:43:22 +11:00
Matthew Honnibal 1c0356e4c2 * Set test file mode to w+t 2015-10-26 22:40:48 +11:00
Matthew Honnibal 0fe98f358b * Fix mode on text file for Python3 in strings test 2015-10-26 22:25:16 +11:00
Matthew Honnibal 8ba9cf905e * Fix mode on text file for Python3 in strings test 2015-10-26 21:44:34 +11:00
Matthew Honnibal a0730699b1 * Fix mode on text file for Python3 in strings test 2015-10-26 21:25:56 +11:00
Matthew Honnibal 725344d349 * Fix tempfile in test 2015-10-26 21:08:18 +11:00
Matthew Honnibal f11030aadc * Remove out-dated TODO comment 2015-10-26 12:33:38 +11:00
Matthew Honnibal a371a1071d * Save and load word vectors during pickling, re Issue #125 2015-10-26 12:33:04 +11:00
Matthew Honnibal a824a98312 * Add tests for pickling vectors, re: Issue #125 2015-10-26 12:31:05 +11:00
Matthew Honnibal 314090cc78 * Set vectors length when unpickling vocab, re Issue #125 2015-10-26 12:05:08 +11:00
Matthew Honnibal 4e16f9e435 * Move tests underneath spacy/ 2015-10-26 00:07:31 +11:00
Matthew Honnibal 3a6e48e814 Merge pull request #149 from chrisdubois/pickle-patch: Add __reduce__ to Tokenizer so that English pickles. 2015-10-25 15:30:31 +11:00
Chris DuBois dac8fe7bdb Add __reduce__ to Tokenizer so that English pickles. - Add tests to test_pickle and test_tokenizer that save to tempfiles. 2015-10-23 22:24:03 -07:00
Matthew Honnibal ff4fe524ee * Fix exception for python 2 2015-10-23 01:56:13 +02:00
Matthew Honnibal 341a3e85cd * Upd downloaded data version 2015-10-23 00:56:57 +02:00
Matthew Honnibal f18fd8c659 * Fix language.py for change in StringStore load API 2015-10-23 03:48:12 +11:00
Matthew Honnibal 23855db3ca Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop 2015-10-23 03:46:09 +11:00
Matthew Honnibal 4f13849065 Merge pull request #145 from henningpeters/master: better error reporting, cleanup 2015-10-23 03:45:47 +11:00
Matthew Honnibal 3be94be0c0 Merge pull request #148 from maxirmx/master: Utf8 encoding for lemma_rules.json 2015-10-22 21:46:28 +11:00
Matthew Honnibal c86bda8d1a * Fix import of uget 2015-10-22 21:13:56 +11:00
Matthew Honnibal 2348a08481 * Load/dump strings with a json file, instead of the hacky strings file we were using. 2015-10-22 21:13:03 +11:00
Matthew Honnibal 9baf0abd59 * Save vocab after training. 2015-10-22 21:09:14 +11:00
maxirmx f07e4accd7 Fixing encoding issue #4 2015-10-21 20:45:56 +03:00
maxirmx fcbfff043f Fixing encoding issue #3 2015-10-21 15:52:34 +03:00
maxirmx fe9d2e2c4e Fixing encode issue #2 2015-10-21 15:36:21 +03:00
maxirmx e4a1726f77 Fixing encoding issue: UTF-8 2015-10-21 14:16:37 +03:00
Andreas Grivas 93ada458e2 added __repr__ that prints text in ipython for doc, token, and span objects 2015-10-21 14:11:46 +03:00
Henning Peters ccffd2ef53 fixed extract directory 2015-10-21 07:59:34 +02:00
Henning Peters da4c9cee06 assert filename match 2015-10-20 19:33:59 +02:00
Henning Peters 4f703f0cb4 better error reporting, cleanup 2015-10-20 19:11:29 +02:00
Matthew Honnibal 9cdea6e450 * Import uget correctly 2015-10-19 08:32:41 +02:00
Matthew Honnibal 6727a46bb5 * Fix Issue #118: Matcher behaves unpredictably when matches overlap. 2015-10-19 16:45:32 +11:00
Matthew Honnibal 135062d23c * Fix error with merged text when merged region did not have trailing whitespace 2015-10-19 15:47:04 +11:00
Henning Peters bfde91fa49 add custom download tool (uget), replace wget with uget 2015-10-18 12:35:04 +02:00
Matthew Honnibal 9839cd2c0b * Fix whitespace_ calculation in Token 2015-10-18 17:21:11 +11:00
Matthew Honnibal c99285b8b9 * Clean up C++ usage in spacy/matcher.pyx 2015-10-18 17:20:50 +11:00
Matthew Honnibal a7e6c5ac8f * Fix Issue #122: Incorrect calculation of children after Doc.merge() 2015-10-18 17:17:27 +11:00
Matthew Honnibal 3ba66f2dc7 * Add string length cap in Tokenizer.__call__ 2015-10-16 04:54:16 +11:00
Matthew Honnibal 6e0f985afc * Fix token.conjuncts 2015-10-15 03:49:45 +11:00
Matthew Honnibal 2e0104ac81 * Fix token.conjuncts 2015-10-15 03:47:45 +11:00
Matthew Honnibal b8f3345a82 * Fix token.conjuncts method 2015-10-15 03:36:01 +11:00
Matthew Honnibal 23818f89b8 * Fix token.conjuncts method 2015-10-15 03:34:57 +11:00
Matthew Honnibal 7a15d1b60c * Add Python 2/3 compatibility fix for copy_reg 2015-10-13 20:04:40 +11:00
Matthew Honnibal 329ae57520 * Fix whitespace attachment thing 2015-10-13 09:46:38 +02:00
Matthew Honnibal 37919eac82 * Fix whitespace attachment in simpler way. Leaves problem with setting left/right children. 2015-10-13 18:23:24 +11:00
Matthew Honnibal c70eb776ae * Fix whitespace attachment, so that left/right children are consistent with head. 2015-10-13 15:58:22 +11:00
Matthew Honnibal 531182f937 * Fix Model.__reduce__ 2015-10-13 15:14:38 +11:00
Matthew Honnibal 6c227a6c1f * Fix Model.__reduce__ 2015-10-13 15:10:04 +11:00
Matthew Honnibal 358c82595c * Fix NAMES list in spacy/parts_of_speech.pyx 2015-10-13 14:18:45 +11:00
Matthew Honnibal c1fdc487bc Merge branch 'attrs' 2015-10-13 14:03:41 +11:00
Matthew Honnibal e886e6a406 * Inc version 2015-10-13 13:46:17 +11:00
Matthew Honnibal 20fd36a0f7 * Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve. 2015-10-13 13:44:41 +11:00
Matthew Honnibal f8de403483 * Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125 2015-10-13 13:44:41 +11:00
Matthew Honnibal 85e7944572 * Start trying to pickle Vocab 2015-10-13 13:44:41 +11:00
Matthew Honnibal 5ca57bd859 * Ensure Morphology can be pickled, to address Issue #125. 2015-10-13 13:44:41 +11:00
Matthew Honnibal 0cee928467 * Allow StringStore to be pickled, to start addressing Issue #125 2015-10-13 13:44:41 +11:00
Matthew Honnibal 41012907a8 * Fix variable name 2015-10-13 13:44:40 +11:00
Matthew Honnibal e70368d157 * Use lower case strings for dependency label names in symbols enum 2015-10-13 13:44:40 +11:00
Matthew Honnibal 7b4af3d1e7 * Fix parts_of_speech now that symbols list has been reformed 2015-10-13 13:44:40 +11:00
Matthew Honnibal 37b909b6b6 * Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd 2015-10-13 13:44:40 +11:00
Matthew Honnibal ce65ec698c * Remove qualified naming in symbols 2015-10-13 13:44:40 +11:00
Matthew Honnibal 9f4be0adcd * Map NO_TAG to NIL in parts_of_speech.pxd 2015-10-13 13:44:40 +11:00
Matthew Honnibal 278e12f7e8 * Add morphology symbols to morphology. May need to remove these as an enum. 2015-10-13 13:44:40 +11:00
Matthew Honnibal d80067eda1 * Map empty string to NULL_ATTR in attrs 2015-10-13 13:44:40 +11:00
Matthew Honnibal d70e8cac2c * Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore 2015-10-13 13:44:40 +11:00
Matthew Honnibal a29c8ee23d * Add symbols to the vocab before reading the strings, so that they line up correctly 2015-10-13 13:44:39 +11:00
Matthew Honnibal 74c0853471 * Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS 2015-10-13 13:44:39 +11:00
Matthew Honnibal 10a4a843ea * Enumerate all symbols in one file 2015-10-13 13:44:39 +11:00
Matthew Honnibal 85ce36ab11 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-13 13:44:39 +11:00
Matthew Honnibal dfbcff2ff1 * Revert codecs/io change to strings.pyx, as it seemed to cause an error? Will investigate. 2015-10-10 15:54:55 +11:00
Matthew Honnibal 9dd2f25c74 * Fix Issue #131: Force whitespace characters to attach syntactically to previous token, and ensure they cannot serve as stand-alone 'sentence' units. 2015-10-10 15:53:30 +11:00
Matthew Honnibal 8b39feefbe * Add dependency post-process rule to ensure spaces are attached to neighbouring tokens, so that they can't be sentence boundaries 2015-10-10 15:32:13 +11:00
Matthew Honnibal 2153067958 * Fix use of io in strings.pyx 2015-10-10 15:03:12 +11:00
Matthew Honnibal ec874247b5 Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-10-10 14:23:51 +11:00
Matthew Honnibal 30de4135c9 * Fix merge problem 2015-10-10 14:22:32 +11:00
Matthew Honnibal dc393a5f1d Merge pull request #126 from tomtung/master: Improve slicing support for both Doc and Span 2015-10-10 14:14:57 +11:00
Matthew Honnibal 83dccf0fd7 * Use io module instead of deprecated codecs module 2015-10-10 14:13:01 +11:00
Matthew Honnibal a3dfe2b901 * Increment data version 2015-10-09 13:26:17 +02:00
Matthew Honnibal 2d9e5bf566 * Allow punctuation to be lemmatized 2015-10-09 19:02:42 +11:00
Matthew Honnibal 5332c0b697 * Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130 2015-10-09 18:54:40 +11:00
Yubing (Tom) Dong 9a6811acc4 Merge remote-tracking branch 'upstream/master' 2015-10-08 22:53:02 -07:00
Matthew Honnibal b125289f30 * Fix type declaration in asciied function 2015-10-09 13:46:57 +11:00
Matthew Honnibal 801d55a6d9 * Fix phrase matcher 2015-10-09 02:00:45 +11:00
Matthew Honnibal b3a70e6375 * Clean up unnecessary try/except block 2015-10-08 14:34:11 +11:00
Yubing (Tom) Dong 0f601b8b75 Update docstring of Doc.__getitem__ 2015-10-07 01:27:28 -07:00
Yubing (Tom) Dong 3fd3bc79aa Refactor to remove duplicate slicing logic 2015-10-07 01:25:35 -07:00
Yubing (Tom) Dong 97685aecb7 Add slicing support to Span 2015-10-06 02:45:49 -07:00
Yubing (Tom) Dong ef2af20cd3 Make Doc's slicing behavior conform to Python conventions 2015-10-06 02:41:28 -07:00
Yubing (Tom) Dong 2fc33e8024 Allow step=1 when slicing a Doc 2015-10-06 00:57:05 -07:00
Matthew Honnibal b228a8f4a6 * Remove spacy/en/attrs 2015-10-06 16:20:46 +11:00
Matthew Honnibal 693677fd8d * Prepare to remove en/attrx file, now that moving to symbols.pyx 2015-10-06 16:20:13 +11:00
Matthew Honnibal 3d9f41c2c9 * Add LookupError for better error reporting in Vocab 2015-10-06 10:34:59 +11:00
Matthew Honnibal ecc5281b36 * Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx 2015-10-06 10:12:08 +11:00
alvations 8caedba42a caught more codecs.open -> io.open 2015-09-30 20:20:09 +02:00
alvations 8199012d26 changing deprecated codecs.open to io.open =) 2015-09-30 20:10:15 +02:00
Matthew Honnibal 87e6186828 * Rename _seq to doc attribute in Span 2015-09-29 23:03:55 +10:00
Matthew Honnibal ab694b0364 * Fix open-bounded slice indices. 2015-09-29 23:03:09 +10:00
Matthew Honnibal a6ced80c0c * Fix Issue #116: Misleading handling of True value in Language.__init__. 2015-09-29 20:54:12 +10:00
Matthew Honnibal f9d2a5b651 * Fix issue #112: Replace unidecode with text-unidecode, to avoid license problems. 2015-09-28 23:40:18 +10:00
Matthew Honnibal 2c33a96ac3 Merge pull request #99 from rw/patch-1: Force SSL for downloading English language data. 2015-09-28 17:46:26 +10:00
Matthew Honnibal abf0d930af * Fix API for loading word vectors from a file. 2015-09-23 23:51:08 +10:00
Matthew Honnibal f5c256745b Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-09-22 12:26:24 +10:00
Matthew Honnibal 528e26a506 * Add rule to ensure ordinals are preserved as single tokens 2015-09-22 12:26:05 +10:00
Robert 8711b64860 Force SSL for downloading English language data. It would also be nice to have a checksum for this. 2015-09-21 17:26:01 -07:00
Matthew Honnibal f7283a5067 * Fix vectors bugs for OOV words 2015-09-22 02:10:25 +02:00
Matthew Honnibal 44aecba701 * Fix Token.has_vector and Lexeme.has_vector 2015-09-22 01:43:16 +02:00
Matthew Honnibal 596fde8daa * Add has_vector attribute to Token and Lexeme 2015-09-21 19:52:43 +10:00