Commit Graph

232 Commits

Author SHA1 Message Date
Matthew Honnibal 11664b9f20 Fix variable error in token 2016-11-01 13:28:00 +01:00
Matthew Honnibal 8c4d1b46ce Fix variable error in Span 2016-11-01 13:27:44 +01:00
Matthew Honnibal e7af6b937f Fix syntax error while fixing doc strings 2016-11-01 13:27:32 +01:00
Matthew Honnibal b86f8af0c1 Fix doc strings 2016-11-01 12:25:36 +01:00
Matthew Honnibal 4ca31b4d87 Fix clobbering of 'missing' named ent values after assigning ents. 2016-10-26 13:13:56 +02:00
Matthew Honnibal 15c9b59f0e Fix Issue #461: O tag was being clobbered by doc.ents.__set__ 2016-10-23 15:50:26 +02:00
Matthew Honnibal 2c3a67b693 Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function. 2016-10-23 14:49:31 +02:00
Matthew Honnibal e80944276f Fix Span.vector_norm 2016-10-20 21:58:56 +02:00
Matthew Honnibal 3588a18fb8 Fix hook names in doc 2016-10-19 21:15:16 +02:00
Matthew Honnibal 5d5742b773 Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc. 2016-10-19 20:54:22 +02:00
Matthew Honnibal 9b60186266 Fix doc class 2016-10-17 15:23:47 +02:00
Matthew Honnibal 7fd98fc91c Remove deprecation shim around str/bytes in Token. 2016-10-17 14:02:47 +02:00
Matthew Honnibal b67697a97b Improve API for doc.merge() and span.merge(), to use keyword arguments. 2016-10-17 14:02:13 +02:00
Matthew Honnibal fbb7f3f15c Add user_data attribute to Doc object. 2016-10-17 11:43:22 +02:00
Matthew Honnibal c1abc8f6ed Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec 2016-10-17 11:18:41 +02:00
Matthew Honnibal 09ab447a18 Remove tensor property from token. 2016-10-17 02:45:09 +02:00
Matthew Honnibal 5d10e2005c Defer some attributes to Doc, via getters_for_tokens attribute. 2016-10-17 02:44:49 +02:00
Matthew Honnibal 8829984efb Remove tensor attribute from Span and Token. 2016-10-17 02:44:04 +02:00
Matthew Honnibal d15a88c66a Defer some attributes to Doc via getters_for_spans 2016-10-17 02:43:35 +02:00
Matthew Honnibal 62230dd13a Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring 2016-10-17 02:42:51 +02:00
Matthew Honnibal ae11ea8240 Add getters_for_tokens and getters_for_spans attributes to Doc object. 2016-10-17 02:42:05 +02:00
Matthew Honnibal 311a985fe0 Add input error handling in Doc 2016-10-16 18:16:42 +02:00
Matthew Honnibal 06322ba99d Add words and spaces keyword arguments to Doc. 2016-10-16 18:13:03 +02:00
Matthew Honnibal f3be9d0a9a Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs 2016-10-14 03:24:13 +02:00
Matthew Honnibal ca32a1ab01 Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good."
This reverts commit 8423e8627f.
2016-09-30 20:20:22 +02:00
Matthew Honnibal 6736977d82 Revert "Changes to Doc and Token for new string store scheme"
This reverts commit 99de44d864.
2016-09-30 20:11:15 +02:00
Matthew Honnibal 99de44d864 Changes to Doc and Token for new string store scheme 2016-09-30 20:00:21 +02:00
Matthew Honnibal 8423e8627f Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good. 2016-09-30 10:14:47 +02:00
Matthew Honnibal d3dc5718b2 Fix syntax error in Doc 2016-09-28 11:39:49 +02:00
Matthew Honnibal 1b520e7bab Improve docstrings for Doc object 2016-09-28 11:15:13 +02:00
Matthew Honnibal fc4a7ad794 Test and fix Issue #411: IndexError when .sents property is used on empty string. 2016-09-27 18:49:14 +02:00
Matthew Honnibal 15e42a1ba9 Allow entities to be set by Span, or by 4-tuple (with entity ID) 2016-09-24 01:17:43 +02:00
Matthew Honnibal e48df859b5 Fix typedef import in span.pyx 2016-09-23 16:02:28 +02:00
Matthew Honnibal 4de13606fd Fix token.pyx 2016-09-23 15:07:07 +02:00
Matthew Honnibal b4de419e19 Import hash_t typedef in token.pyx 2016-09-23 14:22:06 +02:00
Matthew Honnibal c1a2e96604 Clean up notes at end of token.pyx 2016-09-21 20:45:51 +02:00
Matthew Honnibal 58e83fe34b Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match. 2016-09-21 14:54:55 +02:00
Matthew Honnibal 2735b6247b Fix orths_and_spaces in Doc.__init__ 2016-09-21 14:52:05 +02:00
Matthew Honnibal cdc10e9a1c * Fix Issue #375: noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work. 2016-05-20 10:14:06 +02:00
Matthew Honnibal 5d86c30f0b * Fix Issue #367: Missing has_vector property on Doc and Span objects 2016-05-09 12:36:14 +02:00
Matthew Honnibal 8c0888d6cb * Fix error in span.sent 2016-05-06 00:28:05 +02:00
Matthew Honnibal 26095f9722 * Add span.sent property, re Issue #366 2016-05-06 00:17:38 +02:00
Matthew Honnibal 76021cb853 * Fix bug in Doc.text, introduced by a862edc 2016-05-04 11:02:16 +02:00
Matthew Honnibal 29a114e645 * Don't assign 0-valued tags in Doc.from_array 2016-05-02 16:07:50 +02:00
Matthew Honnibal 276fbe9996 * Fix assignment of iterator on Doc object 2016-05-02 15:26:24 +02:00
Matthew Honnibal 508fd1f6dc * Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples. 2016-05-02 14:25:10 +02:00
Matthew Honnibal 6df3858dbc * Fix Issue #323: Incorrect semantics of Token.__str__ built-in. Add flag to allow users to switch the old semantics back on, to ease transition. 2016-04-12 13:17:59 +10:00
Matthew Honnibal 872695759d Merge pull request #306 from wbwseeker/german_noun_chunks
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Matthew Honnibal 26622f0ffc Merge branch 'master' of ssh://github.com/honnibal/spaCy 2016-03-29 14:31:52 +11:00
Matthew Honnibal ad119c074f * Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics. 2016-03-29 13:02:42 +11:00
Wolfgang Seeker d65ef41d08 make error messages language independent 2016-03-24 11:47:09 +01:00
Wolfgang Seeker 5080077097 revert init_model.py back to pre-german state (because it makes more sense)
simplify token.n_rights and token.n_lefts
2016-03-21 16:10:25 +01:00
Wolfgang Seeker 5e2e8e951a add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks

the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker 2ae253ef5b changed head.__set__ to make it simpler 2016-03-14 13:43:48 +01:00
Wolfgang Seeker 46e3f979f1 add function for setting head and label to token
change PseudoProjectivity.deprojectivize to use these functions
2016-03-11 17:31:06 +01:00
Wolfgang Seeker 03fb498dbe introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker d9312bc9ea add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators 2016-03-09 16:18:48 +01:00
Wolfgang Seeker 3448cb40a4 integrated pseudo-projective parsing into parser
- nonproj.pyx holds a class PseudoProjectivity which currently holds
  all functionality to implement Nivre & Nilsson 2005's pseudo-projective
  parsing using the HEAD decoration scheme
- changed lefts/rights in Token to account for possible non-projective
  structures
2016-03-01 10:09:08 +01:00
Matthew Honnibal af8514cb0c * Refine the way the is_parsed attribute is set by from_array 2016-02-06 14:44:35 +01:00
Matthew Honnibal e66d45bf66 * Restore previous patch to Span.root, as it seems it wasn't the cause of the problem. 2016-02-06 13:37:41 +01:00
Matthew Honnibal 031b00cb91 * Fix Span.root calculation 2016-02-05 20:12:09 +01:00
Matthew Honnibal e5c447e237 * Questionable fix to problem in Span.root 2016-02-05 19:18:35 +01:00
Matthew Honnibal 1ef84a0557 * Merge master into rethinc2 2016-02-05 12:55:59 +01:00
Matthew Honnibal 6aa92b70f1 * Fix merge problem in span 2016-02-05 12:46:11 +01:00
Matthew Honnibal 419edfab50 * Use generic flags for the new attributes until they're added 2016-02-04 15:50:54 +01:00
Matthew Honnibal 11810be33e * Add Python hooks for is_bracket/is_quote/is_left_punct/is_right_punct 2016-02-04 13:04:16 +01:00
Matthew Honnibal 4cbad510ff * Fix calculation of head for spans with punctuation. 2016-02-03 02:32:21 +01:00
Matthew Honnibal 6bb007d16e * Make set_parse nogil 2016-01-30 20:27:52 +01:00
Matthew Honnibal 87172a15c6 * Fix runtime error bug that arose from updated Span.root function. 2016-01-25 15:22:42 +01:00
Matthew Honnibal 334c4b2b57 * Disprefer punctuation and spaces as heads of spans 2016-01-18 18:14:09 +01:00
Matthew Honnibal c107da9738 * Bug fix to _count_words_to_root 2016-01-18 16:59:38 +01:00
Matthew Honnibal f24833d607 * Fix merge for coordinations 2016-01-18 16:03:19 +01:00
Matthew Honnibal 14534958a9 * Fix bug in Span.root 2016-01-18 15:40:28 +01:00
Matthew Honnibal fc8f26584a * Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203, but might be problematic. Also allow root NPs to be considered noun chunks. 2016-01-16 17:52:40 +01:00
Matthew Honnibal 995b2d18fd * Route token.string via token.txt_with_ws, to deprecate token.string in future 2016-01-16 17:14:34 +01:00
Matthew Honnibal 54a98eaf19 * Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future 2016-01-16 17:13:50 +01:00
Matthew Honnibal 03e8a4293d * Add loop guard to Token.lefts and Token.rights properties 2016-01-16 16:18:17 +01:00
Matthew Honnibal 304339985e * Add a linear scan to Span.root method, to help with long sentences 2016-01-16 16:17:28 +01:00
Matthew Honnibal 8cbcc3a799 * Fix calculation of root token in Span. Now take root to be word with shortest tree path. Avoids parse trees ending up in inconsistent state, as had occurred in Issue #214. 2016-01-16 15:38:50 +01:00
Matthew Honnibal 42a9f29b40 * Add loop guard in Span.root, to raise errors if there is a cycle in the dependency parse, instead of entering an infinite loop. Re Issue #214 2016-01-16 11:53:37 +01:00
Matthew Honnibal ab5aac5b2f * Add .rank property to Token and Lexeme, for frequency rank 2015-11-08 16:18:25 +01:00
Matthew Honnibal 7663970d5f * Removed unused i variable from Span, and set attributes to read-only 2015-11-07 17:06:15 +11:00
Matthew Honnibal 4b3c96d76d * Fix zero-length spans 2015-11-07 17:05:16 +11:00
Matthew Honnibal cc8febcbe1 * Fix Span comparison 2015-11-07 09:54:14 +11:00
Matthew Honnibal a9b612abdf * Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient 2015-11-07 09:01:12 +11:00
Matthew Honnibal 56499d89ef * Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient 2015-11-07 08:55:34 +11:00
Andreas Grivas 4be7fda453 * span start, end -> properties. autoupdate after merge 2015-11-07 07:57:04 +11:00
Andreas Grivas 562db6d2d0 * merge add lex last - add index finder funcs 2015-11-07 07:57:04 +11:00
Matthew Honnibal 68f479e821 * Rename Doc.data to Doc.c 2015-11-04 00:15:14 +11:00
Matthew Honnibal 3ddea19b2b * Rename spans.pyx to span.pyx 2015-11-04 00:14:40 +11:00
Matthew Honnibal 9482d616bc * Rename spans.pyx to span.pyx 2015-11-03 23:51:05 +11:00
Matthew Honnibal 116da5990a * Clean up setting of tag in doc.from_bytes 2015-11-03 23:48:57 +11:00
Matthew Honnibal 1e99fcd413 * Rename .repvec to .vector in C API 2015-11-03 23:47:59 +11:00
Matthew Honnibal 9e37437ba8 * Fix assign_tag in doc.merge 2015-11-03 19:07:02 +11:00
Matthew Honnibal 833eb35c57 * Fix tag assignment in doc.from_array 2015-11-03 18:45:54 +11:00
Matthew Honnibal 09664177d7 * Fix tag handling in doc.merge, and assign sent_start when setting heads. 2015-11-03 18:15:52 +11:00
Matthew Honnibal 604ceac4c6 * Fix morphological assignment in doc.merge() 2015-11-03 17:57:51 +11:00
Matthew Honnibal 5e040855a5 * Ensure morphological features and lemmas are loaded in from_array, re Issue #152 2015-11-03 17:56:50 +11:00
Matthew Honnibal 6161d2529a Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-11-03 13:36:30 +11:00
Matthew Honnibal f7dd377575 * Adjust conjuncts iterator in Token 2015-11-03 13:23:22 +11:00