Commit Graph

299 Commits

Author SHA1 Message Date
Matthew Honnibal 8b7ac77c23 Allow span label to be string in Doc.char_span 2017-08-19 16:18:09 +02:00
Matthew Honnibal 80236116a6 Add Doc.char_span method, to get a span by character offset 2017-08-19 12:21:09 +02:00
Matthew Honnibal a6a2159969 Add slot for text categories to Doc 2017-07-22 00:34:15 +02:00
Matthew Honnibal 2a3bd5ee90 Fix fetching of noun chunk iterator 2017-06-04 15:53:05 -05:00
Matthew Honnibal 92ae36f84e Improve way noun chunks iterator is looked up 2017-06-04 21:53:39 +02:00
Matthew Honnibal 675f448313 Fix vector linkage on Doc 2017-06-04 14:25:30 -05:00
ines 459a1e8470 Fix whitespace 2017-06-03 11:31:18 +02:00
ines 5109bba910 Port over fix from #1070 2017-06-03 11:31:11 +02:00
Matthew Honnibal 498ad85309 Try using tensor for vector/similarity methdos 2017-05-30 23:35:17 +02:00
Matthew Honnibal 4ddff020c3 Fix compile error 2017-05-28 23:30:40 +02:00
Matthew Honnibal 6d3caeadd2 Fix type check for long 2017-05-28 23:22:45 +02:00
Matthew Honnibal 7996d21717 Fixes for new StringStore 2017-05-28 11:09:27 -05:00
Matthew Honnibal fe11564b8e Finish stringstore change. Also xfail vectors tests 2017-05-28 15:10:22 +02:00
Matthew Honnibal 84e66ca6d4 WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
ines 66088851dc Add Doc.to_disk() and Doc.from_disk() methods 2017-05-24 11:58:17 +02:00
Matthew Honnibal d44b1eafc4 Fix conflict artefacts 2017-05-23 18:47:11 +02:00
Matthew Honnibal d68dd1f251 Add SENT_START attribute, for custom sentence boundary detection 2017-05-23 18:37:58 +02:00
ines 23f9a3ccc8 Update docstrings and API docs for Doc 2017-05-19 18:47:39 +02:00
ines 8455cb1327 Update docstring for Doc.__getitem__ 2017-05-19 00:30:51 +02:00
ines b687ad109d Update docstrings and API docs for Doc class 2017-05-18 23:59:44 +02:00
ines b87066ff10 Update docstrings and API docs for Doc class 2017-05-18 22:17:41 +02:00
ines 9d85cda8e4 Fix models error message and use about.__docs_models__ (see #1051) 2017-05-13 13:05:47 +02:00
ines 6b942763f0 Tidy up imports 2017-05-13 13:04:40 +02:00
ines b9dea345e5 Remove old import 2017-05-13 12:32:11 +02:00
ines 293ee359c5 Fix formatting 2017-05-13 12:32:06 +02:00
Matthew Honnibal ee1d35bdb0 Fix merge conflict 2017-05-13 03:20:19 +02:00
Matthew Honnibal b2540d2379 Merge Kengz's tree_print patch 2017-05-13 03:18:49 +02:00
Matthew Honnibal 4efb391994 Fix serializer 2017-05-09 18:45:18 +02:00
Matthew Honnibal 1166b0c491 Implement Doc.to_bytes and Doc.from_bytes methods 2017-05-09 18:11:34 +02:00
Matthew Honnibal 9e167b7bb6 Strip serializer from code 2017-05-09 17:28:50 +02:00
ines 0739ae7b76 Tidy up and fix formatting and imports 2017-04-15 13:05:15 +02:00
ines e71a1f4bd0 Fix download commands in error messages (see #946) 2017-04-01 10:20:57 +02:00
Matthew Honnibal 51882ee2b8 Fix check for setting ent_id in merge 2017-03-31 19:32:01 +02:00
Matthew Honnibal 9720103428 Improve attribute handlign in doc.merge(). Still unsatisfying 2017-03-31 13:59:58 +02:00
Matthew Honnibal 0fefdfcbda Merge pull request #935 from ericzhao28/master
Add option to use label=ent_type in doc.merge arguments (Bug fix for issue #862)
2017-03-30 02:51:24 +02:00
Eric Zhao aafdf6ffb8 Add option to use label karg to determine ent_type in doc.merge 2017-03-28 23:35:03 -07:00
Roman Inflianskas 66e1109b53 Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
Matvey Ezhov 32a22291bc Small `Doc.count_by` documentation update
Current example doesn't work
2017-01-31 19:18:45 +03:00
Matthew Honnibal 6c665b81df Fix redundant == TAG in from_array conditional 2017-01-31 00:46:21 +11:00
Matthew Honnibal 44e2b0100d Support TAG attribute in doc.from_array 2017-01-10 22:47:07 +01:00
kengz 73a38bd4d1 Merge remote-tracking branch 'upstream/master' 2016-12-30 12:19:59 -05:00
kengz da44183ae1 move parse_tree logic to a new tokens/printers.py file 2016-12-30 12:19:18 -05:00
Pokey Rule 3e3bda142d Add noun_chunks to Span 2016-11-24 10:47:20 +00:00
Matthew Honnibal 1fb09c3dc1 Fix morphology tagger 2016-11-04 19:19:09 +01:00
Matthew Honnibal f292f7f0e6 Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy. 2016-11-02 23:48:43 +01:00
Matthew Honnibal e7af6b937f Fix syntax error while fixing doc strings 2016-11-01 13:27:32 +01:00
Matthew Honnibal b86f8af0c1 Fix doc strings 2016-11-01 12:25:36 +01:00
Matthew Honnibal 4ca31b4d87 Fix clobbering of 'missing' named ent values after assigning ents. 2016-10-26 13:13:56 +02:00
Matthew Honnibal 15c9b59f0e Fix Issue #461: O tag was being clobbered by doc.ents.__set__ 2016-10-23 15:50:26 +02:00
Matthew Honnibal 2c3a67b693 Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function. 2016-10-23 14:49:31 +02:00
Matthew Honnibal 3588a18fb8 Fix hook names in doc 2016-10-19 21:15:16 +02:00
Matthew Honnibal 5d5742b773 Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc. 2016-10-19 20:54:22 +02:00
Matthew Honnibal 9b60186266 Fix doc class 2016-10-17 15:23:47 +02:00
Matthew Honnibal b67697a97b Improve API for doc.merge() and span.merge(), to use keyword arguments. 2016-10-17 14:02:13 +02:00
Matthew Honnibal fbb7f3f15c Add user_data attribute to Doc object. 2016-10-17 11:43:22 +02:00
Matthew Honnibal 62230dd13a Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring 2016-10-17 02:42:51 +02:00
Matthew Honnibal 311a985fe0 Add input error handling in Doc 2016-10-16 18:16:42 +02:00
Matthew Honnibal 06322ba99d Add words and spaces keyword arguments to Doc. 2016-10-16 18:13:03 +02:00
Matthew Honnibal 6736977d82 Revert "Changes to Doc and Token for new string store scheme"
This reverts commit 99de44d864.
2016-09-30 20:11:15 +02:00
Matthew Honnibal 99de44d864 Changes to Doc and Token for new string store scheme 2016-09-30 20:00:21 +02:00
Matthew Honnibal d3dc5718b2 Fix syntax error in Doc 2016-09-28 11:39:49 +02:00
Matthew Honnibal 1b520e7bab Improve docstrings for Doc object 2016-09-28 11:15:13 +02:00
Matthew Honnibal fc4a7ad794 Test and fix Issue #411: IndexError when .sents property is used on empty string. 2016-09-27 18:49:14 +02:00
Matthew Honnibal 15e42a1ba9 Allow entities to be set by Span, or by 4-tuple (with entity ID) 2016-09-24 01:17:43 +02:00
Matthew Honnibal 2735b6247b Fix orths_and_spaces in Doc.__init__ 2016-09-21 14:52:05 +02:00
Matthew Honnibal cdc10e9a1c * Fix Issue #375: noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work. 2016-05-20 10:14:06 +02:00
Matthew Honnibal 5d86c30f0b * Fix Issue #367: Missing has_vector property on Doc and Span objects 2016-05-09 12:36:14 +02:00
Matthew Honnibal 76021cb853 * Fix bug in Doc.text, introduced by a862edc 2016-05-04 11:02:16 +02:00
Matthew Honnibal 29a114e645 * Don't assign 0-valued tags in Doc.from_array 2016-05-02 16:07:50 +02:00
Matthew Honnibal 508fd1f6dc * Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples. 2016-05-02 14:25:10 +02:00
Matthew Honnibal 872695759d Merge pull request #306 from wbwseeker/german_noun_chunks
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Matthew Honnibal ad119c074f * Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics. 2016-03-29 13:02:42 +11:00
Wolfgang Seeker d65ef41d08 make error messages language independent 2016-03-24 11:47:09 +01:00
Wolfgang Seeker 5e2e8e951a add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks

the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker 03fb498dbe introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker d9312bc9ea add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators 2016-03-09 16:18:48 +01:00
Matthew Honnibal af8514cb0c * Refine the way the is_parsed attribute is set by from_array 2016-02-06 14:44:35 +01:00
Matthew Honnibal 6bb007d16e * Make set_parse nogil 2016-01-30 20:27:52 +01:00
Matthew Honnibal f24833d607 * Fix merge for coordinations 2016-01-18 16:03:19 +01:00
Matthew Honnibal fc8f26584a * Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203, but might be problematic. Also allow root NPs to be considered noun chunks. 2016-01-16 17:52:40 +01:00
Matthew Honnibal 54a98eaf19 * Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future 2016-01-16 17:13:50 +01:00
Matthew Honnibal a9b612abdf * Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient 2015-11-07 09:01:12 +11:00
Matthew Honnibal 56499d89ef * Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient 2015-11-07 08:55:34 +11:00
Andreas Grivas 562db6d2d0 * merge add lex last - add index finder funcs 2015-11-07 07:57:04 +11:00
Matthew Honnibal 68f479e821 * Rename Doc.data to Doc.c 2015-11-04 00:15:14 +11:00
Matthew Honnibal 9482d616bc * Rename spans.pyx to span.pyx 2015-11-03 23:51:05 +11:00
Matthew Honnibal 116da5990a * Clean up setting of tag in doc.from_bytes 2015-11-03 23:48:57 +11:00
Matthew Honnibal 1e99fcd413 * Rename .repvec to .vector in C API 2015-11-03 23:47:59 +11:00
Matthew Honnibal 9e37437ba8 * Fix assign_tag in doc.merge 2015-11-03 19:07:02 +11:00
Matthew Honnibal 833eb35c57 * Fix tag assignment in doc.from_array 2015-11-03 18:45:54 +11:00
Matthew Honnibal 09664177d7 * Fix tag handling in doc.merge, and assign sent_start when setting heads. 2015-11-03 18:15:52 +11:00
Matthew Honnibal 604ceac4c6 * Fix morphological assignment in doc.merge() 2015-11-03 17:57:51 +11:00
Matthew Honnibal 5e040855a5 * Ensure morphological features and lemmas are loaded in from_array, re Issue #152 2015-11-03 17:56:50 +11:00
Andreas Grivas d418f00eb1 fixed error when printing unicode 2015-11-02 20:23:18 +02:00
Matthew Honnibal 52fc338001 * Set is_parsed and is_tagged attrs when loading annotations into Doc, re Issue #152 2015-10-28 10:43:22 +11:00
Andreas Grivas 93ada458e2 added __repr__ that prints text in ipython for doc, token, and span objects 2015-10-21 14:11:46 +03:00
Matthew Honnibal 135062d23c * Fix error with merged text when merged region did not have trailing whitespace 2015-10-19 15:47:04 +11:00
Matthew Honnibal a7e6c5ac8f * Fix Issue #122: Incorrect calculation of children after Doc.merge() 2015-10-18 17:17:27 +11:00
Matthew Honnibal 94bafc1417 * Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS 2015-10-10 17:57:29 +11:00
Yubing (Tom) Dong 0f601b8b75 Update docstring of Doc.__getitem__ 2015-10-07 01:27:28 -07:00
Yubing (Tom) Dong 3fd3bc79aa Refactor to remove duplicate slicing logic 2015-10-07 01:25:35 -07:00
Yubing (Tom) Dong 2fc33e8024 Allow step=1 when slicing a Doc 2015-10-06 00:57:05 -07:00
Matthew Honnibal ab694b0364 * Fix open-bounded slice indices. 2015-09-29 23:03:09 +10:00
Matthew Honnibal f7283a5067 * Fix vectors bugs for OOV words 2015-09-22 02:10:25 +02:00
Matthew Honnibal f32927efbf * Raise exceptions if attempt to access parse, but data is not installed. This partly but not fully addresses Issue #97. Still need exceptions on the various Token attributes that access the parse tree, e.g. token.head, token.lefts, token.rights, etc. Exceptions should be centralized, too. 2015-09-21 18:35:40 +10:00
Matthew Honnibal 77856c4fcd * Try giving Doc and Span objects vector and vector_norm attributes, and .similarity functions. Turns out to be bad idea. 2015-09-17 11:50:11 +10:00
Matthew Honnibal 60c26b2dfa * Fix slicing when start or stop is None 2015-09-15 14:43:10 +10:00
Matthew Honnibal 65dc0d1dfb * Extend word vectors support, with .similarity() function, vector_norm property, and rename repvec to vector. Keep repvec name as well for now for backwards compatibility. 2015-09-14 17:49:58 +10:00
Matthew Honnibal c08f10083c * Add test and test_with_ws attributes. 2015-09-13 10:27:42 +10:00
Matthew Honnibal 9e7bfe8449 * Fix space at end of merged token 2015-09-10 14:45:17 +02:00
Matthew Honnibal 31ccf494e6 Merge branch 'develop' of https://github.com/honnibal/spaCy into develop 2015-09-09 14:33:38 +02:00
Matthew Honnibal 07686470a9 * Don't consider a coordinated NP a base chunk 2015-09-09 14:32:28 +02:00
Matthew Honnibal 0e24d099a1 * Fix L/R edge bug, by ensuring l_edge and r_edge are preset, and fixing the way the edge update in del_arc. Bugs keep arising here because the edges are absolute positions, where everything else is relative. I'm also not 100% convinced that del_arc is handled correctly. Do we need to update the parents? 2015-09-09 03:40:44 +02:00
Matthew Honnibal 86c888667f * Merge in changes from de branch 2015-09-06 19:49:28 +02:00
Matthew Honnibal d2fc104a26 * Begin merge of Gazetteer and DE branches 2015-09-06 19:45:15 +02:00
Matthew Honnibal fd1eeb3102 * Add POS attribute support in get_attr 2015-09-06 04:13:03 +02:00
Matthew Honnibal c2307fa9ee * More work on language-generic parsing 2015-08-28 02:02:33 +02:00
Matthew Honnibal 6f1743692a * Work on language-independent refactoring 2015-08-23 20:49:18 +02:00
Matthew Honnibal b0f5c39084 * Fix handling of exclusion entities 2015-08-06 17:28:43 +02:00
Matthew Honnibal 10d869d102 * Don't allow conjunction between NPs in base NP chunks 2015-08-06 16:31:53 +02:00
Matthew Honnibal 9c1724ecae * Gazetteer stuff working, now need to wire up to API 2015-08-06 00:35:40 +02:00
Matthew Honnibal eb7138c761 * Add attr relation in base NP detection 2015-08-01 00:34:40 +02:00
Matthew Honnibal 4988356cf0 * Fix dependency type bug from merged tokens 2015-08-01 00:33:24 +02:00
Matthew Honnibal 78a9068319 * Fix spacy attr on merged tokens 2015-07-30 04:25:58 +02:00
Matthew Honnibal 430e2edb96 * Fix noun_chunks issue 2015-07-30 03:51:50 +02:00
Matthew Honnibal 74d8cb3980 * Add noun_chunks iterator, and fix left/right child setting in Doc.merge 2015-07-30 02:29:49 +02:00
Matthew Honnibal b5132bed7d * Set left and right children when loading parse from byte string 2015-07-28 21:03:18 +02:00
Matthew Honnibal aa7a964a4f * Add a type declaration for doc.from_array 2015-07-27 22:57:22 +02:00
Matthew Honnibal 2060935cdb * Remove explicit bytes type in doc.from_bytes, to accept bytearray 2015-07-24 04:54:13 +02:00
Matthew Honnibal 0bb839d299 * Fix string coercion for Python 3 2015-07-24 03:49:30 +02:00
Matthew Honnibal a0e36e8efc * Add working to/from bytes API to Doc 2015-07-23 01:14:45 +02:00
Matthew Honnibal 4d61239eac * Reorganize the serialization functions on Doc 2015-07-22 04:53:01 +02:00
Matthew Honnibal 8743a8c084 * Update Doc serialization for new Packer interface 2015-07-20 01:38:04 +02:00
Matthew Honnibal 317cbbc015 * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. 2015-07-19 15:18:17 +02:00
Matthew Honnibal 6b13e7227c * Remove duplicate get_lex_attr method from doc.pyx 2015-07-18 22:46:07 +02:00
Matthew Honnibal ced59ab9ea * Make minor efficiency improvement in Doc.__iter__ 2015-07-18 04:10:53 +02:00
Matthew Honnibal cf0c788892 * Tests passing on round-trip pack/unpack on basic example 2015-07-17 21:20:48 +02:00
Matthew Honnibal dfdf19f6a9 * Draft a from_orth method for Doc 2015-07-17 16:39:54 +02:00
Matthew Honnibal db9dfd2e23 * Major refactor of serialization. Nearly complete now. 2015-07-17 01:27:54 +02:00
Matthew Honnibal a6f401580d * Add from_array function to Doc. 2015-07-16 17:46:11 +02:00
Matthew Honnibal e2133d990e * Move serialization functionality out into a Serializer object 2015-07-16 11:21:44 +02:00
Matthew Honnibal 01fab6bb90 * Improve de/serialize functions 2015-07-16 01:26:35 +02:00
Matthew Honnibal 0e07c1ed2a * draft de/serialization functions in doc.pyx 2015-07-16 01:16:33 +02:00
Matthew Honnibal 9d956b07e9 * Fix import of attrs in doc.pyx, and update the get_token_attr function. 2015-07-16 01:15:34 +02:00
Matthew Honnibal 935ac53ee3 * Extend count_by method 2015-07-14 03:20:09 +02:00
Matthew Honnibal 81aa4e6dcc * Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API 2015-07-14 00:10:11 +02:00
Matthew Honnibal 8214b74eec * Restore _py_tokens cache, to handle orphan tokens. 2015-07-13 22:28:10 +02:00
Matthew Honnibal 67641f3b58 * Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string 2015-07-13 21:46:02 +02:00
Matthew Honnibal 3ea8756c24 * Add spacy/tokens/doc.pyx, for Doc class in its own file 2015-07-13 19:58:26 +02:00