Matthew Honnibal
508fd1f6dc
* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.
2016-05-02 14:25:10 +02:00
Matthew Honnibal
872695759d
Merge pull request #306 from wbwseeker/german_noun_chunks
...
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Matthew Honnibal
ad119c074f
* Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics.
2016-03-29 13:02:42 +11:00
Wolfgang Seeker
d65ef41d08
make error messages language independent
2016-03-24 11:47:09 +01:00
Wolfgang Seeker
5e2e8e951a
add baseclass DocIterator for iterators over documents
...
add classes for English and German noun chunks
the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker
03fb498dbe
introduce lang field for LexemeC to hold language id
...
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker
d9312bc9ea
add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators
2016-03-09 16:18:48 +01:00
Matthew Honnibal
af8514cb0c
* Refine the way the is_parsed attribute is set by from_array
2016-02-06 14:44:35 +01:00
Matthew Honnibal
6bb007d16e
* Make set_parse nogil
2016-01-30 20:27:52 +01:00
Matthew Honnibal
f24833d607
* Fix merge for coordinations
2016-01-18 16:03:19 +01:00
Matthew Honnibal
fc8f26584a
* Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203 , but might be problematic. Also allow root NPs to be considered noun chunks.
2016-01-16 17:52:40 +01:00
Matthew Honnibal
54a98eaf19
* Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future
2016-01-16 17:13:50 +01:00
Matthew Honnibal
a9b612abdf
* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient
2015-11-07 09:01:12 +11:00
Matthew Honnibal
56499d89ef
* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient
2015-11-07 08:55:34 +11:00
Andreas Grivas
562db6d2d0
* merge add lex last - add index finder funcs
2015-11-07 07:57:04 +11:00
Matthew Honnibal
68f479e821
* Rename Doc.data to Doc.c
2015-11-04 00:15:14 +11:00
Matthew Honnibal
9482d616bc
* Rename spans.pyx to span.pyx
2015-11-03 23:51:05 +11:00
Matthew Honnibal
116da5990a
* Clean up setting of tag in doc.from_bytes
2015-11-03 23:48:57 +11:00
Matthew Honnibal
1e99fcd413
* Rename .repvec to .vector in C API
2015-11-03 23:47:59 +11:00
Matthew Honnibal
9e37437ba8
* Fix assign_tag in doc.merge
2015-11-03 19:07:02 +11:00
Matthew Honnibal
833eb35c57
* Fix tag assignment in doc.from_array
2015-11-03 18:45:54 +11:00
Matthew Honnibal
09664177d7
* Fix tag handling in doc.merge, and assign sent_start when setting heads.
2015-11-03 18:15:52 +11:00
Matthew Honnibal
604ceac4c6
* Fix morphological assignment in doc.merge()
2015-11-03 17:57:51 +11:00
Matthew Honnibal
5e040855a5
* Ensure morphological features and lemmas are loaded in from_array, re Issue #152
2015-11-03 17:56:50 +11:00
Andreas Grivas
d418f00eb1
fixed error when printing unicode
2015-11-02 20:23:18 +02:00
Matthew Honnibal
52fc338001
* Set is_parsed and is_tagged attrs when loading annotations into Doc, re Issue #152
2015-10-28 10:43:22 +11:00
Andreas Grivas
93ada458e2
added __repr__ that prints text in ipython for doc, token, and span objects
2015-10-21 14:11:46 +03:00
Matthew Honnibal
135062d23c
* Fix error with merged text when merged region did not have trailing whitespace
2015-10-19 15:47:04 +11:00
Matthew Honnibal
a7e6c5ac8f
* Fix Issue #122 : Incorrect calculation of children after Doc.merge()
2015-10-18 17:17:27 +11:00
Matthew Honnibal
94bafc1417
* Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS
2015-10-10 17:57:29 +11:00
Yubing (Tom) Dong
0f601b8b75
Update docstring of Doc.__getitem__
2015-10-07 01:27:28 -07:00
Yubing (Tom) Dong
3fd3bc79aa
Refactor to remove duplicate slicing logic
2015-10-07 01:25:35 -07:00
Yubing (Tom) Dong
2fc33e8024
Allow step=1 when slicing a Doc
2015-10-06 00:57:05 -07:00
Matthew Honnibal
ab694b0364
* Fix open-bounded slice indices.
2015-09-29 23:03:09 +10:00
Matthew Honnibal
f7283a5067
* Fix vectors bugs for OOV words
2015-09-22 02:10:25 +02:00
Matthew Honnibal
f32927efbf
* Raise exceptions if attempt to access parse, but data is not installed. This partly but not fully addresses Issue #97 . Still need exceptions on the various Token attributes that access the parse tree, e.g. token.head, token.lefts, token.rights, etc. Exceptions should be centralized, too.
2015-09-21 18:35:40 +10:00
Matthew Honnibal
77856c4fcd
* Try giving Doc and Span objects vector and vector_norm attributes, and .similarity functions. Turns out to be bad idea.
2015-09-17 11:50:11 +10:00
Matthew Honnibal
60c26b2dfa
* Fix slicing when start or stop is None
2015-09-15 14:43:10 +10:00
Matthew Honnibal
65dc0d1dfb
* Extend word vectors support, with .similarity() function, vector_norm property, and rename repvec to vector. Keep repvec name as well for now for backwards compatibility.
2015-09-14 17:49:58 +10:00
Matthew Honnibal
c08f10083c
* Add test and test_with_ws attributes.
2015-09-13 10:27:42 +10:00
Matthew Honnibal
9e7bfe8449
* Fix space at end of merged token
2015-09-10 14:45:17 +02:00
Matthew Honnibal
31ccf494e6
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
2015-09-09 14:33:38 +02:00
Matthew Honnibal
07686470a9
* Don't consider a coordinated NP a base chunk
2015-09-09 14:32:28 +02:00
Matthew Honnibal
0e24d099a1
* Fix L/R edge bug, by ensuring l_edge and r_edge are preset, and fixing the way the edge update in del_arc. Bugs keep arising here because the edges are absolute positions, where everything else is relative. I'm also not 100% convinced that del_arc is handled correctly. Do we need to update the parents?
2015-09-09 03:40:44 +02:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
d2fc104a26
* Begin merge of Gazetteer and DE branches
2015-09-06 19:45:15 +02:00
Matthew Honnibal
fd1eeb3102
* Add POS attribute support in get_attr
2015-09-06 04:13:03 +02:00
Matthew Honnibal
c2307fa9ee
* More work on language-generic parsing
2015-08-28 02:02:33 +02:00
Matthew Honnibal
6f1743692a
* Work on language-independent refactoring
2015-08-23 20:49:18 +02:00
Matthew Honnibal
b0f5c39084
* Fix handling of exclusion entities
2015-08-06 17:28:43 +02:00