# spaCy/spacy/syntax/iterators.pyx

from spacy.parts_of_speech cimport NOUN, PROPN, PRON


def english_noun_chunks(obj):
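    '''Detect base noun phrases from a dependency parse.

    Works on both Doc and Span. Yields (start, end, label) triples, where
    start and end are token indices into the parent Doc.
    '''
    # Dependency labels that can head a noun phrase. Both casings of the
    # root label are accepted (Issue #469).
    labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
              'attr', 'ROOT', 'root']
    doc = obj.doc  # Ensure works on both Doc and Span.
    # Map the string labels to integer IDs so they can be compared with
    # word.dep. A set gives constant-time membership tests.
    np_deps = set(doc.vocab.strings[label] for label in labels)
    conj = doc.vocab.strings['conj']
    np_label = doc.vocab.strings['NP']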
    for word in obj:
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
            # The chunk runs from the leftmost token of the word's subtree
            # up to and including the word itself.
            yield word.left_edge.i, word.i+1, np_label
        elif word.pos == NOUN and word.dep == conj:
            # Walk leftwards up the chain of conjuncts to the first one.
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i+1, np_label


# This iterator extracts spans headed by a NOUN, PROPN or PRON, running from
# the head's left-most syntactic dependent up to the head itself.
# For close apposition and measurement constructions, the span may be
# extended to the right of the head.
# Example: "eine Tasse Tee" (a cup (of) tea) returns "eine Tasse Tee" and not
# just "eine Tasse"; the same applies to "das Thema Familie".
def german_noun_chunks(obj):
    labels = ['sb', 'oa', 'da', 'nk', 'mo', 'ag', 'ROOT', 'root', 'cj', 'pd',
              'og', 'app']
    doc = obj.doc  # Ensure works on both Doc and Span.
    np_label = doc.vocab.strings['NP']
    np_deps = set(doc.vocab.strings[label] for label in labels)
    close_app = doc.vocab.strings['nk']

    rbracket = 0
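    for word in obj:
        # rbracket is a Doc-level index, so compare it against word.i rather
        # than the position within obj; the two differ when obj is a Span
        # that does not start at token 0.
        if word.i < rbracket:
            continue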
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
            rbracket = word.i+1
            # try to extend the span to the right
            # to capture close apposition/measurement constructions
            for rdep in doc[word.i].rights:
                if rdep.pos in (NOUN, PROPN) and rdep.dep == close_app:
                    rbracket = rdep.i+1
            yield word.left_edge.i, rbracket, np_label


CHUNKERS = {'en': english_noun_chunks, 'de': german_noun_chunks}
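

The right-extension logic of the German iterator can be tried out without a parser. The following is a minimal pure-Python sketch, not spaCy's implementation: `Tok` and `toy_noun_chunks` are hypothetical names, `left_edge` and `rights` are stored as plain indices rather than Token objects, labels are strings rather than vocab IDs, and the parse of "eine Tasse Tee" is written by hand as an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed token; NOT spaCy's Token class.
# left_edge is an index and rights is a list of indices (a simplification).
Tok = namedtuple("Tok", "i text pos dep left_edge rights")


def toy_noun_chunks(tokens, np_deps, close_app="nk"):
    """Yield (start, end, label) triples, mirroring the iterator contract."""
    rbracket = 0
    for tok in tokens:
        # Skip tokens already covered by the previous chunk.
        if tok.i < rbracket:
            continue
        if tok.pos in ("NOUN", "PROPN", "PRON") and tok.dep in np_deps:
            rbracket = tok.i + 1
            # Extend to the right over close appositions such as "Tee".
            for r in tok.rights:
                rdep = tokens[r]
                if rdep.pos in ("NOUN", "PROPN") and rdep.dep == close_app:
                    rbracket = rdep.i + 1
            yield tok.left_edge, rbracket, "NP"


# Hand-written parse of "eine Tasse Tee" (assumed for illustration only).
tokens = [
    Tok(0, "eine", "DET", "nk", 0, []),
    Tok(1, "Tasse", "NOUN", "oa", 0, [2]),
    Tok(2, "Tee", "NOUN", "nk", 2, []),
]
chunks = list(toy_noun_chunks(tokens, np_deps={"oa", "sb", "nk"}))
texts = [" ".join(t.text for t in tokens[s:e]) for s, e, _ in chunks]
print(chunks, texts)
```

With this parse, "Tee" is absorbed into the chunk headed by "Tasse" and is then skipped by the `rbracket` check, so it does not produce a second, nested chunk.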


# Commit history recovered from the blame view (oldest first, times +00:00):
# 2016-03-16 14:53:35  Add base class DocIterator for iterators over
#                      documents; add classes for English and German noun
#                      chunks. The respective iterators are set for the
#                      document when created by the parser, as they depend on
#                      the annotation scheme of the parsing model.
# 2016-05-02 12:25:10  Refactor noun chunk iterators, so that they're simple
#                      functions. Install the iterator when the Doc is
#                      created, but allow users to write to the
#                      noun_chunk_iterator attribute. The iterator functions
#                      accept an object and yield (int start, int end,
#                      int label) triples.
# 2016-05-03 15:19:05  Make the code less cryptic.
# 2016-05-04 05:40:38  Fix whitespace.
# 2016-05-05 22:21:05  Fix Issue #365: Error introduced during noun phrase
#                      chunking, due to use of corrected PRON/PROPN/etc tags.
# 2016-05-05 23:41:26  Add fix for German noun chunk iterator (issue #365).
# 2016-09-27 11:13:01  Fix Issue #469: Incorrectly cased root label in noun
#                      chunk iterator.
# 2016-11-24 10:47:20  Add noun_chunks to Span.
# 2016-11-24 12:30:15  Allow German noun chunks to work on Span: update the
#                      German noun chunks iterator, so that it also works on
#                      Span objects.