spaCy/spacy/tests/tokens/test_noun_chunks.py

import numpy as np

from spacy.attrs import HEAD, DEP
from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root
from spacy.en import English
from spacy.syntax.iterators import english_noun_chunks


def test_not_nested():
    """Regression test for Issue #203: noun chunks must be flat, not nested."""
    nlp = English(parser=False, entity=False)
    sent = u'Peter has chronic command and control issues'
    tokens = nlp(sent)
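    # Set the parse by hand. HEAD values are offsets relative to each token.
    tokens.from_array(
        [HEAD, DEP],
        np.asarray(
            [
                [1, nsubj],    # Peter   -> has
                [0, root],     # has     (root)
                [4, amod],     # chronic -> issues
                [3, nmod],     # command -> issues
                [-1, cc],      # and     -> command
                [-2, conj],    # control -> command
                [-5, dobj]     # issues  -> has
            ], dtype='int32'))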
    tokens.noun_chunks_iterator = english_noun_chunks
    # Count how often each word is covered by a chunk; flat chunks cover it once.
    word_occurred = {}
    for chunk in tokens.noun_chunks:
        for word in chunk:
            word_occurred.setdefault(word.text, 0)
            word_occurred[word.text] += 1
    for word, freq in word_occurred.items():
        assert freq == 1, (word, [chunk.text for chunk in tokens.noun_chunks])


# History:
# 2016-01-16 16:41:25 +00:00  Add test for Issue #203: noun chunks should be flat, but sometimes are nested
# 2016-04-08 14:45:27 +00:00  bugfix in unit test
# 2016-05-02 12:25:10 +00:00  Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.
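
# The refactor commit recorded for this file says iterator functions accept an
# object and yield (int start, int end, int label) triples. The sketch below
# illustrates that contract, and the flatness property the test checks, without
# importing spaCy. Everything in it is an assumption made for illustration:
# `toy_noun_chunks` and `is_flat` are hypothetical names, the tag list is
# hand-written, `end` is treated as exclusive, and the label is a placeholder.
# It is not spaCy's english_noun_chunks implementation.

```python
def toy_noun_chunks(tags):
    """Yield (start, end, label) triples for maximal runs of nominal tags.

    `tags` holds one coarse tag per token; `end` is exclusive (assumption).
    """
    nominal = {'ADJ', 'NOUN', 'PROPN'}
    label = 0  # placeholder integer label, not a real spaCy symbol ID
    start = None
    for i, tag in enumerate(tags):
        if tag in nominal:
            if start is None:
                start = i  # open a new span at the first nominal token
        elif start is not None:
            yield (start, i, label)  # close the span at a non-nominal token
            start = None
    if start is not None:
        yield (start, len(tags), label)  # close a span that runs to the end


def is_flat(triples):
    """Return True if no token index is covered by more than one span."""
    seen = set()
    for start, end, _ in triples:
        span = set(range(start, end))
        if seen & span:
            return False
        seen |= span
    return True


# Hand-written tags for: Peter has chronic command and control issues
tags = ['PROPN', 'VERB', 'ADJ', 'NOUN', 'CCONJ', 'NOUN', 'NOUN']
chunks = list(toy_noun_chunks(tags))
print(chunks, is_flat(chunks))
```

# Unlike the test above, this sketch checks overlap by token index rather than
# by word text, so a word that legitimately occurs twice in a sentence would
# not be reported as nested.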