spaCy/website/docs/usage/dependency-parse.jade

//- 💫 DOCS > USAGE > DEPENDENCY PARSE

include ../../_includes/_mixins

p
    |  spaCy features a fast and accurate syntactic dependency parser, and has
    |  a rich API for navigating the tree. The parser also powers the sentence
    |  boundary detection, and lets you iterate over base noun phrases, or
    |  "chunks".

+aside-code("Example").
    import spacy
    nlp = spacy.load('en')
    doc = nlp(u'I like green eggs and ham.')
    for np in doc.noun_chunks:
        print(np.text, np.root.text, np.root.dep_, np.root.head.text)
        # I I nsubj like
        # green eggs eggs dobj like
        # ham ham conj eggs

p
    |  You can check whether a #[+api("doc") #[code Doc]] object has been
    |  parsed with the #[code doc.is_parsed] attribute, which returns a boolean
    |  value. If this attribute is #[code False], the default sentence iterator
    |  will raise an exception.

+h(2, "displacy") The displaCy visualizer

p
    |  The best way to understand spaCy's dependency parser is interactively,
    |  through the displaCy visualizer. If you want to know how to write rules
    |  that hook into some type of syntactic construction, just plug the
    |  sentence into the visualizer and see how spaCy annotates it.

+h(2, "navigating") Navigating the parse tree

p
    |  spaCy uses the terms #[em head] and #[em child] to describe the words
    |  connected by a single arc in the dependency tree. The term #[em dep] is
    |  used for the arc label, which describes the type of syntactic relation
    |  that connects the child to the head. As with other attributes, the value
    |  of #[code token.dep] is an integer. You can get the string value with
    |  #[code token.dep_].

+aside-code("Example").
    from spacy.symbols import det
    the, dog = nlp(u'the dog')
    assert the.dep == det
    assert the.dep_ == 'det'

p
    |  Because the syntactic relations form a tree, every word has exactly one
    |  head. You can therefore iterate over the arcs in the tree by iterating
    |  over the words in the sentence. This is usually the best way to match an
    |  arc of interest — from below:

+code.
    from spacy.symbols import nsubj, VERB
    # Finding a verb with a subject from below — good
    verbs = set()
    for possible_subject in doc:
        if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
            verbs.add(possible_subject.head)

p
    |  If you try to match from above, you'll have to iterate twice: once for
    |  the head, and then again through the children:

+code.
    # Finding a verb with a subject from above — less good
    verbs = []
    for possible_verb in doc:
        if possible_verb.pos == VERB:
            for possible_subject in possible_verb.children:
                if possible_subject.dep == nsubj:
                    verbs.append(possible_verb)
                    break

p
    |  To iterate through the children, use the #[code token.children]
    |  attribute, which provides a sequence of #[+api("token") #[code Token]]
    |  objects.
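
p
    |  As a rough illustration of how the head and child views relate, here is
    |  a minimal sketch in plain Python. It is a toy model, not spaCy's API:
    |  the #[code words] and #[code heads] lists and the #[code children_of]
    |  helper are hypothetical, and the head indices are written by hand.

```python
# Toy model, not spaCy's internals: heads[i] is the index of word i's head.
# The head indices are hand-written for illustration; the root is its own head.
words = ['I', 'like', 'green', 'eggs']
heads = [1, 1, 3, 1]

def children_of(heads):
    """Invert the child -> head mapping into head -> [children]."""
    children = {i: [] for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head != child:  # skip the root's self-loop
            children[head].append(child)
    return children

children = children_of(heads)
like_children = [words[i] for i in children[1]]
print(like_children)
```

p
    |  Each word stores exactly one head, so a single pass over the words is
    |  enough to recover every head's children in sentence order.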

p
    |  A few more convenience attributes are provided for iterating around the
    |  local tree from the token. The #[code .lefts] and #[code .rights]
    |  attributes provide sequences of syntactic children that occur before and
    |  after the token. Both sequences are in sentence order. There are also
    |  two integer-typed attributes, #[code .n_lefts] and #[code .n_rights],
    |  that give the number of left and right children.

+aside-code("Examples").
    apples = nlp(u'bright red apples on the tree')[2]
    print([w.text for w in apples.lefts])
    # ['bright', 'red']
    print([w.text for w in apples.rights])
    # ['on']
    assert apples.n_lefts == 2
    assert apples.n_rights == 1

    from spacy.symbols import nsubj
    doc = nlp(u'Credit and mortgage account holders must submit their requests within 30 days.')
    root = [w for w in doc if w.head is w][0]
    subject = list(root.lefts)[0]
    for descendant in subject.subtree:
        assert subject.is_ancestor(descendant)

    from spacy.symbols import nsubj
    doc = nlp(u'Credit and mortgage account holders must submit their requests.')
    holders = doc[4]
    span = doc[holders.left_edge.i : holders.right_edge.i + 1]
    span.merge()
    for word in doc:
        print(word.text, word.pos_, word.dep_, word.head.text)
        # Credit and mortgage account holders NOUN nsubj submit
        # must VERB aux submit
        # submit VERB ROOT submit
        # their DET det requests
        # requests NOUN dobj submit

p
    |  You can get a whole phrase by its syntactic head using the
    |  #[code .subtree] attribute. This returns an ordered sequence of tokens.
    |  For the default English model, the parse tree is #[em projective], which
    |  means that there are no crossing brackets. The tokens returned by
    |  #[code .subtree] are therefore guaranteed to be contiguous. This is not
    |  true for the German model, which has many
    |  #[+a("https://explosion.ai/blog/german-model#word-order", true) non-projective dependencies].
    |  You can walk up the tree with the #[code .ancestors] attribute, and
    |  check dominance with the #[code .is_ancestor()] method.
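
p
    |  To make the link between projectivity and contiguous subtrees concrete,
    |  here is a minimal sketch in plain Python. It is a toy model, not
    |  spaCy's API: the helper functions are hypothetical and both head lists
    |  are written by hand. It treats a tree as projective when every subtree
    |  covers an unbroken range of word positions.

```python
# Toy model, not spaCy's API: heads[i] is the index of word i's head,
# and the root is its own head. Both head lists are hand-written.

def subtree(heads, i):
    """Return the sorted indices of word i and all of its descendants."""
    found = {i}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in found and child not in found:
                found.add(child)
                changed = True
    return sorted(found)

def is_contiguous(indices):
    """True if the sorted indices form an unbroken range."""
    return indices == list(range(indices[0], indices[-1] + 1))

def is_projective(heads):
    """True if every word's subtree covers a contiguous span."""
    return all(is_contiguous(subtree(heads, i)) for i in range(len(heads)))

projective = [1, 1, 3, 1]      # word 1 is the root
non_projective = [2, 3, 2, 2]  # word 1 attaches to word 3, skipping the root

print(is_projective(projective))
print(is_projective(non_projective))
print(subtree(non_projective, 3))
```

p
    |  In the second tree, the subtree of word 3 contains words 1 and 3 but
    |  not word 2, so slicing from its first to its last token would also pick
    |  up a word that does not belong to the phrase.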

p
    |  Finally, I often find the #[code .left_edge] and #[code .right_edge]
    |  attributes especially useful. They give you the first and last tokens
    |  of the subtree. This is the easiest way to create a #[code Span] object
    |  for a syntactic phrase, which is a useful operation.

p
    |  Note that #[code .right_edge] gives a token #[em within] the subtree —
    |  so if you use it as the end-point of a range, don't forget to #[code +1]!
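
p
    |  The off-by-one is easy to see in a minimal sketch in plain Python. This
    |  is a toy model, not spaCy's API: the #[code edges] helper is
    |  hypothetical and the head indices are written by hand. The edges are
    |  the lowest and highest positions in the subtree, and Python slices
    |  exclude their end index.

```python
# Toy model, not spaCy's API: heads[i] is the index of word i's head,
# and the root is its own head. The head indices are hand-written.
words = ['Credit', 'and', 'mortgage', 'account', 'holders', 'must', 'submit']
heads = [4, 0, 0, 4, 4, 6, 6]

def edges(heads, i):
    """Return the (leftmost, rightmost) indices in word i's subtree."""
    members = {i}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in members and child not in members:
                members.add(child)
                changed = True
    return min(members), max(members)

left, right = edges(heads, 4)          # subtree of 'holders'
without_plus_one = words[left:right]   # drops the right edge itself
with_plus_one = words[left:right + 1]  # covers the whole phrase

print(without_plus_one)
print(with_plus_one)
```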

+h(2, "disabling") Disabling the parser

p
    |  The parser is loaded and enabled by default. If you don't need any of
    |  the syntactic information, you should disable the parser. Disabling the
    |  parser will make spaCy load and run much faster. Here's how to prevent
    |  the parser from being loaded:

+code.
    import spacy

    nlp = spacy.load('en', parser=False)

p
    |  If you need to load the parser, but need to disable it for specific
    |  documents, you can control its use with the #[code parser] keyword
    |  argument:

+code.
    nlp = spacy.load('en')
    doc1 = nlp(u'Text I do want parsed.')
    doc2 = nlp(u"Text I don't want parsed.", parser=False)