spaCy/linguistic-features.md at e626df959fdcbf7a5fbc9d24a86af8e093238c82

title

Linguistic Features

/usage/rule-based-matching

POS Tagging

pos-tagging

Morphology

morphology

Lemmatization

lemmatization

Dependency Parse

dependency-parse

Named Entities

named-entities

Entity Linking

entity-linking

Tokenization

tokenization

Merging & Splitting

retokenization

Sentence Segmentation

sbd

Vectors & Similarity

vectors-similarity

Mappings & Exceptions

mappings-exceptions

Language Data

language-data

Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text and get back a Doc object that comes with a variety of annotations.

Part-of-speech tagging

import PosDeps101 from 'usage/101/_pos-deps.md'

For a list of the fine-grained and coarse-grained part-of-speech tags assigned by spaCy's models across different languages, see the label schemes documented in the models directory.

Morphology

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:

| Context | Surface | Lemma | POS | Morphological Features |
| --- | --- | --- | --- | --- |
| I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` |
| I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
| I read the paper yesterday | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |

Morphological features are stored in the MorphAnalysis under Token.morph, which allows you to access individual morphological features.

📝 Things to try

Change "I" to "She". You should see that the morphological features change and express that it's a pronoun in the third person.

Inspect token.morph for the other tokens.

### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was reading the paper.")
token = doc[0]  # 'I'
print(token.morph)  # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
print(token.morph.get("PronType"))  # ['Prs']
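
The string printed for `token.morph` uses the `Feature=Value` notation, with features separated by `|`. As a rough illustration of how such a string breaks down into individual features, here is a minimal plain-Python sketch. The helper name `parse_feats` is hypothetical and not part of spaCy; it only mimics the list-valued lookup shown by `token.morph.get` above.

```python
def parse_feats(feats):
    """Parse a 'Key=Val|Key=Val1,Val2' string into a dict of value lists.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    result = {}
    if not feats:
        return result
    for pair in feats.split("|"):
        key, _, values = pair.partition("=")
        # A feature may carry several comma-separated values
        result[key] = values.split(",")
    return result


feats = parse_feats("Case=Nom|Number=Sing|Person=1|PronType=Prs")
print(feats["PronType"])  # ['Prs']
print(feats.get("Tense", []))  # [] because no Tense feature is present
```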

Statistical morphology

spaCy's statistical Morphologizer component assigns the morphological features and coarse-grained part-of-speech tags as Token.morph and Token.pos.

### {executable="true"}
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?") # English: 'Where are you?'
print(doc[2].morph)  # 'Case=Nom|Number=Sing|Person=2|PronType=Prs'
print(doc[2].pos_) # 'PRON'

Rule-based morphology

For languages with relatively simple morphological systems like English, spaCy can assign morphological features through a rule-based approach, which uses the token text and fine-grained part-of-speech tags to produce coarse-grained part-of-speech tags and morphological features.

1. The part-of-speech tagger assigns each token a fine-grained part-of-speech tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. verb) and some amount of morphological information, e.g. that the verb is past tense (e.g. VBD for a past tense verb in the Penn Treebank).
2. For words whose coarse-grained POS is not set by a prior process, a mapping table maps the fine-grained tags to coarse-grained POS tags and morphological features.

### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
print(doc[2].morph)  # 'Case=Nom|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
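
The mapping-table step described in this section can be pictured as a dictionary lookup keyed on the fine-grained tag. The sketch below is plain Python, not spaCy's implementation: the table `TAG_MAP`, its three entries, the fallback value `X` and the helper `apply_tag_map` are all assumptions made for illustration.

```python
# Hypothetical excerpt of a fine-grained tag -> (coarse POS, features) table
TAG_MAP = {
    "VBD": ("VERB", "Tense=Past|VerbForm=Fin"),
    "NNS": ("NOUN", "Number=Plur"),
    "NN": ("NOUN", "Number=Sing"),
}


def apply_tag_map(token):
    """Fill in 'pos' and 'morph' from 'tag' unless 'pos' is already set."""
    if token.get("pos"):
        # Coarse-grained POS was set by a prior process: leave the token as is
        return token
    # Unknown tags fall back to 'X' with no features (an assumption of this sketch)
    pos, morph = TAG_MAP.get(token["tag"], ("X", ""))
    return {**token, "pos": pos, "morph": morph}


mapped = apply_tag_map({"text": "read", "tag": "VBD"})
print(mapped["pos"], mapped["morph"])  # VERB Tense=Past|VerbForm=Fin
```

Tokens that already carry a coarse-grained POS pass through unchanged, which mirrors the "not set by a prior process" condition above.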

Lemmatization

spaCy provides two pipeline components for lemmatization:

1. The Lemmatizer component provides lookup and rule-based lemmatization methods in a configurable component. An individual language can extend the Lemmatizer as part of its language data.
2. The EditTreeLemmatizer component (new in v3.3) provides a trainable lemmatizer.

### {executable="true"}
import spacy

# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']

Unlike spaCy v2, spaCy v3 models do not provide lemmas by default or switch automatically between lookup and rule-based lemmas depending on whether a tagger is in the pipeline. To have lemmas in a Doc, the pipeline needs to include a Lemmatizer component. The lemmatizer component is configured to use a single mode such as "lookup" or "rule" on initialization. The "rule" mode requires Token.pos to be set by a previous component.

The data for spaCy's lemmatizers is distributed in the package spacy-lookups-data. The provided trained pipelines already include all the required tables, but if you are creating new pipelines, you'll probably want to install spacy-lookups-data to provide the data when the lemmatizer is initialized.

Lookup lemmatizer

For pipelines without a tagger or morphologizer, a lookup lemmatizer can be added to the pipeline as long as a lookup table is provided, typically through spacy-lookups-data. The lookup lemmatizer looks up the token surface form in the lookup table without reference to the token's part-of-speech or context.

# pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS
import spacy

nlp = spacy.blank("sv")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
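
Conceptually, a lookup lemmatizer is a dictionary from surface forms to lemmas. The plain-Python sketch below illustrates that idea only; the table contents and the helper `lookup_lemma` are hypothetical, and returning the form unchanged when it is missing from the table is an assumption of the sketch.

```python
# Hypothetical lookup table: surface form -> lemma
LOOKUP = {"was": "be", "reading": "read", "papers": "paper"}


def lookup_lemma(form, table=LOOKUP):
    """Return the lemma for a surface form, ignoring POS and context."""
    return table.get(form, form)


print([lookup_lemma(w) for w in ["I", "was", "reading", "papers"]])
# ['I', 'be', 'read', 'paper']
```

Because the key is the surface form alone, a form such as "meeting" gets the same lemma whether it is used as a noun or as a verb.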

Rule-based lemmatizer

When training pipelines that include a component that assigns part-of-speech tags (a morphologizer or a tagger with a POS mapping), a rule-based lemmatizer can be added using rule tables from spacy-lookups-data:

# pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS
import spacy

nlp = spacy.blank("de")
# Morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# Rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})

The rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned coarse-grained part-of-speech and morphological information, without consulting the context of the token. The rule-based lemmatizer also accepts list-based exception files. For English, these are acquired from WordNet.
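
To make the interplay of exceptions, rules and part-of-speech concrete, here is a simplified plain-Python sketch. It is not spaCy's implementation: the `EXCEPTIONS` and `RULES` tables and the helper `rule_lemma` are hypothetical, and the sketch lowercases the input and applies the first matching suffix rule, which is a deliberate simplification.

```python
# Hypothetical per-POS exception lists: irregular form -> lemma
EXCEPTIONS = {
    "VERB": {"was": "be", "went": "go"},
    "NOUN": {"mice": "mouse"},
}
# Hypothetical per-POS suffix rules: (suffix to strip, replacement)
RULES = {
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}


def rule_lemma(form, pos):
    """Lemmatize a form given its coarse-grained POS, without context."""
    form = form.lower()
    exceptions = EXCEPTIONS.get(pos, {})
    if form in exceptions:
        return exceptions[form]
    for suffix, replacement in RULES.get(pos, []):
        # Apply the first rule whose suffix matches a longer form
        if form.endswith(suffix) and len(form) > len(suffix):
            return form[: len(form) - len(suffix)] + replacement
    return form


print(rule_lemma("meeting", "VERB"))  # meet
print(rule_lemma("meeting", "NOUN"))  # meeting
print(rule_lemma("was", "VERB"))  # be
```

The same surface form yields different lemmas depending on the assigned part-of-speech, which is why the "rule" mode needs a tagger or morphologizer earlier in the pipeline.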

Trainable lemmatizer

The EditTreeLemmatizer can learn form-to-lemma transformations from a training corpus that includes lemma annotations. This removes the need to write language-specific rules and can (in many cases) provide higher accuracies than lookup and rule-based lemmatizers.

import spacy

nlp = spacy.blank("de")
nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
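
The idea of learning form-to-lemma transformations from annotated pairs can be illustrated with a much simpler stand-in than edit trees. The plain-Python sketch below records suffix edits ("strip this ending, append that one") from training pairs and applies the most frequent edit that fits a new form. All names are hypothetical; the real component uses a richer edit representation and a trained model that takes context into account, so treat this purely as intuition.

```python
import os.path
from collections import Counter


def suffix_edit(form, lemma):
    """Describe form -> lemma as (suffix to strip, suffix to append)."""
    prefix_len = len(os.path.commonprefix([form, lemma]))
    return form[prefix_len:], lemma[prefix_len:]


def learn_edits(pairs):
    """Count the suffix edits observed in (form, lemma) training pairs."""
    return Counter(suffix_edit(form, lemma) for form, lemma in pairs)


def apply_best_edit(form, edits):
    """Apply the most frequent learned edit whose stripped suffix matches."""
    for (strip, append), _count in edits.most_common():
        if strip and form.endswith(strip):
            return form[: len(form) - len(strip)] + append
    return form  # no learned edit applies: keep the form


edits = learn_edits([("walked", "walk"), ("talked", "talk"), ("cities", "city")])
print(apply_best_edit("jumped", edits))  # jump
print(apply_best_edit("parties", edits))  # party
```

No language-specific rule was written by hand here: the edits come entirely from the annotated pairs, which is the property the trainable lemmatizer relies on.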

Dependency Parsing

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers sentence boundary detection, and lets you iterate over base noun phrases, or "chunks". You can check whether a Doc object has been parsed by calling doc.has_annotation("DEP"), which checks whether the attribute Token.dep has been set and returns a boolean value. If the result is False, the default sentence iterator will raise an exception.

For a list of the syntactic dependency labels assigned by spaCy's models across different languages, see the label schemes documented in the models directory.

Noun chunks

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world’s largest tech fund". To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

Text: The original noun chunk text.

Root text: The original text of the word connecting the noun chunk to the rest of the parse.

Root dep: Dependency relation connecting the root to its head.

Root head text: The text of the root token's head.

| Text | root.text | root.dep_ | root.head.text |
| --- | --- | --- | --- |
| Autonomous cars | cars | `nsubj` | shift |
| insurance liability | liability | `dobj` | shift |
| manufacturers | manufacturers | `pobj` | toward |
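
To see how these chunks relate to the dependency parse, the following plain-Python sketch derives them from head indices and dependency labels for the example sentence. It is a heavily simplified approximation and not spaCy's noun-chunk iterator: the `heads` and `deps` lists are written out by hand, only three labels are treated as chunk roots, and part-of-speech is ignored.

```python
tokens = ["Autonomous", "cars", "shift", "insurance", "liability",
          "toward", "manufacturers"]
# Index of each token's head (the root points to itself) and its arc label
heads = [1, 2, 2, 4, 2, 2, 5]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]

# Labels treated as noun-chunk roots in this sketch (an assumption)
CHUNK_DEPS = {"nsubj", "dobj", "pobj"}


def left_edge(i):
    """Index of the leftmost token in the subtree rooted at token i."""
    lefts = [j for j, head in enumerate(heads) if head == i and j < i]
    return left_edge(min(lefts)) if lefts else i


def noun_chunks():
    """Yield (chunk text, root text, root dep, root head text) tuples."""
    for i, dep in enumerate(deps):
        if dep in CHUNK_DEPS:
            text = " ".join(tokens[left_edge(i): i + 1])
            yield text, tokens[i], dep, tokens[heads[i]]


for chunk in noun_chunks():
    print(*chunk)
```

Each chunk runs from the leftmost dependent of the root noun up to the root itself, which is why modifiers such as "Autonomous" and "insurance" are included.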

Navigating the parse tree

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep_.

### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Text: The original token text.

Dep: The syntactic relation connecting child to head.

Head text: The original text of the token head.

Head POS: The part-of-speech tag of the token head.

Children: The immediate syntactic dependents of the token.

| Text | Dep | Head text | Head POS | Children |
| --- | --- | --- | --- | --- |
| Autonomous | `amod` | cars | `NOUN` | |
| cars | `nsubj` | shift | `VERB` | Autonomous |
| shift | `ROOT` | shift | `VERB` | cars, liability, toward |
| insurance | `compound` | liability | `NOUN` | |
| liability | `dobj` | shift | `VERB` | insurance |
| toward | `prep` | shift | `VERB` | manufacturers |
| manufacturers | `pobj` | toward | `ADP` | |
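
The head and child relations are two views of the same arcs: each token stores a single head, and the children of a token are all tokens that name it as their head. The plain-Python sketch below shows that relationship for the example sentence. The `heads` list is written out by hand and the helper names are hypothetical; in spaCy you would read `token.head` and `token.children` directly.

```python
tokens = ["Autonomous", "cars", "shift", "insurance", "liability",
          "toward", "manufacturers"]
# Index of each token's head; the root ("shift") points to itself
heads = [1, 2, 2, 4, 2, 2, 5]


def children_of(heads):
    """Invert the head indices into a token index -> child indices mapping."""
    children = {i: [] for i in range(len(heads))}
    for i, head in enumerate(heads):
        if head != i:  # the root is not its own child
            children[head].append(i)
    return children


def path_to_root(i, heads):
    """Follow head arcs upward from token i and return the indices visited."""
    path = []
    while heads[i] != i:
        i = heads[i]
        path.append(i)
    return path


children = children_of(heads)
print([tokens[c] for c in children[2]])  # ['cars', 'liability', 'toward']
print([tokens[j] for j in path_to_root(6, heads)])  # ['toward', 'shift']
```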

import DisplaCyLong2Html from 'images/displacy-long2.html'
