Linguistic Features
Processing raw text intelligently is difficult: most words are rare, and it's
common for words that look completely different to mean almost the same thing.
The same words in a different order can mean something completely different.
Even splitting text into useful word-like units can be difficult in many
languages. While it's possible to solve some problems starting from only the raw
characters, it's usually better to use linguistic knowledge to add useful
information. That's exactly what spaCy is designed to do: you put in raw text,
and get back a Doc object that comes with a variety of annotations.
Part-of-speech tagging
For a list of the fine-grained and coarse-grained part-of-speech tags assigned by spaCy's models across different languages, see the label schemes documented in the models directory.
Morphology
Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:
Context | Surface | Lemma | POS | Morphological Features |
---|---|---|---|---|
I was reading the paper | reading | read | VERB | VerbForm=Ger |
I don't watch the news, I read the paper | read | read | VERB | VerbForm=Fin, Mood=Ind, Tense=Pres |
I read the paper yesterday | read | read | VERB | VerbForm=Fin, Mood=Ind, Tense=Past |
Morphological features are stored in the MorphAnalysis under Token.morph, which
allows you to access individual morphological features.
📝 Things to try
- Change "I" to "She". You should see that the morphological features change and express that it's a pronoun in the third person.
- Inspect token.morph for the other tokens.
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was reading the paper.")
token = doc[0] # 'I'
print(token.morph) # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
print(token.morph.get("PronType")) # ['Prs']
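The string printed for token.morph above packs several features into a single `Feature=Value|Feature=Value` value. As a rough illustration of how such a string breaks down into individual features, here is a minimal plain-Python sketch. It does not use spaCy, and `parse_feats` is a hypothetical helper written for this example, not part of the MorphAnalysis API.

```python
def parse_feats(feats):
    """Split a 'Key=Value|Key=Value' string into a dict of value lists."""
    result = {}
    if not feats:
        return result
    for pair in feats.split("|"):
        key, _, values = pair.partition("=")
        # A single feature may carry several comma-separated values.
        result[key] = values.split(",")
    return result


feats = parse_feats("Case=Nom|Number=Sing|Person=1|PronType=Prs")
print(feats["PronType"])       # ['Prs']
print(feats.get("Tense", []))  # [] because the pronoun has no Tense feature
```

Returning a list per feature mirrors the shape of the `token.morph.get("PronType")` result shown in the example above.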
Statistical morphology
spaCy's statistical Morphologizer component assigns the morphological features
and coarse-grained part-of-speech tags as Token.morph and Token.pos.
### {executable="true"}
import spacy
nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?") # English: 'Where are you?'
print(doc[2].morph) # 'Case=Nom|Number=Sing|Person=2|PronType=Prs'
print(doc[2].pos_) # 'PRON'
Rule-based morphology
For languages with relatively simple morphological systems like English, spaCy can assign morphological features through a rule-based approach, which uses the token text and fine-grained part-of-speech tags to produce coarse-grained part-of-speech tags and morphological features.
- The part-of-speech tagger assigns each token a fine-grained part-of-speech tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. verb) and some amount of morphological information, e.g. that the verb is past tense (e.g. VBD for a past tense verb in the Penn Treebank).
- For words whose coarse-grained POS is not set by a prior process, a mapping table maps the fine-grained tags to coarse-grained POS tags and morphological features.
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
print(doc[2].morph) # 'Case=Nom|Person=2|PronType=Prs'
print(doc[2].pos_) # 'PRON'
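The mapping step described in the list above can be pictured as a simple table lookup. The following is a plain-Python sketch under stated assumptions: the `TAG_MAP` entries, the `apply_tag_map` helper and the `"X"` fallback are all hypothetical and chosen only for illustration, so they should not be read as the table or logic spaCy actually ships.

```python
# Hypothetical excerpt of a tag map: fine-grained tag -> (coarse POS, features).
TAG_MAP = {
    "VBD": ("VERB", "Tense=Past|VerbForm=Fin"),
    "NNS": ("NOUN", "Number=Plur"),
    "NN": ("NOUN", "Number=Sing"),
}


def apply_tag_map(token):
    """Fill in 'pos' and 'morph' from the fine-grained tag if 'pos' is unset."""
    if token.get("pos"):
        # A prior component already set the coarse-grained POS: keep it as is.
        return token
    # Assumption: unknown tags fall back to "X" with no features.
    pos, morph = TAG_MAP.get(token["tag"], ("X", ""))
    return {**token, "pos": pos, "morph": morph}


print(apply_tag_map({"text": "papers", "tag": "NNS"}))
# {'text': 'papers', 'tag': 'NNS', 'pos': 'NOUN', 'morph': 'Number=Plur'}
print(apply_tag_map({"text": "read", "tag": "VBD", "pos": "AUX"}))
# {'text': 'read', 'tag': 'VBD', 'pos': 'AUX'}
```

The second call shows the condition from the list: when an earlier process has already set the coarse-grained POS, the mapping is skipped.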
Lemmatization
The Lemmatizer is a pipeline component that provides lookup and rule-based
lemmatization methods in a configurable component. An individual language can
extend the Lemmatizer as part of its language data.
### {executable="true"}
import spacy
# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'
doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
Unlike spaCy v2, spaCy v3 models do not provide lemmas by default or switch
automatically between lookup and rule-based lemmas depending on whether a tagger
is in the pipeline. To have lemmas in a Doc, the pipeline needs to include a
Lemmatizer component. The lemmatizer component is configured to use a single
mode such as "lookup" or "rule" on initialization. The "rule" mode requires
Token.pos to be set by a previous component.
The data for spaCy's lemmatizers is distributed in the package
spacy-lookups-data. The provided trained pipelines already include all the
required tables, but if you are creating new pipelines, you'll probably want to
install spacy-lookups-data to provide the data when the lemmatizer is
initialized.
Lookup lemmatizer
For pipelines without a tagger or morphologizer, a lookup lemmatizer can be
added to the pipeline as long as a lookup table is provided, typically through
spacy-lookups-data. The lookup lemmatizer looks up the token surface form in the
lookup table without reference to the token's part-of-speech or context.
# pip install -U spacy[lookups]
import spacy
nlp = spacy.blank("sv")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
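To make the lookup behaviour concrete, here is a minimal plain-Python sketch of the idea. It does not use spaCy or spacy-lookups-data: the `LOOKUP` table and the `lookup_lemma` helper are hypothetical, and the fallback to the surface form for unknown words is an assumption made for this example.

```python
# Hypothetical lookup table: surface form -> lemma.
LOOKUP = {"was": "be", "reading": "read", "papers": "paper"}


def lookup_lemma(text, table=LOOKUP):
    """Return the lemma for a surface form, ignoring POS and context."""
    # Assumption: words missing from the table are returned unchanged.
    return table.get(text, text)


print([lookup_lemma(w) for w in ["I", "was", "reading", "the", "papers"]])
# ['I', 'be', 'read', 'the', 'paper']
```

Because only the surface form is consulted, a word such as "reading" receives the same lemma whether it is used as a verb or as a noun.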
Rule-based lemmatizer
When training pipelines that include a component that assigns part-of-speech
tags (a morphologizer or a tagger with a POS mapping), a rule-based lemmatizer
can be added using rule tables from spacy-lookups-data:
# pip install -U spacy[lookups]
import spacy
nlp = spacy.blank("de")
# Morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# Rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
The rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned coarse-grained part-of-speech and morphological information, without consulting the context of the token. The rule-based lemmatizer also accepts list-based exception files. For English, these are acquired from WordNet.
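The interplay of part-of-speech, suffix rules and exception lists can be sketched in a few lines of plain Python. This is an illustration only: the `SUFFIX_RULES` and `EXCEPTIONS` tables and the `rule_lemma` helper are hypothetical and far smaller than any real rule table, and the sketch does not reproduce spaCy's actual lemmatization procedure.

```python
# Hypothetical rule and exception tables, keyed by coarse-grained POS.
SUFFIX_RULES = {
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}
EXCEPTIONS = {
    "VERB": {"was": "be", "were": "be"},
    "NOUN": {"mice": "mouse"},
}


def rule_lemma(text, pos):
    """Lemmatize a surface form using only its text and coarse-grained POS."""
    # Exceptions take priority over the suffix rules.
    exceptions = EXCEPTIONS.get(pos, {})
    if text in exceptions:
        return exceptions[text]
    # Apply the first suffix rule that matches for this POS.
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if text.endswith(suffix) and len(text) > len(suffix):
            return text[: -len(suffix)] + replacement
    # No rule applies: the surface form is its own lemma.
    return text


print(rule_lemma("meeting", "VERB"))  # meet
print(rule_lemma("meeting", "NOUN"))  # meeting
print(rule_lemma("was", "VERB"))      # be
print(rule_lemma("parties", "NOUN"))  # party
```

The two "meeting" calls show why this mode needs the part-of-speech to be set first: the same surface form maps to different lemmas depending on its POS, while the surrounding context is never consulted.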
Dependency Parsing
spaCy features a fast and accurate syntactic dependency parser, and has a rich
API for navigating the tree. The parser also powers the sentence boundary
detection, and lets you iterate over base noun phrases, or "chunks". You can
check whether a Doc object has been parsed by calling
doc.has_annotation("DEP"), which checks whether the attribute Token.dep has
been set and returns a boolean value. If the result is False, the default
sentence iterator will raise an exception.
For a list of the syntactic dependency labels assigned by spaCy's models across different languages, see the label schemes documented in the models directory.
Noun chunks
Noun chunks are "base noun phrases" – flat phrases that have a noun as their
head. You can think of noun chunks as a noun plus the words describing the noun
– for example, "the lavish green grass" or "the world’s largest tech fund". To
get the noun chunks in a document, simply iterate over Doc.noun_chunks.
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)
- Text: The original noun chunk text.
- Root text: The original text of the word connecting the noun chunk to the rest of the parse.
- Root dep: Dependency relation connecting the root to its head.
- Root head text: The text of the root token's head.
Text | root.text | root.dep_ | root.head.text |
---|---|---|---|
Autonomous cars | cars | nsubj | shift |
insurance liability | liability | dobj | shift |
manufacturers | manufacturers | pobj | toward |
Navigating the parse tree
spaCy uses the terms head and child to describe the words connected by
a single arc in the dependency tree. The term dep is used for the arc
label, which describes the type of syntactic relation that connects the child to
the head. As with other attributes, the value of .dep is a hash value. You can
get the string value with .dep_.
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])
- Text: The original token text.
- Dep: The syntactic relation connecting child to head.
- Head text: The original text of the token head.
- Head POS: The part-of-speech tag of the token head.
- Children: The immediate syntactic dependents of the token.
Text | Dep | Head text | Head POS | Children |
---|---|---|---|---|
Autonomous | amod | cars | NOUN | |
cars | nsubj | shift | VERB | Autonomous |
shift | ROOT | shift | VERB | cars, liability, toward |
insurance | compound | liability | NOUN | |
liability | dobj | shift | VERB | insurance |
toward | prep | shift | VERB | manufacturers |
manufacturers | pobj | toward | ADP | |
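The head and child relations in the table can be reproduced without a parser by storing, for each token, the index of its head. The plain-Python sketch below hard-codes the heads and labels from the table above. The list layout and the `children_of` helper are assumptions made for this illustration and are not how spaCy stores a parse internally.

```python
words = ["Autonomous", "cars", "shift", "insurance", "liability",
         "toward", "manufacturers"]
# Index of each token's head; the root ("shift") points to itself.
heads = [1, 2, 2, 4, 2, 2, 5]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]


def children_of(i):
    """Return the words whose head is token i, in sentence order."""
    return [words[j] for j, h in enumerate(heads) if h == i and j != i]


# The root is the only token that is its own head.
root = next(i for i, h in enumerate(heads) if h == i)
print(words[root])        # shift
print(children_of(root))  # ['cars', 'liability', 'toward']

for i, word in enumerate(words):
    print(word, deps[i], words[heads[i]], children_of(i))
```

Each arc is stored once, on the child, so the children of a token are found by scanning for tokens that name it as their head.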