title | teaser | source |
---|---|---|
Pipeline Functions | Other built-in pipeline components and helpers | spacy/pipeline/functions.py |
merge_noun_chunks
Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks". After initialization, the component is typically added to the processing pipeline using nlp.add_pipe.
Example
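    # Without the component, every word is its own token
    texts = [t.text for t in nlp(u"I have a blue car")]
    assert texts == ["I", "have", "a", "blue", "car"]

    # Create the component by its string name and add it to the pipeline
    merge_nps = nlp.create_pipe("merge_noun_chunks")
    nlp.add_pipe(merge_nps)

    # The noun chunk "a blue car" is now a single token
    texts = [t.text for t in nlp(u"I have a blue car")]
    assert texts == ["I", "have", "a blue car"]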
Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.
Name | Type | Description |
---|---|---|
doc | Doc | The Doc object to process, e.g. the Doc in the pipeline. |
RETURNS | Doc | The modified Doc with merged noun chunks. |
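To make the effect of merging concrete, here is a minimal plain-Python sketch of what "merging spans into single tokens" means for a list of token texts. It does not use spaCy: merge_spans is a hypothetical helper, the spans are assumed to be sorted, non-overlapping, half-open (start, end) index pairs, and joining with a space is a simplification of how the real component handles whitespace.

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) span of token texts into one string.

    Assumes spans are half-open, sorted and non-overlapping
    (a simplification made for this sketch).
    """
    merged = []
    position = 0
    for start, end in spans:
        # Copy the tokens between the previous span and this one unchanged
        merged.extend(tokens[position:start])
        # Collapse the span itself into a single token text
        merged.append(" ".join(tokens[start:end]))
        position = end
    # Copy whatever follows the last span
    merged.extend(tokens[position:])
    return merged


tokens = ["I", "have", "a", "blue", "car"]
noun_chunks = [(0, 1), (2, 5)]  # "I" and "a blue car"
print(merge_spans(tokens, noun_chunks))  # ['I', 'have', 'a blue car']
```

The real component operates on a Doc and also has to produce merged token attributes, which this sketch leaves out.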
merge_entities
Merge named entities into a single token. Also available via the string name "merge_entities". After initialization, the component is typically added to the processing pipeline using nlp.add_pipe.
Example
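    # Without the component, the entity "David Bowie" spans two tokens
    texts = [t.text for t in nlp(u"I like David Bowie")]
    assert texts == ["I", "like", "David", "Bowie"]

    # Create the component by its string name and add it to the pipeline
    merge_ents = nlp.create_pipe("merge_entities")
    nlp.add_pipe(merge_ents)

    # The entity is now a single token
    texts = [t.text for t in nlp(u"I like David Bowie")]
    assert texts == ["I", "like", "David Bowie"]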
Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.
Name | Type | Description |
---|---|---|
doc | Doc | The Doc object to process, e.g. the Doc in the pipeline. |
RETURNS | Doc | The modified Doc with merged entities. |
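The ordering requirement described above can be illustrated with a toy pipeline in plain Python. This is only a sketch and not spaCy code: the "doc" is a plain dict, and toy_ner, toy_merge_entities and run are hypothetical stand-ins. It shows why a merging step has nothing to merge if it runs before the step that sets the annotations it depends on.

```python
def toy_ner(doc):
    # Stand-in for an entity recognizer: records one entity span (start, end)
    doc["ents"] = [(2, 4)]
    return doc


def toy_merge_entities(doc):
    # Merge every recorded entity span into one token.
    # Spans are processed right to left so earlier indices stay valid.
    tokens = doc["tokens"]
    for start, end in sorted(doc.get("ents", []), reverse=True):
        tokens[start:end] = [" ".join(tokens[start:end])]
    return doc


def run(pipeline, words):
    # Apply each component in order, passing the doc along
    doc = {"tokens": list(words)}
    for component in pipeline:
        doc = component(doc)
    return doc["tokens"]


words = ["I", "like", "David", "Bowie"]
print(run([toy_ner, toy_merge_entities], words))  # ['I', 'like', 'David Bowie']
print(run([toy_merge_entities, toy_ner], words))  # ['I', 'like', 'David', 'Bowie']
```

In the second call the merging step runs first, finds no entities, and leaves the tokens unchanged.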
merge_subtokens
Merge subtokens into a single token. Also available via the string name "merge_subtokens". After initialization, the component is typically added to the processing pipeline using nlp.add_pipe.
As of v2.1, the parser is able to predict "subtokens" that should later be merged into a single token. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.
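The idea of finding runs of "subtok" tokens and collapsing each run can be sketched in plain Python. This is an illustration only, not the component's implementation: it does not use the Matcher, merge_subtok_runs is a hypothetical helper, tokens are assumed to be (text, dependency label) pairs, and joining without whitespace is an assumption that fits the Chinese example in this section.

```python
from itertools import groupby


def merge_subtok_runs(tokens, label="subtok"):
    """Collapse each run of consecutive tokens carrying `label`.

    `tokens` is a list of (text, dep) pairs; returns a list of token texts.
    """
    merged = []
    # groupby yields maximal runs of tokens that share the same key,
    # here whether the dependency label equals `label`
    for is_subtok, run in groupby(tokens, key=lambda tok: tok[1] == label):
        texts = [text for text, _ in run]
        if is_subtok:
            merged.append("".join(texts))  # one token per run of subtokens
        else:
            merged.extend(texts)  # other tokens are kept as they are
    return merged


tokens = [("拜", "subtok"), ("托", "subtok")]
print(merge_subtok_runs(tokens))  # ['拜托']
```

The label parameter mirrors the component's label argument, so a different dependency label could be passed in the same way.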
Example
Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.
    doc = nlp("拜托")
    print([(token.text, token.dep_) for token in doc])
    # [('拜', 'subtok'), ('托', 'subtok')]

    merge_subtok = nlp.create_pipe("merge_subtokens")
    nlp.add_pipe(merge_subtok)

    doc = nlp("拜托")
    print([token.text for token in doc])
    # ['拜托']
Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.
Name | Type | Description |
---|---|---|
doc | Doc | The Doc object to process, e.g. the Doc in the pipeline. |
label | unicode | The subtoken dependency label. Defaults to "subtok". |
RETURNS | Doc | The modified Doc with merged subtokens. |