5.0 KiB
title | teaser | source | menu | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Pipeline Functions | Other built-in pipeline components and helpers | spacy/pipeline/functions.py |
|
merge_noun_chunks
Merge noun chunks into a single token. Also available via the string name
"merge_noun_chunks"
.
Example
texts = [t.text for t in nlp("I have a blue car")] assert texts == ["I", "have", "a", "blue", "car"] nlp.add_pipe("merge_noun_chunks") texts = [t.text for t in nlp("I have a blue car")] assert texts == ["I", "have", "a blue car"]
Since noun chunks require part-of-speech tags and the dependency parse, make
sure to add this component after the "tagger"
and "parser"
components. By
default, nlp.add_pipe
will add components to the end of the pipeline and after
all other components.
Name | Description |
---|---|
doc |
The Doc object to process, e.g. the Doc in the pipeline. |
RETURNS | The modified Doc with merged noun chunks. |
merge_entities
Merge named entities into a single token. Also available via the string name
"merge_entities"
.
Example
texts = [t.text for t in nlp("I like David Bowie")] assert texts == ["I", "like", "David", "Bowie"] nlp.add_pipe("merge_entities") texts = [t.text for t in nlp("I like David Bowie")] assert texts == ["I", "like", "David Bowie"]
Since named entities are set by the entity recognizer, make sure to add this
component after the "ner"
component. By default, nlp.add_pipe
will add
components to the end of the pipeline and after all other components.
Name | Description |
---|---|
doc |
The Doc object to process, e.g. the Doc in the pipeline. |
RETURNS | The modified Doc with merged entities. |
merge_subtokens
Merge subtokens into a single token. Also available via the string name
"merge_subtokens"
. As of v2.1, the parser is able to predict "subtokens" that
should be merged into one single token later on. This is especially relevant for
languages like Chinese, Japanese or Korean, where a "word" isn't defined as a
whitespace-delimited sequence of characters. Under the hood, this component uses
the Matcher
to find sequences of tokens with the dependency
label "subtok"
and then merges them into a single token.
Example
Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.
doc = nlp("拜托") print([(token.text, token.dep_) for token in doc]) # [('拜', 'subtok'), ('托', 'subtok')] nlp.add_pipe("merge_subtokens") doc = nlp("拜托") print([token.text for token in doc]) # ['拜托']
Since subtokens are set by the parser, make sure to add this component after
the "parser"
component. By default, nlp.add_pipe
will add components to the
end of the pipeline and after all other components.
Name | Description |
---|---|
doc |
The Doc object to process, e.g. the Doc in the pipeline. |
label |
The subtoken dependency label. Defaults to "subtok" . |
RETURNS | The modified Doc with merged subtokens. |
token_splitter
Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines where long spaCy tokens lead to input text that exceed the transformer model max length. See managing transformer model max length limitations.
Example
config={"min_length": 20, "split_length": 5} nlp.add_pipe("token_splitter", config=config, first=True) doc = nlp("aaaaabbbbbcccccdddddee") print([token.text for token in doc]) # ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
Setting | Description |
---|---|
min_length |
The minimum length for a token to be split. Defaults to 25 . |
split_length |
The length of the split tokens. Defaults to 5 . |