Pipeline Functions

Other built-in pipeline components and helpers. Source: spacy/pipeline/functions.py
merge_noun_chunks

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".
Example

```python
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a", "blue", "car"]
nlp.add_pipe("merge_noun_chunks")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a blue car"]
```
Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe will add components to the end of the pipeline and after all other components.
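For illustration, here is a minimal sketch of placing the component explicitly with the after argument of nlp.add_pipe rather than relying on the default position; the pipeline name "en_core_web_sm" is an assumption, any trained pipeline with a tagger and parser works:

```python
import spacy

# Assumes a trained pipeline with a tagger and parser is installed,
# e.g. "en_core_web_sm" (an assumption, not a requirement).
nlp = spacy.load("en_core_web_sm")
# Insert the component right after the parser instead of at the end.
nlp.add_pipe("merge_noun_chunks", after="parser")
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'merge_noun_chunks', ...]
```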
Name | Description
---|---
doc | The Doc object to process, e.g. the Doc in the pipeline.
RETURNS | The modified Doc with merged noun chunks.
merge_entities

Merge named entities into a single token. Also available via the string name "merge_entities".
Example

```python
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]
nlp.add_pipe("merge_entities")
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]
```
Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe will add components to the end of the pipeline and after all other components.
Name | Description
---|---
doc | The Doc object to process, e.g. the Doc in the pipeline.
RETURNS | The modified Doc with merged entities.
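Conceptually, the merge itself is a Doc.retokenize block over doc.ents. The following is a minimal sketch of that idea, carrying over the entity root's tag and dependency the way the built-in does; treat it as an illustration, not the exact registered component:

```python
def merge_entities_sketch(doc):
    # Merge each entity span into a single token. Copying the syntactic
    # root's tag/dep and the entity label onto the merged token mirrors
    # the built-in behavior, but this is a sketch, not the source.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag_, "dep": ent.root.dep_, "ent_type": ent.label_}
            retokenizer.merge(ent, attrs=attrs)
    return doc
```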
merge_subtokens

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict "subtokens" that should be merged into one single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.
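To make that concrete, here is a short sketch of the Matcher-plus-retokenizer logic just described; it is modeled on the built-in behavior but should be read as an illustration rather than the exact source:

```python
from spacy.matcher import Matcher
from spacy.util import filter_spans

def merge_subtokens_sketch(doc, label="subtok"):
    # Find runs of one or more tokens with the "subtok" dependency label.
    matcher = Matcher(doc.vocab)
    matcher.add("SUBTOK", [[{"DEP": label, "OP": "+"}]])
    matches = matcher(doc)
    # Extend each match by one token so the head the subtokens attach to
    # is included, drop overlapping candidates, then merge each span.
    spans = filter_spans([doc[start : end + 1] for _, start, end in matches])
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc
```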
Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.

```python
doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]
nlp.add_pipe("merge_subtokens")
doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']
```
Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe will add components to the end of the pipeline and after all other components.
Name | Description
---|---
doc | The Doc object to process, e.g. the Doc in the pipeline.
label | The subtoken dependency label. Defaults to "subtok".
RETURNS | The modified Doc with merged subtokens.
token_splitter

Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines, where long spaCy tokens lead to input texts that exceed the transformer model's maximum length.
Example

```python
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
doc = nlp("aaaaabbbbbcccccdddddee")
print([token.text for token in doc])
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
```
Setting | Description
---|---
min_length | The minimum length for a token to be split. Defaults to 25.
split_length | The length of the split tokens. Defaults to 5.
RETURNS | The modified Doc with the split tokens.
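One property worth noting (an observation about retokenization in general, not stated on this page): the splitting only changes token boundaries, so the underlying document text is untouched:

```python
# Continuing the example above: the split is purely a retokenization,
# so the document text itself is unchanged and the pieces concatenate
# back to the original token.
assert doc.text == "aaaaabbbbbcccccdddddee"
assert "".join(token.text for token in doc) == "aaaaabbbbbcccccdddddee"
```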
doc_cleaner

Clean up Doc attributes. Intended for use at the end of pipelines with tok2vec or transformer pipeline components that store tensors and other values that can require a lot of memory and frequently aren't needed after the whole pipeline has run.
Example

```python
config = {"attrs": {"tensor": None}}
nlp.add_pipe("doc_cleaner", config=config)
doc = nlp("text")
assert doc.tensor is None
```
Setting | Description
---|---
attrs | A dict of the Doc attributes and the values to set them to. Defaults to {"tensor": None, "_.trf_data": None} to clean up after tok2vec and transformer components.
silent | If False, show warnings if attributes aren't found or can't be set. Defaults to True.
RETURNS | The modified Doc with the modified attributes.
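To show the intended end-of-pipeline usage, here is a hedged sketch; the model name is an assumption, and any trained transformer pipeline would do:

```python
import spacy

# Assumes a trained transformer pipeline is installed, e.g.
# "en_core_web_trf" (an assumption, not a requirement).
nlp = spacy.load("en_core_web_trf")
# Added last by default, so it runs after every component that still
# needs the transformer output.
nlp.add_pipe("doc_cleaner")
docs = list(nlp.pipe(["First text.", "Second text."]))
# With the default attrs, the transformer output is dropped, so keeping
# many processed docs around no longer retains the large tensors.
assert all(doc._.trf_data is None for doc in docs)
```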
span_cleaner

Remove SpanGroups from doc.spans based on a key prefix. This is used to clean up after the CoreferenceResolver when it's paired with a SpanResolver.

This pipeline function is not yet integrated into spaCy core, and is available via the extension package spacy-experimental starting in version 0.6.0. It exposes the component via entry points, so if you have the package installed, using factory = "span_cleaner" in your training config or nlp.add_pipe("span_cleaner") will work out-of-the-box.
Example

```python
config = {"prefix": "coref_head_clusters"}
nlp.add_pipe("span_cleaner", config=config)
doc = nlp("text")
assert "coref_head_clusters_1" not in doc.spans
```
Setting | Description
---|---
prefix | A prefix to check SpanGroup keys for. Any matching groups will be removed. Defaults to "coref_head_clusters".
RETURNS | The modified Doc with any matching spans removed.