spaCy/website/docs/api/pipeline-functions.md

title: Pipeline Functions
teaser: Other built-in pipeline components and helpers
source: spacy/pipeline/functions.py
menu: merge_noun_chunks, merge_entities, merge_subtokens

merge_noun_chunks

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Example

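import spacy

# Assumes a trained English pipeline with a tagger and parser, e.g. en_core_web_sm
nlp = spacy.load("en_core_web_sm")
texts = [t.text for t in nlp("I have a blue car")]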
assert texts == ["I", "have", "a", "blue", "car"]

nlp.add_pipe("merge_noun_chunks")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a blue car"]

Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.
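The placement rule can be checked by inspecting nlp.pipe_names. The following is a minimal sketch, assuming spaCy v3 and a blank English pipeline; the untrained "tagger", "parser" and "ner" components are added only to illustrate ordering, and a real pipeline would load trained components instead.

```python
import spacy

# Blank pipeline with untrained placeholder components (illustration only;
# this pipeline is never run on text).
nlp = spacy.blank("en")
nlp.add_pipe("tagger")
nlp.add_pipe("parser")
nlp.add_pipe("ner")

# Place the merger explicitly right after the parser instead of at the end.
nlp.add_pipe("merge_noun_chunks", after="parser")

pipe_names = nlp.pipe_names
print(pipe_names)  # ['tagger', 'parser', 'merge_noun_chunks', 'ner']
```

Passing after="parser" makes the dependency on the parser explicit, which can help when further components are appended later.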

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| RETURNS | The modified Doc with merged noun chunks. | Doc |
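Because the component takes a Doc and returns the modified Doc, it can also be called directly as a function. The sketch below assumes spaCy v3 and imports the function from spacy/pipeline/functions.py; the part-of-speech tags, heads and dependency labels are hand-written stand-ins for what a trained tagger and parser would predict.

```python
import spacy
from spacy.pipeline.functions import merge_noun_chunks
from spacy.tokens import Doc

# A blank English pipeline supplies the vocab and the English noun chunk rules.
nlp = spacy.blank("en")

# Hand-annotated Doc (assumption: these labels mimic tagger/parser output).
doc = Doc(
    nlp.vocab,
    words=["I", "have", "a", "blue", "car"],
    pos=["PRON", "VERB", "DET", "ADJ", "NOUN"],
    heads=[1, 1, 4, 4, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "det", "amod", "dobj"],
)

merged = merge_noun_chunks(doc)
merged_texts = [t.text for t in merged]
print(merged_texts)  # ['I', 'have', 'a blue car']
```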

merge_entities

Merge named entities into a single token. Also available via the string name "merge_entities".

Example

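import spacy

# Assumes a trained English pipeline with an entity recognizer, e.g. en_core_web_sm
nlp = spacy.load("en_core_web_sm")
texts = [t.text for t in nlp("I like David Bowie")]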
assert texts == ["I", "like", "David", "Bowie"]

nlp.add_pipe("merge_entities")

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]

Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.
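The same ordering requirement applies to any component that sets doc.ents. As a minimal sketch that runs without a trained model, the example below swaps the statistical "ner" component for a rule-based "entity_ruler" (assuming spaCy v3); the single PERSON pattern is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")

# Rule-based stand-in for the entity recognizer (illustrative pattern only).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "David Bowie"}])

# Added last, so it runs after the component that sets doc.ents.
nlp.add_pipe("merge_entities")

doc = nlp("I like David Bowie")
texts = [t.text for t in doc]
print(nlp.pipe_names)  # ['entity_ruler', 'merge_entities']
print(texts)           # ['I', 'like', 'David Bowie']
```

The merged token keeps the entity label, so doc[2].ent_type_ is "PERSON" here.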

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| RETURNS | The modified Doc with merged entities. | Doc |

merge_subtokens

Merge subtokens into a single token. Also available via the string name "merge_subtokens".

As of v2.1, the parser can predict "subtokens" that should later be merged into a single token. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters.

Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.
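The Matcher-based approach described above can be sketched roughly as follows. This is a simplified illustration, not the library's exact implementation: merge_dep_runs is a hypothetical helper name, the Doc is annotated by hand instead of by a parser, and spaCy v3 is assumed.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy.vocab import Vocab


def merge_dep_runs(doc, label="subtok"):
    """Hypothetical helper: merge each run of tokens whose dep is `label`."""
    matcher = Matcher(doc.vocab)
    # One or more consecutive tokens carrying the given dependency label.
    matcher.add("DEP_RUN", [[{"DEP": label, "op": "+"}]])
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        # filter_spans keeps the longest non-overlapping spans.
        for span in filter_spans(spans):
            retokenizer.merge(span)
    return doc


# Hand-annotated stand-in for the output of a parser that predicts subtokens.
doc = Doc(
    Vocab(),
    words=["拜", "托", "了"],
    spaces=[False, False, False],
    heads=[1, 2, 2],
    deps=["subtok", "subtok", "ROOT"],
)
merged_texts = [token.text for token in merge_dep_runs(doc)]
print(merged_texts)  # ['拜托', '了']
```

The "+" operator yields every sub-run as a separate match, which is why the overlapping candidates are filtered down to the longest spans before merging.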

Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.

doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]

nlp.add_pipe("merge_subtokens")
doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']

Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| label | The subtoken dependency label. Defaults to "subtok". | str |
| RETURNS | The modified Doc with merged subtokens. | Doc |
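To illustrate the label argument, the function can be called directly with a different dependency label. This is a sketch assuming spaCy v3; "piece" is a made-up label, make_doc is a hypothetical helper, and the Doc is annotated by hand rather than by a parser trained to predict subtokens.

```python
from spacy.pipeline.functions import merge_subtokens
from spacy.tokens import Doc
from spacy.vocab import Vocab


def make_doc(dep_label):
    # Hypothetical helper: both characters carry the given dependency label,
    # standing in for the predictions of a custom parser.
    return Doc(
        Vocab(),
        words=["拜", "托"],
        spaces=[False, False],
        heads=[1, 1],
        deps=[dep_label, dep_label],
    )


# The default label "subtok" matches tokens labelled "subtok".
default_texts = [t.text for t in merge_subtokens(make_doc("subtok"))]

# A custom label has to be passed explicitly.
custom_texts = [t.text for t in merge_subtokens(make_doc("piece"), label="piece")]

# Without the matching label, nothing is merged.
unmerged_texts = [t.text for t in merge_subtokens(make_doc("piece"))]

print(default_texts)   # ['拜托']
print(custom_texts)    # ['拜托']
print(unmerged_texts)  # ['拜', '托']
```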