spaCy/pipeline-functions.md at 2ae8dfbb930c02241832447e7da72b923a0b0df7

Pipeline Functions

Other built-in pipeline components and helpers

Source: spacy/pipeline/functions.py

merge_noun_chunks

merge_entities

merge_subtokens

token_splitter

merge_noun_chunks

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Example

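import spacy
# Assumes a trained English pipeline such as en_core_web_sm is installed.
nlp = spacy.load("en_core_web_sm")

texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a", "blue", "car"]

nlp.add_pipe("merge_noun_chunks")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a blue car"]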

Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe adds a component at the end of the pipeline, after all other components.
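The ordering requirement above can be made explicit with the `after` argument of nlp.add_pipe. The following is a minimal sketch that only inspects the component order: it assumes a blank English pipeline with untrained "tagger", "parser" and "ner" components, which are added purely for illustration and never run.

```python
import spacy

# Blank English pipeline. The components below are untrained and only
# illustrate ordering; the pipeline is never applied to text here.
nlp = spacy.blank("en")
nlp.add_pipe("tagger")
nlp.add_pipe("parser")
nlp.add_pipe("ner")

# Without `after`, the component would be appended after "ner".
# `after="parser"` pins it directly behind the parser instead.
nlp.add_pipe("merge_noun_chunks", after="parser")

print(nlp.pipe_names)
# ['tagger', 'parser', 'merge_noun_chunks', 'ner']
```

Either position satisfies the requirement here, since both come after the tagger and parser. Pinning the position mainly matters when later components should already see the merged tokens.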

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
RETURNS	The modified `Doc` with merged noun chunks. ~~Doc~~
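Because the component is a plain function that takes and returns a `Doc`, it can also be called directly. The sketch below assumes no trained pipeline is available, so the part-of-speech tags, heads and dependency labels are set by hand on a `Doc` built from a blank English vocab. These annotations are illustrative values, not the output of a real parser.

```python
import spacy
from spacy.tokens import Doc
from spacy.pipeline.functions import merge_noun_chunks

# The blank English vocab provides the noun chunk iterator for "en".
nlp = spacy.blank("en")

# Hand-written annotations standing in for tagger and parser output.
doc = Doc(
    nlp.vocab,
    words=["I", "have", "a", "blue", "car"],
    pos=["PRON", "VERB", "DET", "ADJ", "NOUN"],
    heads=[1, 1, 4, 4, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "det", "amod", "dobj"],
)

doc = merge_noun_chunks(doc)
texts = [t.text for t in doc]
print(texts)
# ['I', 'have', 'a blue car']
```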

merge_entities

Merge named entities into a single token. Also available via the string name "merge_entities".

Example

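import spacy
# Assumes a trained English pipeline with an "ner" component,
# such as en_core_web_sm, is installed.
nlp = spacy.load("en_core_web_sm")

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

nlp.add_pipe("merge_entities")

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]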

Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe adds a component at the end of the pipeline, after all other components.

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
RETURNS	The modified `Doc` with merged entities. ~~Doc~~
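The function only relies on `doc.ents`, so any source of entity spans can feed it. As a minimal sketch, the example below sets a single entity by hand instead of running an entity recognizer. The `PERSON` label and the span boundaries are assumptions chosen for illustration.

```python
import spacy
from spacy.tokens import Doc, Span
from spacy.pipeline.functions import merge_entities

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["I", "like", "David", "Bowie"])

# Hand-set entity standing in for the output of an "ner" component.
doc.ents = [Span(doc, 2, 4, label="PERSON")]

doc = merge_entities(doc)
texts = [t.text for t in doc]
print(texts)
# ['I', 'like', 'David Bowie']

# The merged token carries the entity label of the original span.
print(doc[2].ent_type_)
# PERSON
```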

merge_subtokens

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict "subtokens" that should be merged into one single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.

Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.

doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]

nlp.add_pipe("merge_subtokens")
doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']

Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe adds a component at the end of the pipeline, after all other components.

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
`label`	The subtoken dependency label. Defaults to `"subtok"`. ~~str~~
RETURNS	The modified `Doc` with merged subtokens. ~~Doc~~
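Without a model trained to predict subtokens, the merging step can still be tried on a hand-annotated `Doc`. The sketch below assumes a bare `Vocab` and made-up dependency annotations in which the first character is labelled "subtok" and attached to the second. A real parser's output may look different.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.pipeline.functions import merge_subtokens

# Hand-written parse: "拜" is a subtoken attached to "托".
doc = Doc(
    Vocab(),
    words=["拜", "托"],
    spaces=[False, False],  # no whitespace between the characters
    heads=[1, 1],           # absolute index of each token's head
    deps=["subtok", "ROOT"],
)

# `label` defaults to "subtok"; it is passed explicitly for clarity.
doc = merge_subtokens(doc, label="subtok")
texts = [t.text for t in doc]
print(texts)
# ['拜托']
```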

token_splitter

Split tokens that are at least a minimum length into shorter tokens. Intended for use with transformer pipelines where long spaCy tokens lead to input text that exceeds the transformer model's maximum length. See "Managing transformer model max length limitations".

Example

import spacy
nlp = spacy.blank("en")  # assumption: a blank pipeline is enough for this example

config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
doc = nlp("aaaaabbbbbcccccdddddee")
print([token.text for token in doc])
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']

Setting	Description
`min_length`	The minimum length for a token to be split. Defaults to `25`. ~~int~~
`split_length`	The length of the split tokens. Defaults to `5`. ~~int~~
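To illustrate how the two settings interact, the sketch below runs the splitter on a blank English pipeline (an assumption, since no trained components are needed) over one short and one long token. Only the token that reaches `min_length` is split, and the final piece holds whatever remains after the full-length pieces.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("token_splitter", config={"min_length": 20, "split_length": 5})

# "short" (5 characters) is below min_length and is left untouched.
# The 22-character token is cut into pieces of 5, plus a remainder of 2.
doc = nlp("short aaaaabbbbbcccccdddddee")
texts = [token.text for token in doc]
print(texts)
# ['short', 'aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
```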
