spaCy/website/docs/api/sentencizer.md

| title | tag | source | teaser | api_string_name | api_trainable |
| --- | --- | --- | --- | --- | --- |
| Sentencizer | class | spacy/pipeline/sentencizer.pyx | Pipeline component for rule-based sentence boundary detection | sentencizer | false |

A simple pipeline component to allow custom sentence boundary detection logic that doesn't require the dependency parse. By default, sentence segmentation is performed by the DependencyParser, so the Sentencizer lets you implement a simpler, rule-based strategy that doesn't require a statistical model to be loaded.

Assigned Attributes

Calculated values will be assigned to Token.is_sent_start. The resulting sentences can be accessed using Doc.sents.

| Location | Value | Type |
| --- | --- | --- |
| Token.is_sent_start | A boolean value indicating whether the token starts a sentence. This will be either True or False for all tokens. | bool |
| Doc.sents | An iterator over sentences in the Doc, determined by Token.is_sent_start values. | Iterator[Span] |
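
As a minimal sketch of how these attributes look in practice, the snippet below assumes a blank English pipeline (tokenizer only, no trained model) with just the sentencizer added. The variable names are illustrative.

```python
from spacy.lang.en import English

# Blank English pipeline: tokenizer only, no statistical model required.
nlp = English()
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")

# Token.is_sent_start is True for tokens that open a sentence, False otherwise.
sent_starts = [token.text for token in doc if token.is_sent_start]

# Doc.sents yields one Span per sentence.
sentences = [sent.text for sent in doc.sents]

print(sent_starts)
print(sentences)
```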

Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training.

Example

config = {"punct_chars": None}
nlp.add_pipe("sentencizer", config=config)

| Setting | Description | Type |
| --- | --- | --- |
| punct_chars | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to None. | Optional[List[str]] |
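
To illustrate the effect of overriding punct_chars, here is a small sketch that assumes a blank English pipeline and restricts sentence-ending characters to "!" only. The sample text is made up for the example.

```python
from spacy.lang.en import English

nlp = English()
# Only "!" ends a sentence here; "." and "?" are no longer treated as boundaries.
nlp.add_pipe("sentencizer", config={"punct_chars": ["!"]})

doc = nlp("Hello world. Still the same sentence! A new one.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

With this setting, the period after "world" should not open a new sentence, so the text is split into two sentences rather than three.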

Sentencizer.__init__

Initialize the sentencizer.

Example

# Construction via add_pipe
sentencizer = nlp.add_pipe("sentencizer")

# Construction from class
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer()

| Name | Description | Type |
| --- | --- | --- |
| keyword-only | | |
| punct_chars | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. | Optional[List[str]] |

### punct_chars defaults
['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።',
 '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫',
 '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉',
 '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈',
 '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '！', '．', '？',
 '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅',
 '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂',
 '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓',
 '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛',
 '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '｡', '。']
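
The defaults cover sentence-final punctuation beyond ASCII. The following sketch constructs the component directly from the class without punct_chars, so the defaults apply, and uses the danda ('।') from the list above. The romanized sample text is invented, and the danda is separated by spaces so the English tokenizer keeps it as a separate token.

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()
sentencizer = Sentencizer()  # no punct_chars given, so the defaults are used

# Tokenize only, then apply the component directly to the Doc.
doc = sentencizer(nlp.make_doc("pahla vakya । dusra vakya ।"))
sentences = [sent.text for sent in doc.sents]
print(len(sentences))
```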

Sentencizer.__call__

Apply the sentencizer to a Doc. Typically, this happens automatically after the component has been added to the pipeline using nlp.add_pipe.

Example

from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| RETURNS | The modified Doc with added sentence boundaries. | Doc |
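
The component can also be called directly on a Doc that has only been tokenized. The sketch below assumes a blank English tokenizer and shows, as far as the rule-based behavior can be observed from the outside, that consecutive punctuation stays with the preceding sentence and that trailing text without final punctuation still forms a sentence.

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()
sentencizer = Sentencizer()

# make_doc only tokenizes; calling the component adds the sentence boundaries.
doc = nlp.make_doc("Really?! Yes. And then")
doc = sentencizer(doc)

sentences = [sent.text for sent in doc.sents]
print(sentences)
```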

Sentencizer.pipe

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.

Example

sentencizer = nlp.add_pipe("sentencizer")
for doc in sentencizer.pipe(docs, batch_size=50):
    pass

| Name | Description | Type |
| --- | --- | --- |
| stream | A stream of documents. | Iterable[Doc] |
| keyword-only | | |
| batch_size | The number of documents to buffer. Defaults to 128. | int |
| YIELDS | The processed documents in order. | Doc |
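
A self-contained variant of the example above might look like the following sketch. It assumes a blank English pipeline and builds the stream with nlp.make_doc, since pipe expects Doc objects rather than raw strings. The texts are placeholders.

```python
from spacy.lang.en import English

nlp = English()
sentencizer = nlp.add_pipe("sentencizer")

texts = [
    "First one. Second one.",
    "Just one sentence.",
    "Is it? Yes it is! Done here.",
]
# Tokenize without running the pipeline, so the Docs have no boundaries yet.
docs = (nlp.make_doc(text) for text in texts)

# Documents are yielded in the same order as they were passed in.
counts = [len(list(doc.sents)) for doc in sentencizer.pipe(docs, batch_size=2)]
print(counts)
```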

Sentencizer.score

Score a batch of examples.

Example

scores = sentencizer.score(examples)

| Name | Description | Type |
| --- | --- | --- |
| examples | The examples to score. | Iterable[Example] |
| RETURNS | The scores, produced by Scorer.score_spans. | Dict[str, Union[float, Dict[str, float]]] |
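
To make the examples argument concrete, here is a hedged sketch that builds a single Example by hand: the predicted side is produced by the pipeline, and the reference side is a tokenized Doc with manually set sentence starts. The token indices 0 and 5 are specific to this sample text, and the "sents_f" key is assumed to be among the returned scores.

```python
from spacy.lang.en import English
from spacy.training import Example

nlp = English()
sentencizer = nlp.add_pipe("sentencizer")
text = "This is a sentence. This is another sentence."

# Predicted side: run the pipeline so the sentencizer sets the boundaries.
predicted = nlp(text)

# Reference side: hand-annotated boundaries (tokens 0 and 5 start sentences).
reference = nlp.make_doc(text)
for i, token in enumerate(reference):
    token.is_sent_start = i in (0, 5)

scores = sentencizer.score([Example(predicted, reference)])
print(scores)
```

Because the predicted and reference boundaries coincide in this toy case, the sentence-level scores should all be perfect.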

Sentencizer.to_disk

Save the sentencizer settings (punctuation characters) to a JSON file, e.g. sentencizer.json. This also happens automatically when you save an nlp object with a sentencizer added to its pipeline.

Example

config = {"punct_chars": [".", "?", "!", "。"]}
sentencizer = nlp.add_pipe("sentencizer", config=config)
sentencizer.to_disk("/path/to/sentencizer.json")

| Name | Description | Type |
| --- | --- | --- |
| path | A path to a JSON file, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. | Union[str, Path] |

Sentencizer.from_disk

Load the sentencizer settings from a file. Expects a JSON file. This also happens automatically when you load an nlp object or model with a sentencizer added to its pipeline.

Example

sentencizer = nlp.add_pipe("sentencizer")
sentencizer.from_disk("/path/to/sentencizer.json")

| Name | Description | Type |
| --- | --- | --- |
| path | A path to a JSON file. Paths may be either strings or Path-like objects. | Union[str, Path] |
| RETURNS | The modified Sentencizer object. | Sentencizer |
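
A round trip through to_disk and from_disk can be sketched as follows. The example uses a temporary directory instead of a fixed path and checks the restored settings through their effect on segmentation rather than by inspecting internal attributes. The sample text and the choice of "!" as the only boundary character are illustrative.

```python
import tempfile
from pathlib import Path

from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()
custom = Sentencizer(punct_chars=["!"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentencizer.json"
    custom.to_disk(path)
    # from_disk modifies the object in place and returns it.
    restored = Sentencizer().from_disk(path)

doc = restored(nlp.make_doc("Hello world. Still the same sentence! A new one."))
sentences = [sent.text for sent in doc.sents]
print(sentences)
```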

Sentencizer.to_bytes

Serialize the sentencizer settings to a bytestring.

Example

config = {"punct_chars": [".", "?", "!", "。"]}
sentencizer = nlp.add_pipe("sentencizer", config=config)
sentencizer_bytes = sentencizer.to_bytes()

| Name | Description | Type |
| --- | --- | --- |
| RETURNS | The serialized data. | bytes |

Sentencizer.from_bytes

Load the pipe from a bytestring. Modifies the object in place and returns it.

Example

sentencizer_bytes = sentencizer.to_bytes()
sentencizer = nlp.add_pipe("sentencizer")
sentencizer.from_bytes(sentencizer_bytes)

| Name | Description | Type |
| --- | --- | --- |
| bytes_data | The bytestring to load. | bytes |
| RETURNS | The modified Sentencizer object. | Sentencizer |
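
The same round trip works in memory with to_bytes and from_bytes. This sketch assumes two separately constructed Sentencizer instances and verifies that the custom punctuation setting survives serialization by checking the resulting segmentation.

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()
custom = Sentencizer(punct_chars=["!"])

# Serialize the settings, then load them into a fresh component.
data = custom.to_bytes()
restored = Sentencizer().from_bytes(data)

doc = restored(nlp.make_doc("Hello world. Still the same sentence! A new one."))
sentences = [sent.text for sent in doc.sents]
print(sentences)
```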