
---
title: Sentencizer
tag: class
source: spacy/pipeline/pipes.pyx
---

A simple pipeline component that allows custom sentence boundary detection logic without requiring the dependency parse. By default, sentence segmentation is performed by the `DependencyParser`, so the `Sentencizer` lets you implement a simpler, rule-based strategy that doesn't require a statistical model to be loaded. The component is also available via the string name `"sentencizer"`. After initialization, it is typically added to the processing pipeline using `nlp.add_pipe`.

Compared to the previous `SentenceSegmenter` class, the `Sentencizer` component doesn't add a hook to `doc.user_hooks["sents"]`. Instead, it iterates over the tokens in the `Doc` and sets the `Token.is_sent_start` property. The `SentenceSegmenter` is still available if you import it directly:

```python
from spacy.pipeline import SentenceSegmenter
```
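Because the boundaries are written to `Token.is_sent_start`, `doc.sents` works as soon as the component has run. A minimal sketch (assuming spaCy v2.x as documented here; the example text is made up):

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()  # blank pipeline, no statistical model loaded
sentencizer = Sentencizer()
doc = sentencizer(nlp("Hello world. This is spaCy."))

# The token after each sentence-final "." is marked as a sentence start,
# so doc.sents yields two spans.
assert doc[3].is_sent_start
assert [sent.text for sent in doc.sents] == ["Hello world.", "This is spaCy."]
```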

## Sentencizer.__init__

Initialize the sentencizer.

Example

```python
# Construction via create_pipe
sentencizer = nlp.create_pipe("sentencizer")

# Construction from class
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer()
```
| Name | Type | Description |
| ------------- | ------------- | ----------- |
| `punct_chars` | list | Optional custom list of punctuation characters that mark sentence ends. Defaults to `['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '', '', '᜶', '', '', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '', '', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '', '', '', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '。', '。']`. |
| **RETURNS** | `Sentencizer` | The newly constructed object. |
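To illustrate the effect of a custom `punct_chars` list, here's a small sketch (assuming spaCy v2.x; the `"|"` delimiter is just a made-up example and replaces the defaults entirely):

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()
# Hypothetical setup: treat only "|" as a sentence-final marker.
# The default characters (".", "!", "?", ...) no longer split sentences.
sentencizer = Sentencizer(punct_chars=["|"])
doc = sentencizer(nlp("first clause | second clause | third clause"))
assert len(list(doc.sents)) == 3
```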

## Sentencizer.__call__

Apply the sentencizer on a `Doc`. Typically, this happens automatically after the component has been added to the pipeline using `nlp.add_pipe`.

Example

```python
from spacy.lang.en import English

nlp = English()
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2
```
| Name | Type | Description |
| ----------- | ----- | ----------- |
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with added sentence boundaries. |

## Sentencizer.to_disk

Save the sentencizer settings (punctuation characters) to a directory. Will create a file `sentencizer.json`. This also happens automatically when you save an `nlp` object with a sentencizer added to its pipeline.

Example

```python
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer.to_disk("/path/to/sentencizer.json")
```
| Name | Type | Description |
| ------ | ---------------- | ----------- |
| `path` | unicode / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |

## Sentencizer.from_disk

Load the sentencizer settings from a file. Expects a JSON file. This also happens automatically when you load an `nlp` object or model with a sentencizer added to its pipeline.

Example

```python
sentencizer = Sentencizer()
sentencizer.from_disk("/path/to/sentencizer.json")
```
| Name | Type | Description |
| ----------- | ---------------- | ----------- |
| `path` | unicode / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
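A save/load round trip can be sketched end to end (assuming spaCy v2.x; the temporary directory just stands in for a real path):

```python
import tempfile
from pathlib import Path

from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=[".", "?", "!"])
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentencizer.json"
    sentencizer.to_disk(path)                  # writes the settings as JSON
    restored = Sentencizer().from_disk(path)   # loads them into a fresh component

# The restored component has the same punctuation settings.
assert sorted(restored.punct_chars) == sorted(sentencizer.punct_chars)
```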

## Sentencizer.to_bytes

Serialize the sentencizer settings to a bytestring.

Example

```python
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer_bytes = sentencizer.to_bytes()
```
| Name | Type | Description |
| ----------- | ----- | ----------- |
| **RETURNS** | bytes | The serialized data. |

## Sentencizer.from_bytes

Load the pipe from a bytestring. Modifies the object in place and returns it.

Example

```python
sentencizer_bytes = sentencizer.to_bytes()
sentencizer = Sentencizer()
sentencizer.from_bytes(sentencizer_bytes)
```
| Name | Type | Description |
| ------------ | ------------- | ----------- |
| `bytes_data` | bytes | The bytestring to load. |
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
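As with the disk serializers, a bytes round trip preserves the punctuation settings. A minimal sketch (assuming spaCy v2.x as documented here):

```python
from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=[".", "!", "?", "。"])
data = sentencizer.to_bytes()               # serialize the settings
restored = Sentencizer().from_bytes(data)   # returns the modified object
assert sorted(restored.punct_chars) == sorted(sentencizer.punct_chars)
```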