title | tag | source |
---|---|---|
Sentencizer | class | spacy/pipeline/pipes.pyx |
A simple pipeline component to allow custom sentence boundary detection logic that doesn't require the dependency parse. By default, sentence segmentation is performed by the DependencyParser, so the Sentencizer lets you implement a simpler, rule-based strategy that doesn't require a statistical model to be loaded. The component is also available via the string name "sentencizer". After initialization, it is typically added to the processing pipeline using nlp.add_pipe.
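For example, a blank pipeline plus the sentencizer is enough to get doc.sents without loading any model. A minimal sketch, assuming spaCy v2.x, where nlp.add_pipe takes the component object:

```python
import spacy

nlp = spacy.blank("en")                       # tokenizer only, no statistical model
sentencizer = nlp.create_pipe("sentencizer")  # created via the string name
nlp.add_pipe(sentencizer)

doc = nlp("This is a sentence. This is another sentence.")
print([sent.text for sent in doc.sents])
```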
Compared to the previous SentenceSegmenter class, the Sentencizer component doesn't add a hook to doc.user_hooks["sents"]. Instead, it iterates over the tokens in the Doc and sets the Token.is_sent_start property. The SentenceSegmenter is still available if you import it directly:

```python
from spacy.pipeline import SentenceSegmenter
```
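To see what the component actually does, here is a short sketch inspecting the Token.is_sent_start values after the sentencizer has run (the example text is arbitrary):

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
doc = nlp("First sentence. Second sentence.")

# Tokens that open a sentence have is_sent_start == True
print([(token.text, token.is_sent_start) for token in doc])
```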
Sentencizer.__init__
Initialize the sentencizer.
Example
```python
# Construction via create_pipe
sentencizer = nlp.create_pipe("sentencizer")

# Construction from class
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer()
```
Name | Type | Description |
---|---|---|
punct_chars | list | Optional custom list of punctuation characters that mark sentence ends. Defaults to ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '。', '。']. |
RETURNS | Sentencizer | The newly constructed object. |
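As a sketch of the punct_chars argument (assuming spaCy v2.x behavior, where a custom list is used instead of the defaults), sentence boundaries can also be placed after characters such as semicolons:

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()
# A custom list replaces the defaults, so include the standard characters too
sentencizer = Sentencizer(punct_chars=[".", "!", "?", ";"])
nlp.add_pipe(sentencizer)

doc = nlp("One clause; another clause.")
assert len(list(doc.sents)) == 2
```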
Sentencizer.__call__
Apply the sentencizer on a Doc. Typically, this happens automatically after the component has been added to the pipeline using nlp.add_pipe.
Example
```python
from spacy.lang.en import English

nlp = English()
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2
```
Name | Type | Description |
---|---|---|
doc | Doc | The Doc object to process, e.g. the Doc in the pipeline. |
RETURNS | Doc | The modified Doc with added sentence boundaries. |
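The component can also be called directly on an existing Doc, which is what the pipeline does for you automatically. A sketch:

```python
from spacy.lang.en import English

nlp = English()
sentencizer = nlp.create_pipe("sentencizer")

doc = nlp("This is a sentence. This is another sentence.")  # tokenization only
doc = sentencizer(doc)  # sets sentence boundaries and returns the Doc
assert len(list(doc.sents)) == 2
```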
Sentencizer.to_disk
Save the sentencizer settings (punctuation characters) to a directory. Will create a file sentencizer.json. This also happens automatically when you save an nlp object with a sentencizer added to its pipeline.
Example
```python
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer.to_disk("/path/to/sentencizer.json")
```
Name | Type | Description |
---|---|---|
path | unicode / Path | A path to a file, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. |
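As noted above, the same serialization happens when the whole pipeline is saved, with the settings written as sentencizer.json inside the model directory. A sketch (the output path is a placeholder):

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))

# Saving the pipeline also writes the sentencizer settings (sentencizer.json).
# "/path/to/model" is a hypothetical directory.
nlp.to_disk("/path/to/model")
```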
Sentencizer.from_disk
Load the sentencizer settings from a file. Expects a JSON file. This also happens automatically when you load an nlp object or model with a sentencizer added to its pipeline.
Example
```python
sentencizer = Sentencizer()
sentencizer.from_disk("/path/to/sentencizer.json")
```
Name | Type | Description |
---|---|---|
path | unicode / Path | A path to a JSON file. Paths may be either strings or Path-like objects. |
RETURNS | Sentencizer | The modified Sentencizer object. |
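Correspondingly, loading a saved pipeline restores the sentencizer together with its settings. A sketch (the model path is a placeholder directory previously created with nlp.to_disk):

```python
import spacy

nlp = spacy.load("/path/to/model")
assert "sentencizer" in nlp.pipe_names

doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2
```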
Sentencizer.to_bytes
Serialize the sentencizer settings to a bytestring.
Example
```python
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
sentencizer_bytes = sentencizer.to_bytes()
```
Name | Type | Description |
---|---|---|
RETURNS | bytes | The serialized data. |
Sentencizer.from_bytes
Load the pipe from a bytestring. Modifies the object in place and returns it.
Example
```python
sentencizer_bytes = sentencizer.to_bytes()
sentencizer = Sentencizer()
sentencizer.from_bytes(sentencizer_bytes)
```
Name | Type | Description |
---|---|---|
bytes_data | bytes | The bytestring to load. |
RETURNS | Sentencizer | The modified Sentencizer object. |
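The byte round trip preserves the punctuation settings. A short sketch checking that a custom character survives, assuming the characters are exposed on the punct_chars attribute:

```python
from spacy.pipeline import Sentencizer

source = Sentencizer(punct_chars=[".", "?", "!", "。"])
restored = Sentencizer().from_bytes(source.to_bytes())  # from_bytes returns the object

# The custom punctuation characters survive the round trip
assert "。" in restored.punct_chars
```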