---
title: Sentencizer
tag: class
source: spacy/pipeline/sentencizer.pyx
teaser: 'Pipeline component for rule-based sentence boundary detection'
api_base_class: /api/pipe
api_string_name: sentencizer
api_trainable: false
---

A simple pipeline component to allow custom sentence boundary detection logic
that doesn't require the dependency parse. By default, sentence segmentation is
performed by the [`DependencyParser`](/api/dependencyparser), so the
`Sentencizer` lets you implement a simpler, rule-based strategy that doesn't
require a statistical model to be loaded.
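
The rule-based idea can be sketched in plain Python. The function below is a simplified illustration, not spaCy's implementation: `sentence_starts` is a hypothetical helper, it works on a list of token strings rather than a `Doc`, and it assumes that a new sentence begins at the first token that follows one or more sentence-final punctuation characters.

```python
def sentence_starts(tokens, punct_chars=(".", "!", "?")):
    """Return one boolean per token, True where a sentence starts.

    Hypothetical helper for illustration only.
    """
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True  # the first token always opens a sentence
    seen_end = False
    for i, token in enumerate(tokens):
        is_end = token in punct_chars
        if seen_end and not is_end:
            # First non-terminal token after sentence-final punctuation
            starts[i] = True
            seen_end = False
        elif is_end:
            seen_end = True
    return starts


tokens = ["This", "is", "a", "sentence", ".", "This", "is", "another", "one", "."]
starts = sentence_starts(tokens)
```

Runs of terminal punctuation such as `?!` stay attached to the sentence they close, because a new sentence only starts once a token outside `punct_chars` is seen.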

## Config and implementation {#config}

The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config).

> #### Example
>
> ```python
> config = {"punct_chars": None}
> nlp.add_pipe("sentencizer", config=config)
> ```

| Setting       | Description                                                                                                                                            |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`. ~~Optional[List[str]]~~ |

```python
%%GITHUB_SPACY/spacy/pipeline/sentencizer.pyx
```

## Sentencizer.\_\_init\_\_ {#init tag="method"}

Initialize the sentencizer.

> #### Example
>
> ```python
> # Construction via add_pipe
> sentencizer = nlp.add_pipe("sentencizer")
>
> # Construction from class
> from spacy.pipeline import Sentencizer
> sentencizer = Sentencizer()
> ```

| Name           | Description                                                                                                             |
| -------------- | ----------------------------------------------------------------------------------------------------------------------- |
| _keyword-only_ |                                                                                                                         |
| `punct_chars`  | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. ~~Optional[List[str]]~~ |

```python
### punct_chars defaults
['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።',
 '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫',
 '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉',
 '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈',
 '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '！', '．', '？',
 '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅',
 '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂',
 '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓',
 '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛',
 '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '｡', '。']
```
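
To see the effect of a custom `punct_chars` list, the following sketch restricts sentence ends to a semicolon. It assumes a blank English pipeline created with `spacy.blank("en")`, and the example text is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
# Only ";" marks a sentence end here, so "." no longer splits sentences
nlp.add_pipe("sentencizer", config={"punct_chars": [";"]})

doc = nlp("First clause; second clause. Still the second sentence")
sentences = [sent.text for sent in doc.sents]
```

Passing a custom list replaces the defaults rather than extending them, which is why the period is ignored in this sketch.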

## Sentencizer.\_\_call\_\_ {#call tag="method"}

Apply the sentencizer on a `Doc`. Typically, this happens automatically after
the component has been added to the pipeline using
[`nlp.add_pipe`](/api/language#add_pipe).

> #### Example
>
> ```python
> from spacy.lang.en import English
>
> nlp = English()
> nlp.add_pipe("sentencizer")
> doc = nlp("This is a sentence. This is another sentence.")
> assert len(list(doc.sents)) == 2
> ```

| Name        | Description                                                          |
| ----------- | -------------------------------------------------------------------- |
| `doc`       | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ |
| **RETURNS** | The modified `Doc` with added sentence boundaries. ~~Doc~~           |

## Sentencizer.pipe {#pipe tag="method"}

Apply the pipe to a stream of documents. This usually happens under the hood
when the `nlp` object is called on a text and all pipeline components are
applied to the `Doc` in order.

> #### Example
>
> ```python
> sentencizer = nlp.add_pipe("sentencizer")
> for doc in sentencizer.pipe(docs, batch_size=50):
>     pass
> ```

| Name           | Description                                                   |
| -------------- | ------------------------------------------------------------- |
| `stream`       | A stream of documents. ~~Iterable[Doc]~~                      |
| _keyword-only_ |                                                               |
| `batch_size`   | The number of documents to buffer. Defaults to `128`. ~~int~~ |
| **YIELDS**     | The processed documents in order. ~~Doc~~                     |
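
The `docs` variable in the example above is left undefined. One way to build such a stream, assuming a blank English pipeline and made-up texts, is to tokenize with `nlp.make_doc` and pass the resulting `Doc` objects to `pipe`:

```python
import spacy

nlp = spacy.blank("en")
sentencizer = nlp.add_pipe("sentencizer")

texts = ["One sentence. Two sentences.", "Just one here."]
# make_doc only tokenizes, so these docs have no sentence boundaries yet
docs = [nlp.make_doc(text) for text in texts]

sentence_counts = []
for doc in sentencizer.pipe(docs, batch_size=50):
    sentence_counts.append(len(list(doc.sents)))
```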

## Sentencizer.score {#score tag="method" new="3"}

Score a batch of examples.

> #### Example
>
> ```python
> scores = sentencizer.score(examples)
> ```

| Name        | Description                                                                                                           |
| ----------- | --------------------------------------------------------------------------------------------------------------------- |
| `examples`  | The examples to score. ~~Iterable[Example]~~                                                                          |
| **RETURNS** | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
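
As a rough sketch of how `examples` might be built, the snippet below pairs a predicted `Doc` with gold-standard sentence starts via `Example.from_dict`. The text and the gold token indices are assumptions chosen for illustration, and the `sents_f` key assumes the scores follow the naming used by `Scorer.score_spans` for the `sents` attribute.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
sentencizer = nlp.add_pipe("sentencizer")

# The predicted doc gets its sentence boundaries from the sentencizer
predicted = nlp("This is a sentence. This is another sentence.")

# Gold annotation: one boolean per token, True where a sentence starts.
# Tokens 0 ("This") and 5 (the second "This") are assumed to be the starts.
gold = {"sent_starts": [token.i in (0, 5) for token in predicted]}
example = Example.from_dict(predicted, gold)

scores = sentencizer.score([example])
sentence_f_score = scores["sents_f"]
```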

## Sentencizer.to_disk {#to_disk tag="method"}

Save the sentencizer settings (punctuation characters) to disk. This creates a
JSON file such as `sentencizer.json`. It also happens automatically when you
save an `nlp` object with a sentencizer added to its pipeline.

> #### Example
>
> ```python
> config = {"punct_chars": [".", "?", "!", "。"]}
> sentencizer = nlp.add_pipe("sentencizer", config=config)
> sentencizer.to_disk("/path/to/sentencizer.json")
> ```

| Name   | Description                                                                                                                                |
| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a JSON file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |

## Sentencizer.from_disk {#from_disk tag="method"}

Load the sentencizer settings from a file. Expects a JSON file. This also
happens automatically when you load an `nlp` object or model with a sentencizer
added to its pipeline.

> #### Example
>
> ```python
> sentencizer = nlp.add_pipe("sentencizer")
> sentencizer.from_disk("/path/to/sentencizer.json")
> ```

| Name        | Description                                                                                     |
| ----------- | ----------------------------------------------------------------------------------------------- |
| `path`      | A path to a JSON file. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| **RETURNS** | The modified `Sentencizer` object. ~~Sentencizer~~                                              |

## Sentencizer.to_bytes {#to_bytes tag="method"}

Serialize the sentencizer settings to a bytestring.

> #### Example
>
> ```python
> config = {"punct_chars": [".", "?", "!", "。"]}
> sentencizer = nlp.add_pipe("sentencizer", config=config)
> sentencizer_bytes = sentencizer.to_bytes()
> ```

| Name        | Description                    |
| ----------- | ------------------------------ |
| **RETURNS** | The serialized data. ~~bytes~~ |

## Sentencizer.from_bytes {#from_bytes tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

> #### Example
>
> ```python
> sentencizer_bytes = sentencizer.to_bytes()
> sentencizer = nlp.add_pipe("sentencizer")
> sentencizer.from_bytes(sentencizer_bytes)
> ```

| Name         | Description                                        |
| ------------ | -------------------------------------------------- |
| `bytes_data` | The bytestring to load. ~~bytes~~                  |
| **RETURNS**  | The modified `Sentencizer` object. ~~Sentencizer~~ |
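
A minimal round-trip sketch, assuming two components constructed directly from the class: the settings serialized from one `Sentencizer` are loaded into another. Reading the result back through the `punct_chars` attribute is an assumption about the component's internals rather than something this page documents.

```python
from spacy.pipeline import Sentencizer

custom_chars = [".", "?", "!", "。"]
source = Sentencizer(punct_chars=custom_chars)
sentencizer_bytes = source.to_bytes()

# A fresh component starts with the default characters and picks up
# the custom ones once the bytestring is loaded
restored = Sentencizer().from_bytes(sentencizer_bytes)
restored_chars = set(restored.punct_chars)
```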