---
title: Sentencizer
tag: class
source: spacy/pipeline/sentencizer.pyx
teaser: 'Pipeline component for rule-based sentence boundary detection'
api_base_class: /api/pipe
api_string_name: sentencizer
api_trainable: false
---

A simple pipeline component that allows custom sentence boundary detection
logic without requiring the dependency parse. By default, sentence segmentation
is performed by the [`DependencyParser`](/api/dependencyparser), so the
`Sentencizer` lets you implement a simpler, rule-based strategy that doesn't
require a statistical model to be loaded.

## Config and implementation {#config}

The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config).

> #### Example
>
> ```python
> config = {"punct_chars": None}
> nlp.add_pipe("sentencizer", config=config)
> ```

| Setting | Type | Description | Default |
| ------------- | ----------- | ---------------------------------------------------------------------------------------------------------- | ------- |
| `punct_chars` | `List[str]` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. | `None` |

```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/sentencizer.pyx
```
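
As a rough illustration of the `config.cfg` route, the sketch below embeds a hypothetical config excerpt as a string and reads the sentencizer block using only the standard library. The section name `[components.sentencizer]` and the `factory` key follow spaCy's training config layout. Decoding the values with `json` is an assumption made to keep the example self-contained; spaCy uses its own config loader.

```python
import configparser
import json

# Hypothetical excerpt of a training config. `null` leaves punct_chars unset,
# so the component falls back to its built-in defaults.
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["sentencizer"]

[components.sentencizer]
factory = "sentencizer"
punct_chars = null
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# The values are JSON-like, so decode each one (assumption for this sketch).
sentencizer_settings = {
    key: json.loads(value)
    for key, value in parser["components.sentencizer"].items()
}
print(sentencizer_settings)
```

Replacing `null` with a JSON list such as `[".", "?", "!"]` would correspond to passing a custom `punct_chars` list via `nlp.add_pipe`.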

## Sentencizer.\_\_init\_\_ {#init tag="method"}

Initialize the sentencizer.

> #### Example
>
> ```python
> # Construction via add_pipe
> sentencizer = nlp.add_pipe("sentencizer")
>
> # Construction from class
> from spacy.pipeline import Sentencizer
> sentencizer = Sentencizer()
> ```

| Name | Type | Description |
| -------------- | ----------- | ----------------------------------------------------------------------------------------------- |
| _keyword-only_ | | |
| `punct_chars` | `List[str]` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. |

```python
### punct_chars defaults
['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።',
 '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫',
 '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉',
 '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈',
 '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '！', '．', '？',
 '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅',
 '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂',
 '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓',
 '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛',
 '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '｡', '。']
```

## Sentencizer.\_\_call\_\_ {#call tag="method"}

Apply the sentencizer on a `Doc`. Typically, this happens automatically after
the component has been added to the pipeline using
[`nlp.add_pipe`](/api/language#add_pipe).

> #### Example
>
> ```python
> from spacy.lang.en import English
>
> nlp = English()
> nlp.add_pipe("sentencizer")
> doc = nlp("This is a sentence. This is another sentence.")
> assert len(list(doc.sents)) == 2
> ```

| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------ |
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with added sentence boundaries. |
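
To make the rule-based idea concrete, here is a minimal pure-Python sketch of punctuation-driven boundary detection on a list of token strings. It is a simplified illustration, not spaCy's implementation: the helper names `mark_sentence_starts` and `group_sentences` are hypothetical, tokens are plain strings rather than `Token` objects, and only three punctuation characters are used.

```python
from typing import List, Sequence


def mark_sentence_starts(
    tokens: Sequence[str], punct_chars: Sequence[str] = (".", "!", "?")
) -> List[bool]:
    """Return one flag per token: True if the token opens a new sentence."""
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True  # the first token always opens a sentence
    seen_end_punct = False
    for i, token in enumerate(tokens):
        if token in punct_chars:
            # Remember that a sentence-final mark was seen; runs such as
            # "?!" stay attached to the current sentence.
            seen_end_punct = True
        elif seen_end_punct:
            # First ordinary token after end punctuation starts a sentence.
            starts[i] = True
            seen_end_punct = False
    return starts


def group_sentences(tokens: Sequence[str], starts: Sequence[bool]) -> List[List[str]]:
    """Group tokens into sentences based on the start flags."""
    sentences: List[List[str]] = []
    for token, is_start in zip(tokens, starts):
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(token)
    return sentences


tokens = "This is a sentence . This is another sentence .".split()
starts = mark_sentence_starts(tokens)
sentences = group_sentences(tokens, starts)
print(len(sentences))
```

Passing a different `punct_chars` sequence changes where the boundaries fall, which mirrors the purpose of the `punct_chars` setting.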

## Sentencizer.pipe {#pipe tag="method"}

Apply the pipe to a stream of documents. This usually happens under the hood
when the `nlp` object is called on a text and all pipeline components are
applied to the `Doc` in order.

> #### Example
>
> ```python
> sentencizer = nlp.add_pipe("sentencizer")
> for doc in sentencizer.pipe(docs, batch_size=50):
>     pass
> ```

| Name | Type | Description |
| -------------- | --------------- | ----------------------------------------------------- |
| `stream` | `Iterable[Doc]` | A stream of documents. |
| _keyword-only_ | | |
| `batch_size` | int | The number of documents to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | The processed documents in order. |
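
The buffering behaviour described by `batch_size` can be pictured with a small generator. This is a generic sketch under stated assumptions, not the library's code: `pipe_in_batches` and the `process` callback are hypothetical names, and plain strings stand in for `Doc` objects.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def pipe_in_batches(
    stream: Iterable[T], process: Callable[[T], T], batch_size: int = 128
) -> Iterator[T]:
    """Buffer up to `batch_size` items at a time and yield them in order."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        for item in batch:
            yield process(item)


# Strings stand in for documents; str.upper stands in for the component.
results = list(pipe_in_batches(["a", "b", "c"], str.upper, batch_size=2))
print(results)
```

Because the function is a generator, nothing is consumed from the stream until the caller starts iterating, and output order always matches input order.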

## Sentencizer.score {#score tag="method" new="3"}

Score a batch of examples.

> #### Example
>
> ```python
> scores = sentencizer.score(examples)
> ```

| Name | Type | Description |
| ----------- | ------------------- | ------------------------------------------------------------------------ |
| `examples` | `Iterable[Example]` | The examples to score. |
| **RETURNS** | `Dict[str, Any]` | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans). |
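
Span-based scoring of sentence boundaries boils down to precision, recall and F-score over predicted versus gold sentence spans. The sketch below shows that arithmetic on `(start, end)` token offsets. The function name `score_sentence_spans` is hypothetical, and the key names `sents_p`, `sents_r` and `sents_f` are an assumption about how the returned dictionary is laid out.

```python
from typing import Dict, Iterable, Tuple

Span = Tuple[int, int]


def score_sentence_spans(gold: Iterable[Span], pred: Iterable[Span]) -> Dict[str, float]:
    """Compute precision, recall and F-score over exact span matches."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    # Key names are assumed for illustration.
    return {"sents_p": precision, "sents_r": recall, "sents_f": f_score}


# Gold has two sentences; the prediction wrongly splits the second one.
scores = score_sentence_spans(gold=[(0, 5), (5, 10)], pred=[(0, 5), (5, 8), (8, 10)])
print(scores)
```

A span only counts as correct when both its start and end match, so one extra boundary lowers precision and recall at the same time.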

## Sentencizer.to_disk {#to_disk tag="method"}

Save the sentencizer settings (punctuation characters) to disk. Will create
a file `sentencizer.json`. This also happens automatically when you save an
`nlp` object with a sentencizer added to its pipeline.

> #### Example
>
> ```python
> config = {"punct_chars": [".", "?", "!", "。"]}
> sentencizer = nlp.add_pipe("sentencizer", config=config)
> sentencizer.to_disk("/path/to/sentencizer.json")
> ```

| Name | Type | Description |
| ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
| `path` | str / `Path` | A path to a JSON file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |

## Sentencizer.from_disk {#from_disk tag="method"}

Load the sentencizer settings from a file. Expects a JSON file. This also
happens automatically when you load an `nlp` object or model with a sentencizer
added to its pipeline.

> #### Example
>
> ```python
> sentencizer = nlp.add_pipe("sentencizer")
> sentencizer.from_disk("/path/to/sentencizer.json")
> ```

| Name | Type | Description |
| ----------- | ------------- | -------------------------------------------------------------------------- |
| `path` | str / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
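
Since the only state to persist is the list of punctuation characters, a save and load round trip can be sketched with the standard library alone. The helpers `save_settings` and `load_settings` are hypothetical, and the single `punct_chars` key is an assumption about the file layout rather than a documented format.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional, Sequence, Union


def save_settings(path: Union[str, Path], punct_chars: Sequence[str]) -> None:
    """Write the punctuation characters to a JSON file."""
    payload = {"punct_chars": list(punct_chars)}  # assumed file layout
    Path(path).write_text(json.dumps(payload), encoding="utf8")


def load_settings(path: Union[str, Path]) -> Optional[List[str]]:
    """Read the punctuation characters back from a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf8"))
    return data.get("punct_chars")


with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "sentencizer.json"
    save_settings(json_path, [".", "?", "!", "。"])
    restored = load_settings(json_path)

print(restored)
```

Non-ASCII characters such as `。` survive the round trip because JSON escapes them on write and restores them on read.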

## Sentencizer.to_bytes {#to_bytes tag="method"}

Serialize the sentencizer settings to a bytestring.

> #### Example
>
> ```python
> config = {"punct_chars": [".", "?", "!", "。"]}
> sentencizer = nlp.add_pipe("sentencizer", config=config)
> sentencizer_bytes = sentencizer.to_bytes()
> ```

| Name | Type | Description |
| ----------- | ----- | -------------------- |
| **RETURNS** | bytes | The serialized data. |

## Sentencizer.from_bytes {#from_bytes tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

> #### Example
>
> ```python
> sentencizer_bytes = sentencizer.to_bytes()
> sentencizer = nlp.add_pipe("sentencizer")
> sentencizer.from_bytes(sentencizer_bytes)
> ```

| Name | Type | Description |
| ------------ | ------------- | ---------------------------------- |
| `bytes_data` | bytes | The bytestring to load. |
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |