---
title: SentenceSegmenter
tag: class
source: spacy/pipeline/hooks.py
---

A simple spaCy hook that allows custom sentence boundary detection logic that
doesn't require the dependency parse. By default, sentence segmentation is
performed by the [`DependencyParser`](/api/dependencyparser). The
`SentenceSegmenter` lets you implement a simpler, rule-based strategy instead,
one that doesn't require a statistical model to be loaded. The component is also
available via the string name `"sentencizer"`. After initialization, it is
typically added to the processing pipeline using
[`nlp.add_pipe`](/api/language#add_pipe).

## SentenceSegmenter.\_\_init\_\_ {#init tag="method"}

Initialize the sentence segmenter. To change the sentence boundary detection
strategy, pass a generator function `strategy` on initialization, or assign a
new strategy to the `.strategy` attribute. Sentence detection strategies should
be generators that take a `Doc` object and yield one `Span` object for each
sentence.

> #### Example
>
> ```python
> # Construction via create_pipe
> sentencizer = nlp.create_pipe("sentencizer")
>
> # Construction from class
> from spacy.pipeline import SentenceSegmenter
> sentencizer = SentenceSegmenter(nlp.vocab)
> ```

| Name        | Type                | Description                                                 |
| ----------- | ------------------- | ----------------------------------------------------------- |
| `vocab`     | `Vocab`             | The shared vocabulary.                                      |
| `strategy`  | unicode / callable  | The segmentation strategy to use. Defaults to `"on_punct"`. |
| **RETURNS** | `SentenceSegmenter` | The newly constructed object.                               |
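
As a minimal sketch of what a custom strategy might look like, the generator
below starts a new sentence after each newline. The name `split_on_newlines` is
illustrative and not part of the API described here. The function assumes only
that tokens expose `text`, `is_space` and `i`, and that slicing the `Doc`
returns a `Span`.

```python
def split_on_newlines(doc):
    """Yield one span per line of text (illustrative strategy, not a built-in)."""
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline and not word.is_space:
            # First non-whitespace token after a newline: close the sentence.
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif "\n" in word.text:
            # A run of newlines may arrive as a single whitespace token.
            seen_newline = True
    if start < len(doc):
        # Emit whatever is left as the final sentence.
        yield doc[start:len(doc)]
```

Going by the `strategy` argument in the table above, such a function could be
passed as `SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)`, or
assigned to the `.strategy` attribute of an existing segmenter.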

## SentenceSegmenter.\_\_call\_\_ {#call tag="method"}

Apply the sentence segmenter to a `Doc`. Typically, this happens automatically
after the component has been added to the pipeline using
[`nlp.add_pipe`](/api/language#add_pipe).

> #### Example
>
> ```python
> from spacy.lang.en import English
>
> nlp = English()
> sentencizer = nlp.create_pipe("sentencizer")
> nlp.add_pipe(sentencizer)
> doc = nlp(u"This is a sentence. This is another sentence.")
> # doc.sents is a generator, so materialize it before counting
> assert len(list(doc.sents)) == 2
> ```

| Name        | Type  | Description                                                  |
| ----------- | ----- | ------------------------------------------------------------ |
| `doc`       | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with added sentence boundaries.           |

## SentenceSegmenter.split_on_punct {#split_on_punct tag="staticmethod"}

Split the `Doc` on the punctuation characters `.`, `!` and `?`. This is the
default strategy used by the `SentenceSegmenter`.

| Name       | Type   | Description                    |
| ---------- | ------ | ------------------------------ |
| `doc`      | `Doc`  | The `Doc` object to process.   |
| **YIELDS** | `Span` | The sentences in the document. |
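
The behavior described above can be sketched as a plain generator. This is an
illustrative reimplementation, not the library source: it assumes that a
sentence ends after a run of closing punctuation and that tokens expose `text`,
`is_punct` and `i`. The name `split_on_punct_sketch` is hypothetical.

```python
def split_on_punct_sketch(doc):
    """Yield a span for each sentence, splitting after '.', '!' or '?'."""
    start = 0
    seen_period = False
    for word in doc:
        if seen_period and not word.is_punct:
            # First non-punctuation token after a terminator starts a new sentence.
            yield doc[start:word.i]
            start = word.i
            seen_period = False
        elif word.text in (".", "!", "?"):
            seen_period = True
    if start < len(doc):
        # The final sentence runs to the end of the document.
        yield doc[start:len(doc)]
```

Waiting for the first non-punctuation token, rather than splitting immediately
at the terminator, keeps sequences such as `?!` attached to the sentence they
close.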

## Attributes {#attributes}

| Name       | Type     | Description                                                         |
| ---------- | -------- | ------------------------------------------------------------------- |
| `strategy` | callable | The segmentation strategy. Can be overwritten after initialization. |