@@ -1187,9 +1214,11 @@ adding it to the pipeline using [`nlp.add_pipe`](/api/language#add_pipe).
Here's an example of a component that implements a pre-processing rule for
splitting on `'...'` tokens. The component is added before the parser, which is
-then used to further segment the text. This approach can be useful if you want
-to implement **additional** rules specific to your data, while still being able
-to take advantage of dependency-based sentence segmentation.
+then used to further segment the text. That's possible because `is_sent_start`
+is only set to `True` for some of the tokens – all others are left as `None`,
+indicating an unset sentence boundary. This approach can be useful if you want to
+implement **additional** rules specific to your data, while still being able to
+take advantage of dependency-based sentence segmentation.
```python
### {executable="true"}
@@ -1212,62 +1241,6 @@ doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])
```
-### Rule-based pipeline component {#sbd-component}
-
-The `sentencizer` component is a
-[pipeline component](/usage/processing-pipelines) that splits sentences on
-punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you only
-need sentence boundaries without the dependency parse. Note that `Doc.sents`
-will **raise an error** if no sentence boundaries are set.
-
-```python
-### {executable="true"}
-import spacy
-from spacy.lang.en import English
-
-nlp = English() # just the language with no model
-sentencizer = nlp.create_pipe("sentencizer")
-nlp.add_pipe(sentencizer)
-doc = nlp(u"This is a sentence. This is another sentence.")
-for sent in doc.sents:
- print(sent.text)
-```
-
-### Custom rule-based strategy {#sbd-custom}
-
-If you want to implement your own strategy that differs from the default
-rule-based approach of splitting on sentences, you can also instantiate the
-`SentenceSegmenter` directly and pass in your own strategy. The strategy should
-be a function that takes a `Doc` object and yields a `Span` for each sentence.
-Here's an example of a custom segmentation strategy for splitting on newlines
-only:
-
-```python
-### {executable="true"}
-from spacy.lang.en import English
-from spacy.pipeline import SentenceSegmenter
-
-def split_on_newlines(doc):
- start = 0
- seen_newline = False
- for word in doc:
- if seen_newline and not word.is_space:
- yield doc[start:word.i]
- start = word.i
- seen_newline = False
- elif word.text == '\\n':
- seen_newline = True
- if start < len(doc):
- yield doc[start:len(doc)]
-
-nlp = English() # Just the language with no model
-sentencizer = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
-nlp.add_pipe(sentencizer)
-doc = nlp(u"This is a sentence\\n\\nThis is another sentence\\nAnd more")
-for sent in doc.sents:
- print([token.text for token in sent])
-```
-
## Rule-based matching {#rule-based-matching hidden="true"}
diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 16bedce50..8eaf81652 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -138,7 +138,7 @@ require them in the pipeline settings in your model's `meta.json`.
| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. |
| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules. |
-| `sentencizer` | [`SentenceSegmenter`](/api/sentencesegmenter) | Add rule-based sentence segmentation without the dependency parse. |
+| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
| `merge_noun_chunks` | [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) | Merge all noun chunks into a single token. Should be added after the tagger and parser. |
| `merge_entities` | [`merge_entities`](/api/pipeline-functions#merge_entities) | Merge all entities into a single token. Should be added after the entity recognizer. |
| `merge_subtokens` | [`merge_subtokens`](/api/pipeline-functions#merge_subtokens) | Merge subtokens predicted by the parser into single tokens. Should be added after the parser. |
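
For the renamed `sentencizer` entry above, usage via `create_pipe`/`add_pipe` follows the same pattern as the example removed from the segmentation docs – a minimal sketch (the example text is reused from that removed section):

```python
from spacy.lang.en import English

nlp = English()  # just the language with no model
# "sentencizer" is the factory name from the table above; in v2.1 it is
# backed by the Sentencizer class rather than SentenceSegmenter
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```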
diff --git a/website/docs/usage/v2-1.md b/website/docs/usage/v2-1.md
index 271440dba..0ba6fa407 100644
--- a/website/docs/usage/v2-1.md
+++ b/website/docs/usage/v2-1.md
@@ -195,7 +195,7 @@ the existing pages and added some new content:
- **Universe:** [Videos](/universe/category/videos) and
[Podcasts](/universe/category/podcasts)
- **API:** [`EntityRuler`](/api/entityruler)
-- **API:** [`SentenceSegmenter`](/api/sentencesegmenter)
+- **API:** [`Sentencizer`](/api/sentencizer)
- **API:** [Pipeline functions](/api/pipeline-functions)
## Backwards incompatibilities {#incompat}
diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json
index fb4075ee5..bc8a70ea0 100644
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -79,7 +79,7 @@
{ "text": "Matcher", "url": "/api/matcher" },
{ "text": "PhraseMatcher", "url": "/api/phrasematcher" },
{ "text": "EntityRuler", "url": "/api/entityruler" },
- { "text": "SentenceSegmenter", "url": "/api/sentencesegmenter" },
+ { "text": "Sentencizer", "url": "/api/sentencizer" },
{ "text": "Other Functions", "url": "/api/pipeline-functions" }
]
},