diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md
index f259174e2..136bac3c8 100644
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -116,10 +116,12 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
 > matches = matcher(doc)
 > ```
 
-| Name | Description |
-| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
-| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ |
+| Name | Description |
+| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
+| _keyword-only_ | |
+| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
+| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
 
 ## Matcher.pipe {#pipe tag="method"}
 
diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md
index 143eb9edf..8064a621e 100644
--- a/website/docs/api/phrasematcher.md
+++ b/website/docs/api/phrasematcher.md
@@ -57,10 +57,12 @@ Find all token sequences matching the supplied patterns on the `Doc`.
 > matches = matcher(doc)
 > ```
 
-| Name | Description |
-| ----------- | ----------------------------------- |
-| `doc` | The document to match over. ~~Doc~~ |
-| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. ~~List[Tuple[int, int, int]]~~ |
+| Name | Description |
+| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `doc` | The document to match over. ~~Doc~~ |
+| _keyword-only_ | |
+| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
+| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index 7fdce032e..e3e0f2c19 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -493,6 +493,39 @@ you prefer.
 | `i` | Index of the current match (`matches[i`]). ~~int~~ |
 | `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~ List[Tuple[int, int int]]~~ |
 
+### Creating spans from matches {#matcher-spans}
+
+Creating [`Span`](/api/span) objects from the returned matches is a very common
+use case. spaCy makes this easy by giving you access to the `start` and `end`
+token indices of each match, which you can use to construct a new span with an
+optional label. As of spaCy v3.0, you can also set `as_spans=True` when calling
+the matcher on a `Doc`, which will return a list of [`Span`](/api/span) objects
+using the `match_id` as the span label.
+
+```python
+### {executable="true"}
+import spacy
+from spacy.matcher import Matcher
+from spacy.tokens import Span
+
+nlp = spacy.blank("en")
+matcher = Matcher(nlp.vocab)
+matcher.add("PERSON", [[{"lower": "barack"}, {"lower": "obama"}]])
+doc = nlp("Barack Obama was the 44th president of the United States")
+
+# 1. Return (match_id, start, end) tuples
+matches = matcher(doc)
+for match_id, start, end in matches:
+    # Create the matched span and assign the match_id as a label
+    span = Span(doc, start, end, label=match_id)
+    print(span.text, span.label_)
+
+# 2. Return Span objects directly
+matches = matcher(doc, as_spans=True)
+for span in matches:
+    print(span.text, span.label_)
+```
+
 ### Using custom pipeline components {#matcher-pipeline}
 
 Let's say your data also contains some annoying pre-processing artifacts, like