mirror of https://github.com/explosion/spaCy.git
Support match alignments (#7321)
* Support match alignments
* Change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case
* Remove added errors, utilize bint type, clean up whitespace
* Fix missing newline at end of file
* Minor formatting
* Skip alignments processing if as_spans is set
* Add with_alignments to Matcher API docs
* Update website/docs/api/matcher.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
This commit is contained in:
parent ff84075839
commit ee159b8543
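For context, a minimal usage sketch of the new `with_alignments` option (not part of the commit; it assumes a blank English pipeline, and the pattern and text mirror the test cases added below):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Pattern equivalent to "a+ b" in the tests below
matcher.add("PATTERN", [[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]], greedy="LONGEST")

doc = nlp("a a a b")
# With with_alignments=True, each match tuple gains a fourth element:
# one pattern-token index per matched doc token.
for match_id, start, end, alignments in matcher(doc, with_alignments=True):
    print(doc[start:end].text, alignments)  # "a a a b" [0, 0, 0, 1]
```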
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
  assignment is or becomes invalid, ineffective or unenforceable, you hereby
  grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
  royalty-free, unrestricted license to exercise all rights under those
  copyrights. This includes, at our option, the right to sublicense these same
  rights to third parties through multiple levels of sublicensees or other
  licensing arrangements;

* you agree that each of us can do all things in relation to your
  contribution as if each of us were the sole owners, and if one of us makes
  a derivative work of your contribution, the one who makes the derivative
  work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
  against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
  exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
  consent of, pay or render an accounting to the other for any use or
  distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
  your contribution in whole or in part, alone or in combination with or
  included in any product, work or materials arising out of the project to
  which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
  multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
  authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
  third party's copyrights, trademarks, patents, or other intellectual
  property rights; and

* each contribution shall be in compliance with U.S. export control laws and
  other applicable export and import laws. You agree to notify us if you
  become aware of any circumstance which would make any of the foregoing
  representations inaccurate in any respect. We may publicly disclose your
  participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry        |
| ------------------------------ | ------------ |
| Name                           | Dongjun Park |
| Company name (if applicable)   |              |
| Title or role (if applicable)  |              |
| Date                           | 2021-03-06   |
| GitHub username                | broaddeep    |
| Website (optional)             |              |
@@ -46,6 +46,12 @@ cdef struct TokenPatternC:
     int32_t nr_py
     quantifier_t quantifier
     hash_t key
+    int32_t token_idx


+cdef struct MatchAlignmentC:
+    int32_t token_idx
+    int32_t length
+
+
 cdef struct PatternStateC:
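For orientation, a rough Python analogue of the new `MatchAlignmentC` struct (illustrative only; the real type is a C struct used inside the matcher loop, and this dataclass is not part of the codebase):

```python
from dataclasses import dataclass

@dataclass
class MatchAlignment:
    token_idx: int  # index of the token pattern this entry refers to
    length: int     # number of doc tokens matched when the entry was recorded
```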
@@ -196,7 +196,7 @@ cdef class Matcher:
         else:
             yield doc

-    def __call__(self, object doclike, *, as_spans=False, allow_missing=False):
+    def __call__(self, object doclike, *, as_spans=False, allow_missing=False, with_alignments=False):
         """Find all token sequences matching the supplied pattern.

         doclike (Doc or Span): The document to match over.
@@ -204,10 +204,16 @@ cdef class Matcher:
             start, end) tuples.
         allow_missing (bool): Whether to skip checks for missing annotation for
             attributes included in patterns. Defaults to False.
+        with_alignments (bool): Return match alignment information, a
+            `List[int]` with the same length as the matched span. Each entry
+            denotes the index of the corresponding token pattern. If as_spans
+            is set to True, this setting is ignored.
         RETURNS (list): A list of `(match_id, start, end)` tuples,
             describing the matches. A match tuple describes a span
             `doc[start:end]`. The `match_id` is an integer. If as_spans is set
             to True, a list of Span objects is returned.
+            If with_alignments is set to True and as_spans is set to False,
+            a list of `(match_id, start, end, alignments)` tuples is returned.
         """
         if isinstance(doclike, Doc):
             doc = doclike
@@ -217,6 +223,9 @@ cdef class Matcher:
             length = doclike.end - doclike.start
         else:
             raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
+        # Skip alignments calculations if as_spans is set
+        if as_spans:
+            with_alignments = False
         cdef Pool tmp_pool = Pool()
         if not allow_missing:
             for attr in (TAG, POS, MORPH, LEMMA, DEP):
@@ -232,18 +241,20 @@ cdef class Matcher:
                     error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
                     raise ValueError(error_msg)
         matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
-                               extensions=self._extensions, predicates=self._extra_predicates)
+                               extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
         final_matches = []
         pairs_by_id = {}
-        # For each key, either add all matches, or only the filtered, non-overlapping ones
-        for (key, start, end) in matches:
+        # For each key, either add all matches, or only the filtered,
+        # non-overlapping ones. Here `match` can be either (start, end) or
+        # (start, end, alignments), depending on the `with_alignments` option.
+        for key, *match in matches:
             span_filter = self._filter.get(key)
             if span_filter is not None:
                 pairs = pairs_by_id.get(key, [])
-                pairs.append((start,end))
+                pairs.append(match)
                 pairs_by_id[key] = pairs
             else:
-                final_matches.append((key, start, end))
+                final_matches.append((key, *match))
         matched = <char*>tmp_pool.alloc(length, sizeof(char))
         empty = <char*>tmp_pool.alloc(length, sizeof(char))
         for key, pairs in pairs_by_id.items():
@@ -255,14 +266,18 @@ cdef class Matcher:
                 sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True)  # reverse sort by length
             else:
                 raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter))
-            for (start, end) in sorted_pairs:
+            for match in sorted_pairs:
+                start, end = match[:2]
                 assert 0 <= start < end  # Defend against segfaults
                 span_len = end-start
                 # If no tokens in the span have matched
                 if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0:
-                    final_matches.append((key, start, end))
+                    final_matches.append((key, *match))
                     # Mark tokens that have matched
                     memset(&matched[start], 1, span_len * sizeof(matched[0]))
+        if with_alignments:
+            final_matches_with_alignments = final_matches
+            final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
         # perform the callbacks on the filtered set of results
         for i, (key, start, end) in enumerate(final_matches):
             on_match = self._callbacks.get(key, None)
@@ -270,6 +285,22 @@ cdef class Matcher:
                 on_match(self, doc, i, final_matches)
         if as_spans:
             return [Span(doc, start, end, label=key) for key, start, end in final_matches]
+        elif with_alignments:
+            # convert alignments List[Dict[str, int]] --> List[int]
+            final_matches = []
+            # when multiple alignments share the same length, keep the
+            # alignment with the largest token_idx
+            for key, start, end, alignments in final_matches_with_alignments:
+                sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
+                alignments = [0] * (end-start)
+                for align in sorted_alignments:
+                    if align['length'] >= end-start:
+                        continue
+                    # Since alignments are sorted in order of (length, token_idx),
+                    # this overwrites a smaller token_idx when lengths are equal.
+                    alignments[align['length']] = align['token_idx']
+                final_matches.append((key, start, end, alignments))
+            return final_matches
         else:
             return final_matches
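To make the conversion above concrete, here is a pure-Python sketch with made-up raw alignment entries (in the real code they come from `find_matches`): each entry maps a position inside the match (`length`) to the pattern token (`token_idx`) active there, and on equal lengths the later (larger `token_idx`) entry wins.

```python
# Hypothetical raw alignment entries for a 3-token match (illustration only)
raw = [
    {"length": 0, "token_idx": 0},
    {"length": 1, "token_idx": 0},
    {"length": 1, "token_idx": 1},  # same length: the larger token_idx wins
    {"length": 2, "token_idx": 2},
]
start, end = 0, 3
aligned = [0] * (end - start)
for align in sorted(raw, key=lambda x: (x["length"], x["token_idx"])):
    if align["length"] >= end - start:
        continue
    aligned[align["length"]] = align["token_idx"]
print(aligned)  # [0, 1, 2]
```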
@@ -288,9 +319,9 @@ def unpickle_matcher(vocab, patterns, callbacks):
     return matcher


-cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
+cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple(), bint with_alignments=0):
     """Find matches in a doc, with a compiled array of patterns. Matches are
-    returned as a list of (id, start, end) tuples.
+    returned as a list of (id, start, end) tuples or, if with_alignments != 0,
+    of (id, start, end, alignments) tuples.

     To augment the compiled patterns, we optionally also take two Python lists.
@@ -302,6 +333,8 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
     """
     cdef vector[PatternStateC] states
     cdef vector[MatchC] matches
+    cdef vector[vector[MatchAlignmentC]] align_states
+    cdef vector[vector[MatchAlignmentC]] align_matches
     cdef PatternStateC state
     cdef int i, j, nr_extra_attr
     cdef Pool mem = Pool()
@@ -328,12 +361,14 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
     for i in range(length):
         for j in range(n):
             states.push_back(PatternStateC(patterns[j], i, 0))
-        transition_states(states, matches, predicate_cache,
-                          doclike[i], extra_attr_values, predicates)
+        if with_alignments != 0:
+            align_states.resize(states.size())
+        transition_states(states, matches, align_states, align_matches, predicate_cache,
+                          doclike[i], extra_attr_values, predicates, with_alignments)
         extra_attr_values += nr_extra_attr
         predicate_cache += len(predicates)
     # Handle matches that end in 0-width patterns
-    finish_states(matches, states)
+    finish_states(matches, states, align_matches, align_states, with_alignments)
     seen = set()
     for i in range(matches.size()):
         match = (
@@ -346,16 +381,22 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
         # first .?, or the second .? -- it doesn't matter, it's just one match.
         # Skip 0-length matches. (TODO: fix algorithm)
         if match not in seen and matches[i].length > 0:
-            output.append(match)
+            if with_alignments != 0:
+                # since align_matches has the same length as matches, the two share the index 'i'
+                output.append(match + (align_matches[i],))
+            else:
+                output.append(match)
             seen.add(match)
     return output


 cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
+                            vector[vector[MatchAlignmentC]]& align_states, vector[vector[MatchAlignmentC]]& align_matches,
                             int8_t* cached_py_predicates,
-                            Token token, const attr_t* extra_attrs, py_predicates) except *:
+                            Token token, const attr_t* extra_attrs, py_predicates, bint with_alignments) except *:
     cdef int q = 0
     cdef vector[PatternStateC] new_states
+    cdef vector[vector[MatchAlignmentC]] align_new_states
     cdef int nr_predicate = len(py_predicates)
     for i in range(states.size()):
         if states[i].pattern.nr_py >= 1:
@@ -370,23 +411,39 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
             # it in the states list, because q doesn't advance.
             state = states[i]
             states[q] = state
+            # Kept separate from states, so performance is preserved for users
+            # who only need the basic options (without alignments).
+            # `align_states` always corresponds to `states` 1:1.
+            if with_alignments != 0:
+                align_state = align_states[i]
+                align_states[q] = align_state
         while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND):
+            # Update alignment before the transition of the current state.
+            # 'MatchAlignmentC' maps the 'original token index of the current
+            # pattern' to the 'current matching length'.
+            if with_alignments != 0:
+                align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
             if action == RETRY_EXTEND:
                 # This handles the 'extend'
                 new_states.push_back(
                     PatternStateC(pattern=states[q].pattern, start=state.start,
                                   length=state.length+1))
+                if with_alignments != 0:
+                    align_new_states.push_back(align_states[q])
             if action == RETRY_ADVANCE:
                 # This handles the 'advance'
                 new_states.push_back(
                     PatternStateC(pattern=states[q].pattern+1, start=state.start,
                                   length=state.length+1))
+                if with_alignments != 0:
+                    align_new_states.push_back(align_states[q])
             states[q].pattern += 1
             if states[q].pattern.nr_py != 0:
                 update_predicate_cache(cached_py_predicates,
                                        states[q].pattern, token, py_predicates)
             action = get_action(states[q], token.c, extra_attrs,
                                 cached_py_predicates)
+        # Update alignment before the transition of the current state
+        if with_alignments != 0:
+            align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
         if action == REJECT:
             pass
         elif action == ADVANCE:
@@ -399,29 +456,50 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length+1))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
         elif action == MATCH_DOUBLE:
             # push match without last token if length > 0
             if state.length > 0:
                 matches.push_back(
                     MatchC(pattern_id=ent_id, start=state.start,
                            length=state.length))
+                # MATCH_DOUBLE emits matches twice, so add one more entry
+                # to align_matches to keep the 1:1 relationship
+                if with_alignments != 0:
+                    align_matches.push_back(align_states[q])
             # push match with last token
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length+1))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
         elif action == MATCH_REJECT:
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
         elif action == MATCH_EXTEND:
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
             states[q].length += 1
         q += 1
     states.resize(q)
     for i in range(new_states.size()):
         states.push_back(new_states[i])
+    # `align_states` always corresponds to `states` 1:1
+    if with_alignments != 0:
+        align_states.resize(q)
+        for i in range(align_new_states.size()):
+            align_states.push_back(align_new_states[i])


 cdef int update_predicate_cache(int8_t* cache,
@@ -444,15 +522,27 @@ cdef int update_predicate_cache(int8_t* cache,
         raise ValueError(Errors.E125.format(value=result))


-cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states) except *:
+cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states,
+                        vector[vector[MatchAlignmentC]]& align_matches,
+                        vector[vector[MatchAlignmentC]]& align_states,
+                        bint with_alignments) except *:
     """Handle states that end in zero-width patterns."""
     cdef PatternStateC state
+    cdef vector[MatchAlignmentC] align_state
     for i in range(states.size()):
         state = states[i]
+        if with_alignments != 0:
+            align_state = align_states[i]
         while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE):
+            # Update alignment before the transition of the current state
+            if with_alignments != 0:
+                align_state.push_back(MatchAlignmentC(state.pattern.token_idx, state.length))
             is_final = get_is_final(state)
             if is_final:
                 ent_id = get_ent_id(state.pattern)
+                # `align_matches` always corresponds to `matches` 1:1
+                if with_alignments != 0:
+                    align_matches.push_back(align_state)
                 matches.push_back(
                     MatchC(pattern_id=ent_id, start=state.start, length=state.length))
                 break
@@ -607,7 +697,7 @@ cdef int8_t get_quantifier(PatternStateC state) nogil:
 cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL:
     pattern = <TokenPatternC*>mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC))
     cdef int i, index
-    for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs):
+    for i, (quantifier, spec, extensions, predicates, token_idx) in enumerate(token_specs):
         pattern[i].quantifier = quantifier
         # Ensure attrs refers to a null pointer if nr_attr == 0
         if len(spec) > 0:
@@ -628,6 +718,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
             pattern[i].py_predicates[j] = index
         pattern[i].nr_py = len(predicates)
         pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
+        pattern[i].token_idx = token_idx
     i = len(token_specs)
     # Use quantifier to identify final ID pattern node (rather than previous
     # uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
@@ -638,6 +729,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
     pattern[i].nr_attr = 1
     pattern[i].nr_extra_attr = 0
     pattern[i].nr_py = 0
+    pattern[i].token_idx = -1
     return pattern

@@ -655,7 +747,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
     """This function interprets the pattern, converting the various bits of
     syntactic sugar before we compile it into a struct with init_pattern.

-    We need to split the pattern up into three parts:
+    We need to split the pattern up into four parts:
     * Normal attribute/value pairs, which are stored on either the token or lexeme,
       can be handled directly.
     * Extension attributes are handled specially, as we need to prefetch the
@@ -664,13 +756,14 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
       functions and store them. So we store these specially as well.
     * Extension attributes that have extra predicates are stored within the
       extra_predicates.
+    * The token index that each pattern part belongs to.
     """
     tokens = []
     string_store = vocab.strings
-    for spec in token_specs:
+    for token_idx, spec in enumerate(token_specs):
         if not spec:
             # Signifier for 'any token'
-            tokens.append((ONE, [(NULL_ATTR, 0)], [], []))
+            tokens.append((ONE, [(NULL_ATTR, 0)], [], [], token_idx))
             continue
         if not isinstance(spec, dict):
             raise ValueError(Errors.E154.format())
@@ -679,7 +772,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
         extensions = _get_extensions(spec, string_store, extensions_table)
         predicates = _get_extra_predicates(spec, extra_predicates, vocab)
         for op in ops:
-            tokens.append((op, list(attr_values), list(extensions), list(predicates)))
+            tokens.append((op, list(attr_values), list(extensions), list(predicates), token_idx))
     return tokens

@@ -204,3 +204,90 @@ def test_matcher_remove():
     # removing again should throw an error
     with pytest.raises(ValueError):
         matcher.remove("Rule")
+
+
+def test_matcher_with_alignments_greedy_longest(en_vocab):
+    cases = [
+        ("aaab", "a* b", [0, 0, 0, 1]),
+        ("baab", "b a* b", [0, 1, 1, 2]),
+        ("aaab", "a a a b", [0, 1, 2, 3]),
+        ("aaab", "a+ b", [0, 0, 0, 1]),
+        ("aaba", "a+ b a+", [0, 0, 1, 2]),
+        ("aabaa", "a+ b a+", [0, 0, 1, 2, 2]),
+        ("aaba", "a+ b a*", [0, 0, 1, 2]),
+        ("aaaa", "a*", [0, 0, 0, 0]),
+        ("baab", "b a* b b*", [0, 1, 1, 2]),
+        ("aabb", "a* b* a*", [0, 0, 1, 1]),
+        ("aaab", "a+ a+ a b", [0, 1, 2, 3]),
+        ("aaab", "a+ a+ a+ b", [0, 1, 2, 3]),
+        ("aaab", "a+ a a b", [0, 1, 2, 3]),
+        ("aaab", "a+ a a", [0, 1, 2]),
+        ("aaab", "a+ a a?", [0, 1, 2]),
+        ("aaaa", "a a a a a?", [0, 1, 2, 3]),
+        ("aaab", "a+ a b", [0, 0, 1, 2]),
+        ("aaab", "a+ a+ b", [0, 0, 1, 2]),
+    ]
+    for string, pattern_str, result in cases:
+        matcher = Matcher(en_vocab)
+        doc = Doc(matcher.vocab, words=list(string))
+        pattern = []
+        for part in pattern_str.split():
+            if part.endswith("+"):
+                pattern.append({"ORTH": part[0], "OP": "+"})
+            elif part.endswith("*"):
+                pattern.append({"ORTH": part[0], "OP": "*"})
+            elif part.endswith("?"):
+                pattern.append({"ORTH": part[0], "OP": "?"})
+            else:
+                pattern.append({"ORTH": part})
+        matcher.add("PATTERN", [pattern], greedy="LONGEST")
+        matches = matcher(doc, with_alignments=True)
+        n_matches = len(matches)
+
+        _, s, e, expected = matches[0]
+
+        assert expected == result, (string, pattern_str, s, e, n_matches)
+
+
+def test_matcher_with_alignments_nongreedy(en_vocab):
+    cases = [
+        (0, "aaab", "a* b", [[0, 1], [0, 0, 1], [0, 0, 0, 1], [1]]),
+        (1, "baab", "b a* b", [[0, 1, 1, 2]]),
+        (2, "aaab", "a a a b", [[0, 1, 2, 3]]),
+        (3, "aaab", "a+ b", [[0, 1], [0, 0, 1], [0, 0, 0, 1]]),
+        (4, "aaba", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2]]),
+        (5, "aabaa", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2], [0, 0, 1, 2, 2], [0, 1, 2, 2]]),
+        (6, "aaba", "a+ b a*", [[0, 1], [0, 0, 1], [0, 0, 1, 2], [0, 1, 2]]),
+        (7, "aaaa", "a*", [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0]]),
+        (8, "baab", "b a* b b*", [[0, 1, 1, 2]]),
+        (9, "aabb", "a* b* a*", [[1], [2], [2, 2], [0, 1], [0, 0, 1], [0, 0, 1, 1], [0, 1, 1], [1, 1]]),
+        (10, "aaab", "a+ a+ a b", [[0, 1, 2, 3]]),
+        (11, "aaab", "a+ a+ a+ b", [[0, 1, 2, 3]]),
+        (12, "aaab", "a+ a a b", [[0, 1, 2, 3]]),
+        (13, "aaab", "a+ a a", [[0, 1, 2]]),
+        (14, "aaab", "a+ a a?", [[0, 1], [0, 1, 2]]),
+        (15, "aaaa", "a a a a a?", [[0, 1, 2, 3]]),
+        (16, "aaab", "a+ a b", [[0, 1, 2], [0, 0, 1, 2]]),
+        (17, "aaab", "a+ a+ b", [[0, 1, 2], [0, 0, 1, 2]]),
+    ]
+    for case_id, string, pattern_str, results in cases:
+        matcher = Matcher(en_vocab)
+        doc = Doc(matcher.vocab, words=list(string))
+        pattern = []
+        for part in pattern_str.split():
+            if part.endswith("+"):
+                pattern.append({"ORTH": part[0], "OP": "+"})
+            elif part.endswith("*"):
+                pattern.append({"ORTH": part[0], "OP": "*"})
+            elif part.endswith("?"):
+                pattern.append({"ORTH": part[0], "OP": "?"})
+            else:
+                pattern.append({"ORTH": part})
+
+        matcher.add("PATTERN", [pattern])
+        matches = matcher(doc, with_alignments=True)
+        n_matches = len(matches)
+
+        for _, s, e, expected in matches:
+            assert expected in results, (case_id, string, pattern_str, s, e, n_matches)
+            assert len(expected) == e - s
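The non-greedy expectations can be reproduced directly; a minimal sketch mirroring case 3 above (assuming a blank English pipeline):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN", [[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]])

doc = nlp("a a a b")
# Without greedy filtering, every candidate match is returned,
# each with its own alignment vector.
for match_id, start, end, alignments in matcher(doc, with_alignments=True):
    assert len(alignments) == end - start
    print(doc[start:end].text, alignments)
# Prints (order aside): "a b" [0, 1] / "a a b" [0, 0, 1] / "a a a b" [0, 0, 0, 1]
```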
@@ -120,13 +120,14 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
 > matches = matcher(doc)
 > ```

 | Name                                           | Description |
-| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `doclike`                                      | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
 | _keyword-only_                                 | |
 | `as_spans` <Tag variant="new">3</Tag>          | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
 | `allow_missing` <Tag variant="new">3</Tag>     | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
-| **RETURNS**                                    | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
+| `with_alignments` <Tag variant="new">3.1</Tag> | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the corresponding index of the token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ |
+| **RETURNS**                                    | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |

 ## Matcher.\_\_len\_\_ {#len tag="method" new="2"}
