From 3134a9b6e072adbdccc138c3ad7c6bcd0b17d364 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Wed, 21 Aug 2019 12:53:31 +0200
Subject: [PATCH] Add section on expanding regex match to token boundaries (see #4158) [ci skip]

---
 website/docs/usage/rule-based-matching.md | 48 +++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index 109e2279b..fbe8bdc67 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -304,6 +304,54 @@ for match in re.finditer(expression, doc.text):
         print("Found match:", span.text)
 ```
 
+
+
+In some cases, you might want to expand the match to the closest token
+boundaries, so you can create a `Span` for `"USA"`, even though only the
+substring `"US"` is matched. You can calculate this using the character offsets
+of the tokens in the document, available as
+[`Token.idx`](/api/token#attributes). This lets you create a list of valid token
+start and end boundaries and leaves you with a rather basic algorithmic problem:
+Given a number, find the next lowest (start token) or the next highest (end
+token) number that's part of a given list of numbers. This will be the closest
+valid token boundary.
+
+There are many ways to do this and the most straightforward one is to create a
+dict keyed by character offsets in the `Doc`, mapped to the index of the token
+they're part of. It's easy to write and less error-prone, and gives you constant
+lookup time: you only ever need to create the dict once per `Doc`.
+
+```python
+chars_to_tokens = {}
+for token in doc:
+    for i in range(token.idx, token.idx + len(token.text)):
+        chars_to_tokens[i] = token.i
+```
+
+You can then look up the character at a given position, and get the index of
+the corresponding token that the character is part of. Your span would then be
+`doc[start_token:end_token + 1]`. If a character isn't in the dict, it means
+it's the (white)space that tokens are split on. That hopefully shouldn't happen,
+though, because it'd mean your regex is producing matches with leading or
+trailing whitespace.
+
+```python
+### {highlight="5-8"}
+span = doc.char_span(start, end)
+if span is not None:
+    print("Found match:", span.text)
+else:
+    start_token = chars_to_tokens.get(start)
+    end_token = chars_to_tokens.get(end - 1)  # end offset is exclusive
+    if start_token is not None and end_token is not None:
+        span = doc[start_token:end_token + 1]
+        print("Found closest match:", span.text)
+```
+
+
+
+---
+
 #### Operators and quantifiers {#quantifiers}
 
 The matcher also lets you use quantifiers, specified as the `'OP'` key.
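
The boundary problem described in the added section (given a character offset, find the closest valid token start or end) can also be approached with a binary search over sorted token offsets instead of a per-character dict. The following is a minimal sketch under stated assumptions, not the approach the patch documents: it uses only the standard library, a plain whitespace split stands in for spaCy's tokenizer, and `token_offsets` and `expand_to_token_boundaries` are hypothetical helper names.

```python
import re
from bisect import bisect_left, bisect_right


def token_offsets(text):
    """Return sorted (start, end) character offsets of tokens.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def expand_to_token_boundaries(offsets, start, end):
    """Snap the character range [start, end) outwards to whole tokens.

    Returns (first_token, last_token_exclusive), or None if the range
    does not overlap any token (e.g. it covers only whitespace).
    """
    starts = [s for s, _ in offsets]
    ends = [e for _, e in offsets]
    # Index of the first token that ends after `start`.
    first = bisect_right(ends, start)
    # One past the last token that starts before `end`.
    last = bisect_left(starts, end)
    if first >= last:
        return None
    return first, last


text = "The USA and the UK"
offsets = token_offsets(text)
match = re.search(r"US", text)  # matches only part of "USA"
first, last = expand_to_token_boundaries(offsets, *match.span())
expanded = text[offsets[first][0]:offsets[last - 1][1]]
print(expanded)  # USA
```

With spaCy, the same offsets could presumably be derived from `token.idx` and `len(token.text)`, and the returned indices used to slice the `Doc`. The dict approach trades memory for simpler code, while the binary search keeps only two lists per document.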