From 223bde5cf66f13025171800185ac67a9fbd2df64 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Tue, 6 Aug 2019 12:13:42 +0200
Subject: [PATCH] Improve docs on matcher attributes [ci skip] (closes #4063)

---
 website/docs/usage/rule-based-matching.md | 34 +++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index e15f3f0a4..3801f7b7a 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -153,8 +153,8 @@ processes.
 
 #### Available token attributes {#adding-patterns-attributes}
 
-The available token pattern keys are uppercase versions of the
-[`Token` attributes](/api/token#attributes). The most relevant ones for
+The available token pattern keys correspond to a number of
+[`Token` attributes](/api/token#attributes). The supported attributes for
 rule-based matching are:
 
 | Attribute | Type |  Description |
@@ -171,6 +171,36 @@ rule-based matching are:
 | `ENT_TYPE` | unicode | The token's entity label. |
 | `_` 2.1 | dict | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). |
+
+
+No, it shouldn't. spaCy will normalize the names internally and
+`{"LOWER": "text"}` and `{"lower": "text"}` will both produce the same result.
+Using the uppercase version is mostly a convention to make it clear that the
+attributes are "special" and don't exactly map to the token attributes like
+`Token.lower` and `Token.lower_`.
+
+
+
+
+
+spaCy can't provide access to all of the attributes because the `Matcher` loops
+over the Cython data, not the Python objects. Inside the matcher, we're dealing
+with a [`TokenC` struct](/api/cython-structs#tokenc) – we don't have an instance
+of [`Token`](/api/token). This means that all of the attributes that refer to
+computed properties can't be accessed.
+
+The uppercase attribute names like `LOWER` or `IS_PUNCT` refer to symbols from
+the
+[`spacy.attrs`](https://github.com/explosion/spaCy/tree/master/spacy/attrs.pyx)
+enum table. They're passed into a function that essentially is a big case/switch
+statement, to figure out which struct field to return. The same attribute
+identifiers are used in [`Doc.to_array`](/api/doc#to_array), and a few other
+places in the code where you need to describe fields like this.
+
+
+
+---
+
 [![Matcher demo](../images/matcher-demo.jpg)](https://explosion.ai/demos/matcher)
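
Note on the mechanism described in the added text: the patch says that pattern keys are normalized internally, and that the resulting attribute IDs are passed into what is essentially a big case/switch statement over a struct. The following is a minimal pure-Python sketch of that idea, not spaCy's actual implementation (which is written in Cython and operates on `TokenC`). The names `Attr`, `TokenStruct`, `normalize_key`, `get_field` and `token_matches` are hypothetical and exist only for this illustration, and the enum values are arbitrary.

```python
from dataclasses import dataclass
from enum import IntEnum


class Attr(IntEnum):
    # Hypothetical stand-in for the symbols in spacy.attrs.
    # The numeric values are arbitrary and do not match spaCy's.
    ORTH = 1
    LOWER = 2
    IS_PUNCT = 3


@dataclass
class TokenStruct:
    # Hypothetical stand-in for the TokenC struct: stored fields only,
    # no computed properties.
    orth: str
    lower: str
    is_punct: bool


def normalize_key(key: str) -> Attr:
    """Map a pattern key to an attribute ID, ignoring case."""
    return Attr[key.upper()]


def get_field(token: TokenStruct, attr: Attr):
    """Return the struct field selected by the attribute ID (the 'switch')."""
    if attr == Attr.ORTH:
        return token.orth
    elif attr == Attr.LOWER:
        return token.lower
    elif attr == Attr.IS_PUNCT:
        return token.is_punct
    raise ValueError(f"Unsupported attribute: {attr!r}")


def token_matches(token: TokenStruct, spec: dict) -> bool:
    """Check one token against one pattern dict such as {"LOWER": "hello"}."""
    return all(get_field(token, normalize_key(k)) == v for k, v in spec.items())


token = TokenStruct(orth="Hello", lower="hello", is_punct=False)
print(token_matches(token, {"LOWER": "hello"}))  # True
print(token_matches(token, {"lower": "hello"}))  # True
```

Because the key is upper-cased before lookup, both spellings resolve to the same attribute ID. A key with no entry in the enum raises a `KeyError` in this sketch, which loosely mirrors the point that only attributes backed by a struct field can be matched on.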