mirror of https://github.com/explosion/spaCy.git
Improve docs on matcher attributes [ci skip] (closes #4063)
parent 2bfae0b167
commit 223bde5cf6
|
@@ -153,8 +153,8 @@ processes.
|
||||||
|
|
||||||
#### Available token attributes {#adding-patterns-attributes}
|
#### Available token attributes {#adding-patterns-attributes}
|
||||||
|
|
||||||
The available token pattern keys are uppercase versions of the
|
The available token pattern keys correspond to a number of
|
||||||
[`Token` attributes](/api/token#attributes). The most relevant ones for
|
[`Token` attributes](/api/token#attributes). The supported attributes for
|
||||||
rule-based matching are:
|
rule-based matching are:
|
||||||
|
|
||||||
| Attribute | Type | Description |
|
| Attribute | Type | Description |
|
||||||
|
@@ -171,6 +171,36 @@ rule-based matching are:
|
||||||
| `ENT_TYPE` | unicode | The token's entity label. |
|
| `ENT_TYPE` | unicode | The token's entity label. |
|
||||||
| `_` <Tag variant="new">2.1</Tag> | dict | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). |
|
| `_` <Tag variant="new">2.1</Tag> | dict | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). |
|
||||||
|
|
||||||
|
<Accordion title="Does it matter if the attribute names are uppercase or lowercase?">
|
||||||
|
|
||||||
|
No, it shouldn't. spaCy will normalize the names internally and
`{"LOWER": "text"}` and `{"lower": "text"}` will both produce the same result.
Using the uppercase version is mostly a convention to make it clear that the
attributes are "special" and don't exactly map to the token attributes like
`Token.lower` and `Token.lower_`.
|
||||||
|
|
||||||
|
</Accordion>
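
To make the normalization described above concrete, here is a minimal sketch in plain Python. It is not spaCy's actual implementation: `normalize_pattern` is a hypothetical helper, and the assumption is simply that pattern keys are uppercased before they are looked up.

```python
def normalize_pattern(pattern):
    """Return a copy of a token pattern with every attribute name uppercased.

    Hypothetical helper for illustration only; spaCy's internal
    normalization may differ in its details.
    """
    return [
        {key.upper(): value for key, value in token_spec.items()}
        for token_spec in pattern
    ]


# The same pattern written with uppercase and lowercase attribute names.
upper_pattern = normalize_pattern([{"LOWER": "hello"}, {"IS_PUNCT": True}])
lower_pattern = normalize_pattern([{"lower": "hello"}, {"is_punct": True}])

# After normalization, both spellings describe the same pattern.
print(upper_pattern == lower_pattern)
```

Under this assumption, the spelling of the key only matters for readability, which is why the uppercase form can remain a pure convention.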
|
||||||
|
|
||||||
|
<Accordion title="Why are not all token attributes supported?">
|
||||||
|
|
||||||
|
spaCy can't provide access to all of the attributes because the `Matcher` loops
over the Cython data, not the Python objects. Inside the matcher, we're dealing
with a [`TokenC` struct](/api/cython-structs#tokenc) – we don't have an instance
of [`Token`](/api/token). This means that all of the attributes that refer to
computed properties can't be accessed.
|
||||||
|
|
||||||
|
The uppercase attribute names like `LOWER` or `IS_PUNCT` refer to symbols from
the
[`spacy.attrs`](https://github.com/explosion/spaCy/tree/master/spacy/attrs.pyx)
enum table. They're passed into a function that essentially is a big case/switch
statement, to figure out which struct field to return. The same attribute
identifiers are used in [`Doc.to_array`](/api/doc#to_array), and a few other
places in the code where you need to describe fields like this.
|
||||||
|
|
||||||
|
</Accordion>
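
The lookup described above can be pictured with a small plain-Python sketch. Everything in it is hypothetical: `Attr`, `TokenRecord`, `get_token_attr` and this `to_array` are stand-ins for the enum table, the `TokenC` struct and the case/switch function, with made-up IDs and values. It only illustrates why an attribute must exist as a stored field to be reachable.

```python
from dataclasses import dataclass
from enum import IntEnum


class Attr(IntEnum):
    # Hypothetical attribute IDs; the real values live in spacy.attrs.
    LOWER = 1
    IS_PUNCT = 2
    ENT_TYPE = 3


@dataclass
class TokenRecord:
    # Stand-in for a struct: only stored fields, no computed properties.
    lower: int       # ID of the lowercase form (made-up value)
    is_punct: bool
    ent_type: int    # ID of the entity label (made-up value)


def get_token_attr(token, attr_id):
    """Pick the stored field for an attribute ID, like a case/switch."""
    if attr_id == Attr.LOWER:
        return token.lower
    elif attr_id == Attr.IS_PUNCT:
        return token.is_punct
    elif attr_id == Attr.ENT_TYPE:
        return token.ent_type
    # Anything without a stored field cannot be looked up this way.
    raise ValueError(f"Unsupported attribute ID: {attr_id}")


def to_array(tokens, attr_ids):
    """Export the requested fields as one row of integers per token."""
    return [[int(get_token_attr(t, a)) for a in attr_ids] for t in tokens]


tokens = [
    TokenRecord(lower=101, is_punct=False, ent_type=0),
    TokenRecord(lower=102, is_punct=True, ent_type=7),
]
rows = to_array(tokens, [Attr.LOWER, Attr.IS_PUNCT])
print(rows)
```

In this sketch the same IDs drive both the single-field lookup and the array export, mirroring how the text says the identifiers are shared with `Doc.to_array`.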
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
<Infobox title="Tip: Try the interactive matcher explorer">
|
<Infobox title="Tip: Try the interactive matcher explorer">
|
||||||
|
|
||||||
[![Matcher demo](../images/matcher-demo.jpg)](https://explosion.ai/demos/matcher)
|
[![Matcher demo](../images/matcher-demo.jpg)](https://explosion.ai/demos/matcher)
|
||||||
|
|