mirror of https://github.com/explosion/spaCy.git
Add docs on adding to existing tokenizer rules [ci skip]
This commit is contained in:
parent 1ea1bc98e7
commit 403b9cd58b
@@ -812,6 +812,40 @@ only be applied at the **end of a token**, so your expression should end with a
</Infobox>

#### Adding to existing rule sets {#native-tokenizer-additions}

In many situations, you don't necessarily need entirely custom rules. Sometimes
you just want to add another character to the prefixes, suffixes or infixes. The
default prefix, suffix and infix rules are available via the `nlp` object's
`Defaults` and the [`Tokenizer.suffix_search`](/api/tokenizer#attributes)
attribute is writable, so you can overwrite it with a compiled regular
expression object using the modified default rules. spaCy ships with utility
functions to help you compile the regular expressions – for example,
[`compile_suffix_regex`](/api/top-level#util.compile_suffix_regex):

```python
suffixes = nlp.Defaults.suffixes + (r'''-+$''',)
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```
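To get a feel for what the compiled suffix regex does, here is a rough stdlib-only sketch: the individual suffix expressions are combined into one alternation that is matched at the end of the string. The toy rule set and the exact joining logic are assumptions for illustration, not spaCy's actual internals:

```python
import re

# Hypothetical stand-in for nlp.Defaults.suffixes plus the new hyphen rule.
# (The real defaults live in spacy/lang/punctuation.py.)
suffixes = (r"\.$", r",$", r"-+$")

# Rough approximation of the compiled result: the individual expressions
# OR-ed into a single pattern, each anchored at the end of the string.
suffix_regex = re.compile("|".join(suffixes))

print(suffix_regex.search("well-").group())  # → "-" (trailing hyphen rule)
print(suffix_regex.search("well"))           # → None (no suffix rule applies)
```

The tokenizer repeatedly calls `suffix_search` on the remaining string and splits off whatever the match covers, which is why each expression is anchored at the end.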

For an overview of the default regular expressions, see
[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py).
The `Tokenizer.suffix_search` attribute should be a function which takes a
unicode string and returns a **regex match object** or `None`. Usually we use
the `.search` attribute of a compiled regex object, but you can use some other
function that behaves the same way.

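Since only the calling contract matters, any function that accepts a string and returns a match object or `None` can be plugged in. A minimal sketch, with a made-up pattern and function name for illustration:

```python
import re

# Hypothetical rule: treat one or more trailing hyphens as a suffix.
_suffix_re = re.compile(r"-+$")

def custom_suffix_search(text):
    # Must take a unicode string and return a regex match object or None,
    # exactly like the .search method of a compiled pattern.
    return _suffix_re.search(text)

# The tokenizer would then use it via:
# nlp.tokenizer.suffix_search = custom_suffix_search
print(custom_suffix_search("well-"))  # a match object
print(custom_suffix_search("well"))   # None
```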
<Infobox title="Important note" variant="warning">

If you're using a statistical model, writing to the `nlp.Defaults` or
`English.Defaults` directly won't work, since the regular expressions are read
from the model and will be compiled when you load it. You'll only see the effect
if you call [`spacy.blank`](/api/top-level#spacy.blank) or
`Defaults.create_tokenizer()`.

</Infobox>

### Hooking an arbitrary tokenizer into the pipeline {#custom-tokenizer}

The tokenizer is the first component of the processing pipeline and the only one