mirror of https://github.com/explosion/spaCy.git
Auto-format [ci skip]
parent 62e00fd9da
commit e01a1a237f
@@ -782,8 +782,8 @@ The algorithm can be summarized as follows:
 1. Iterate over whitespace-separated substrings.
 2. Check whether we have an explicitly defined rule for this substring. If we
 do, use it.
-3. Otherwise, try to consume one prefix. If we consumed a prefix, go back to
-#2, so that special cases always get priority.
+3. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
+so that special cases always get priority.
 4. If we didn't consume a prefix, try to consume a suffix and then go back to
 #2.
 5. If we can't consume a prefix or a suffix, look for a special case.
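The steps listed in this hunk can be sketched in plain Python. This is a minimal illustration under stated assumptions, not spaCy's implementation: `tokenize`, `SPECIAL_CASES`, `PREFIX_SEARCH` and `SUFFIX_SEARCH` are hypothetical names, the two regular expressions are deliberately tiny, and infix handling is left out.

```python
import re

# Hypothetical rule set, far smaller than any real language's defaults.
SPECIAL_CASES = {"don't": ["do", "n't"]}
PREFIX_SEARCH = re.compile(r'''^[\[\("']''').search
SUFFIX_SEARCH = re.compile(r'''[\]\)"'.,!?]$''').search


def tokenize(text, special_cases, prefix_search, suffix_search):
    tokens = []
    for substring in text.split():  # step 1: whitespace-separated substrings
        suffixes = []  # collected right-to-left, emitted in reverse
        while substring:
            if substring in special_cases:  # step 2: explicit rule wins
                tokens.extend(special_cases[substring])
                substring = ""
                break
            match = prefix_search(substring)  # step 3: consume one prefix
            if match:
                tokens.append(match.group())
                substring = substring[match.end():]
                continue  # back to step 2
            match = suffix_search(substring)  # step 4: consume one suffix
            if match:
                suffixes.append(match.group())
                substring = substring[:match.start()]
                continue  # back to step 2
            # step 5: nothing left to strip; the remainder was already
            # checked against the special cases above, so emit it whole.
            tokens.append(substring)
            substring = ""
        tokens.extend(reversed(suffixes))
    return tokens


print(tokenize("(don't stop!)", SPECIAL_CASES, PREFIX_SEARCH, SUFFIX_SEARCH))
```

Because the special-case lookup runs again after every prefix or suffix is removed, `don't` is found as soon as the opening parenthesis has been stripped, which is the priority that step 3 describes.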
@@ -805,10 +805,10 @@ domain. There are five things you would need to define:
 commas, periods, close quotes, etc.
 4. A function `infixes_finditer`, to handle non-whitespace separators, such as
 hyphens etc.
-5. An optional boolean function `token_match` matching strings that should
-never be split, overriding the infix rules. Useful for things like URLs or
-numbers. Note that prefixes and suffixes will be split off before
-`token_match` is applied.
+5. An optional boolean function `token_match` matching strings that should never
+be split, overriding the infix rules. Useful for things like URLs or numbers.
+Note that prefixes and suffixes will be split off before `token_match` is
+applied.
 
 You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
 to use `re.compile()` to build a regular expression object, and pass its
@@ -858,8 +858,8 @@ only be applied at the **end of a token**, so your expression should end with a
 #### Modifying existing rule sets {#native-tokenizer-additions}
 
 In many situations, you don't necessarily need entirely custom rules. Sometimes
-you just want to add another character to the prefixes, suffixes or infixes.
-The default prefix, suffix and infix rules are available via the `nlp` object's
+you just want to add another character to the prefixes, suffixes or infixes. The
+default prefix, suffix and infix rules are available via the `nlp` object's
 `Defaults` and the `Tokenizer` attributes such as
 [`Tokenizer.suffix_search`](/api/tokenizer#attributes) are writable, so you can
 overwrite them with compiled regular expression objects using modified default
@@ -893,20 +893,19 @@ If you're using a statistical model, writing to the `nlp.Defaults` or
 `English.Defaults` directly won't work, since the regular expressions are read
 from the model and will be compiled when you load it. If you modify
 `nlp.Defaults`, you'll only see the effect if you call
-[`spacy.blank`](/api/top-level#spacy.blank) or `Defaults.create_tokenizer()`.
-If you want to modify the tokenizer loaded from a statistical model, you should
+[`spacy.blank`](/api/top-level#spacy.blank) or `Defaults.create_tokenizer()`. If
+you want to modify the tokenizer loaded from a statistical model, you should
 modify `nlp.tokenizer` directly.
 
 </Infobox>
 
 The prefix, infix and suffix rule sets include not only individual characters
 but also detailed regular expressions that take the surrounding context into
-account. For example, there is a regular expression that treats a hyphen
-between letters as an infix. If you do not want the tokenizer to split on
-hyphens between letters, you can modify the existing infix definition from
+account. For example, there is a regular expression that treats a hyphen between
+letters as an infix. If you do not want the tokenizer to split on hyphens
+between letters, you can modify the existing infix definition from
 [`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py):
 
-
 ```python
 ### {executable="true"}
 import spacy
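The effect of dropping the hyphen rule described in this hunk can be shown with the standard library alone. The sketch below is an assumption-laden illustration: the two patterns are hypothetical, heavily reduced stand-ins for the definitions in `lang/punctuation.py`, and `split_on_infixes` is a made-up helper rather than anything spaCy exposes.

```python
import re

# Hypothetical, reduced infix rules (the real list is much longer).
HYPHEN_BETWEEN_LETTERS = r"(?<=[a-zA-Z])-(?=[a-zA-Z])"
OPERATOR_BETWEEN_DIGITS = r"(?<=[0-9])[+*^](?=[0-9])"

# Default-like behaviour: hyphens between letters are infixes.
with_hyphen = re.compile(
    "|".join([HYPHEN_BETWEEN_LETTERS, OPERATOR_BETWEEN_DIGITS])
).finditer
# Modified behaviour: the hyphen rule is left out of the joined pattern.
without_hyphen = re.compile(OPERATOR_BETWEEN_DIGITS).finditer


def split_on_infixes(substring, infix_finditer):
    """Split a substring around every infix match, keeping the infixes."""
    pieces = []
    start = 0
    for match in infix_finditer(substring):
        if match.start() > start:
            pieces.append(substring[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(substring):
        pieces.append(substring[start:])
    return pieces


print(split_on_infixes("mother-in-law", with_hyphen))
print(split_on_infixes("mother-in-law", without_hyphen))
```

Only the hyphen rule was removed, so the context-sensitive operator rule still splits a string such as `2+3` in both variants.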
@@ -1074,10 +1073,10 @@ can sometimes tokenize things differently – for example, `"I'm"` →
 In situations like that, you often want to align the tokenization so that you
 can merge annotations from different sources together, or take vectors predicted
 by a
-[pretrained BERT model](https://github.com/huggingface/pytorch-transformers)
-and apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align)
-helper returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the
-number of misaligned tokens, the one-to-one mappings of token indices in both
+[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
+apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align) helper
+returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the number
+of misaligned tokens, the one-to-one mappings of token indices in both
 directions and the indices where multiple tokens align to one single token.
 
 > #### ✏️ Things to try
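The shape of that five-element tuple can be illustrated without spaCy. The following is a minimal sketch, not the `gold.align` implementation: `char_spans`, `contained_in` and `align_sketch` are hypothetical helpers, both tokenizations are assumed to cover exactly the same characters once whitespace is ignored, and the cost is assumed to be the number of tokens on either side that lack a one-to-one partner.

```python
def char_spans(tokens):
    """Character offsets of each token in the whitespace-free text."""
    spans, start = [], 0
    for token in tokens:
        spans.append((start, start + len(token)))
        start += len(token)
    return spans


def contained_in(inner, outer):
    """Map inner index -> outer index where the inner span is a proper part."""
    multi = {}
    for i, (start, end) in enumerate(inner):
        for j, (o_start, o_end) in enumerate(outer):
            if o_start <= start and end <= o_end and (start, end) != (o_start, o_end):
                multi[i] = j
    return multi


def align_sketch(tokens_a, tokens_b):
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("tokenizations must cover the same characters")
    spans_a, spans_b = char_spans(tokens_a), char_spans(tokens_b)
    index_a = {span: i for i, span in enumerate(spans_a)}
    index_b = {span: j for j, span in enumerate(spans_b)}
    # One-to-one mappings: identical spans, -1 where there is no partner.
    a2b = [index_b.get(span, -1) for span in spans_a]
    b2a = [index_a.get(span, -1) for span in spans_b]
    # Many-to-one mappings: several tokens on one side inside one token.
    a2b_multi = contained_in(spans_a, spans_b)
    b2a_multi = contained_in(spans_b, spans_a)
    cost = a2b.count(-1) + b2a.count(-1)  # assumed cost definition
    return cost, a2b, b2a, a2b_multi, b2a_multi


other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
print(align_sketch(other_tokens, spacy_tokens))
```

Here `'` and `s` on the first side both fall inside `'s` on the second, so they appear in `a2b_multi` and as `-1` in `a2b`. That is the information needed to merge two vectors or annotations onto a single token.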