mirror of https://github.com/explosion/spaCy.git
Improve docs on phrase pattern attributes (closes #4100) [ci skip]
This commit is contained in:
parent
1f4d8bf77e
commit
1362f793cf
@@ -788,11 +788,11 @@ token pattern covering the exact tokenization of the term.
To create the patterns, each phrase has to be processed with the `nlp` object.
If you have a model loaded, doing this in a loop or list comprehension can easily
become inefficient and slow. If you **only need the tokenization and lexical
attributes**, you can run [`nlp.make_doc`](/api/language#make_doc) instead,
which will only run the tokenizer. For an additional speed boost, you can also
use the [`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will process
the texts as a stream.
```diff
- patterns = [nlp(term) for term in LOTS_OF_TERMS]
```
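A minimal runnable sketch of the difference, under stated assumptions: a hypothetical term list, a blank English pipeline instead of a full model (so nothing needs to be downloaded), and the newer `matcher.add(key, docs)` signature (older spaCy versions used `matcher.add(key, None, *docs)`):

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank pipeline so the sketch runs without a model download; with a full
# model, nlp(term) would also run the tagger, parser, etc. for every term.
nlp = spacy.blank("en")
LOTS_OF_TERMS = ["machine learning", "natural language processing"]  # hypothetical

# Tokenizer-only: nlp.make_doc skips all other pipeline components
patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]

# Equivalent, but streams the texts through the tokenizer for extra speed
patterns = list(nlp.tokenizer.pipe(LOTS_OF_TERMS))

matcher = PhraseMatcher(nlp.vocab)
matcher.add("TERMS", patterns)

doc = nlp("I love machine learning and natural language processing.")
matches = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matches)
```

Both lists of patterns are plain `Doc` objects, so they can be passed straight to `PhraseMatcher.add`; the `tokenizer.pipe` variant avoids the per-call overhead when the term list is large.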
@@ -825,6 +825,20 @@ for match_id, start, end in matcher(doc):

    print("Matched based on lowercase token text:", doc[start:end])
```
<Infobox title="Important note on creating patterns" variant="warning">
The examples here use [`nlp.make_doc`](/api/language#make_doc) to create `Doc`
object patterns as efficiently as possible and without running any of the other
pipeline components. If the token attributes you want to match on are set by a
pipeline component, **make sure that the pipeline component runs** when you
create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
objects need to have part-of-speech tags set by the `tagger`. You can either
call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use
[`nlp.disable_pipes`](/api/language#disable_pipes) to disable components
selectively.
</Infobox>
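As a runnable sketch of why the attribute matters, the snippet below matches on `LEMMA` with a blank pipeline. Since this avoids a model download, it sets lemmas by hand as a hypothetical stand-in for the tagger/lemmatizer; in practice you would create the pattern by calling the `nlp` object, as described above:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# Match on the LEMMA attribute instead of the default ORTH (verbatim text)
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")

# Hypothetical stand-in for the lemmatizer: set lemmas by hand. With a real
# model you would build the pattern with nlp("swam fast") so the pipeline
# sets these attributes for you.
pattern = nlp.make_doc("swam fast")
for token, lemma in zip(pattern, ["swim", "fast"]):
    token.lemma_ = lemma

matcher.add("SWIMMING", [pattern])  # older versions: matcher.add("SWIMMING", None, pattern)

doc = nlp.make_doc("He swims fast.")
for token, lemma in zip(doc, ["he", "swim", "fast", "."]):
    token.lemma_ = lemma

# "swims fast" matches "swam fast" because both share the lemmas swim + fast
matches = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matches)
```

If the pattern `Doc` had no lemmas set at all, `PhraseMatcher.add` would complain that the required annotation is missing, which is exactly the pitfall the note above warns about.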
Another possible use case is matching number tokens like IP addresses based on
their shape. This means that you won't have to worry about how those strings will
be tokenized and you'll be able to find tokens and combinations of tokens based