mirror of https://github.com/explosion/spaCy.git
title | teaser | tag | source |
---|---|---|---|
Tokenizer | Segment text into words, punctuation marks etc. | class | spacy/tokenizer.pyx |
Segment text, and create Doc objects with the discovered segment boundaries.
Tokenizer.__init__
Create a Tokenizer that produces Doc objects from unicode text.
Example
    # Construction 1 (assumes an existing nlp object, e.g. nlp = English())
    from spacy.tokenizer import Tokenizer
    tokenizer = Tokenizer(nlp.vocab)

    # Construction 2
    from spacy.lang.en import English
    tokenizer = English().Defaults.create_tokenizer(nlp)
Name | Type | Description |
---|---|---|
vocab | Vocab | A storage container for lexical types. |
rules | dict | Exceptions and special-cases for the tokenizer. |
prefix_search | callable | A function matching the signature of re.compile(string).search to match prefixes. |
suffix_search | callable | A function matching the signature of re.compile(string).search to match suffixes. |
infix_finditer | callable | A function matching the signature of re.compile(string).finditer to find infixes. |
token_match | callable | A boolean function matching strings to be recognized as tokens. |
RETURNS | Tokenizer | The newly constructed object. |
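The callable arguments can be built from compiled regular expressions. The following is a minimal sketch using only the standard library; the patterns are hypothetical and chosen for illustration, they are not spaCy's defaults.

```python
import re

# Hypothetical patterns for illustration only (not spaCy's built-in rules).
prefix_re = re.compile(r"^[(\[]")       # opening bracket at the start
suffix_re = re.compile(r"[)\].,!?]$")   # closing bracket or punctuation at the end
infix_re = re.compile(r"[-~]")          # hyphen or tilde inside a string
url_re = re.compile(r"^https?://\S+$")  # strings to keep as a single token

prefix_search = prefix_re.search    # str -> match object or None
suffix_search = suffix_re.search    # str -> match object or None
infix_finditer = infix_re.finditer  # str -> iterator of match objects
token_match = url_re.match          # str -> truthy if the string is one token

# With spaCy installed, these would be passed by keyword, e.g.
# Tokenizer(nlp.vocab, prefix_search=prefix_search, suffix_search=suffix_search,
#           infix_finditer=infix_finditer, token_match=token_match)
```

Passing the bound methods of compiled patterns keeps the signatures identical to those named in the table.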
Tokenizer.__call__
Tokenize a string.
Example
    tokens = tokenizer(u"This is a sentence")
    assert len(tokens) == 4
Name | Type | Description |
---|---|---|
string | unicode | The string to tokenize. |
RETURNS | Doc | A container for linguistic annotations. |
Tokenizer.pipe
Tokenize a stream of texts.
Example
    texts = [u"One document.", u"...", u"Lots of documents"]
    for doc in tokenizer.pipe(texts, batch_size=50):
        pass
Name | Type | Description |
---|---|---|
texts | - | A sequence of unicode texts. |
batch_size | int | The number of texts to accumulate in an internal buffer. |
YIELDS | Doc | A sequence of Doc objects, in order. |
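The buffering described by batch_size can be pictured with a plain generator. This is a sketch only, not spaCy's implementation: pipe_sketch is a hypothetical helper and str.split stands in for the tokenizer.

```python
from itertools import islice

def pipe_sketch(texts, tokenize, batch_size=50):
    """Yield tokenize(text) for each text, buffering batch_size texts at a time."""
    iterator = iter(texts)
    while True:
        # Accumulate up to batch_size texts in a buffer.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # Results are yielded in the same order as the input.
        for text in batch:
            yield tokenize(text)

# str.split is a stand-in for a real tokenizer.
docs = list(pipe_sketch(["One document.", "...", "Lots of documents"],
                        str.split, batch_size=2))
```

Because the function is a generator, nothing is processed until the caller starts iterating.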
Tokenizer.find_infix
Find internal split points of the string.
Name | Type | Description |
---|---|---|
string | unicode | The string to split. |
RETURNS | list | A list of re.MatchObject objects that have .start() and .end() methods, denoting the placement of internal segment separators, e.g. hyphens. |
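The .start() and .end() offsets of each match are enough to cut a string at its infixes. A minimal sketch, assuming a hypothetical infix pattern that matches hyphens and tildes; split_on_infixes is an illustrative helper, not part of the Tokenizer API.

```python
import re

# Hypothetical infix pattern for illustration only.
infix_finditer = re.compile(r"[-~]").finditer

def split_on_infixes(string):
    """Split a string into the text between infixes and the infixes themselves."""
    pieces = []
    start = 0
    for match in infix_finditer(string):
        if match.start() > start:
            pieces.append(string[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(string):
        pieces.append(string[start:])
    return pieces

pieces = split_on_infixes("well-known")
```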
Tokenizer.find_prefix
Find the length of a prefix that should be segmented from the string, or None if no prefix rules match.
Name | Type | Description |
---|---|---|
string | unicode | The string to segment. |
RETURNS | int / None | The length of the prefix if present, otherwise None. |
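The return value described here can be derived from a match object. A minimal sketch, assuming a hypothetical prefix pattern; find_prefix_sketch is an illustrative stand-in and may differ from the library's actual method.

```python
import re

# Hypothetical prefix pattern: an opening bracket at the start of the string.
prefix_search = re.compile(r"^[(\[]").search

def find_prefix_sketch(string):
    """Return the length of the matched prefix, or None if nothing matches."""
    match = prefix_search(string)
    if match is None:
        return None
    return match.end() - match.start()
```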
Tokenizer.find_suffix
Find the length of a suffix that should be segmented from the string, or None if no suffix rules match.
Name | Type | Description |
---|---|---|
string | unicode | The string to segment. |
RETURNS | int / None | The length of the suffix if present, otherwise None. |
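A suffix length is typically used to slice the string from its end. The sketch below assumes a hypothetical suffix pattern, and find_suffix_sketch is an illustrative stand-in rather than the library's method.

```python
import re

# Hypothetical suffix pattern: closing bracket or punctuation at the end.
suffix_search = re.compile(r"[)\].,!?]$").search

def find_suffix_sketch(string):
    """Return the length of the matched suffix, or None if nothing matches."""
    match = suffix_search(string)
    if match is None:
        return None
    return match.end() - match.start()

word = "hello!"
length = find_suffix_sketch(word)
# Slice the suffix off the end only when a length was returned.
stem, suffix = (word[:-length], word[-length:]) if length else (word, "")
```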
Tokenizer.add_special_case
Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on adding languages for more details and examples.
Example
    from spacy.attrs import ORTH, LEMMA
    case = [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
    tokenizer.add_special_case("don't", case)
Name | Type | Description |
---|---|---|
string | unicode | The string to specially tokenize. |
token_attrs | iterable | A sequence of dicts, where each dict describes a token and its attributes. The ORTH fields of the attributes must exactly match the string when they are concatenated. |
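The ORTH constraint can be checked before a rule is registered. A minimal sketch with plain string keys standing in for the spacy.attrs symbols; check_special_case is a hypothetical helper, not part of spaCy.

```python
def check_special_case(string, token_attrs, orth_key="ORTH"):
    """Return True if the concatenated ORTH values reproduce the string exactly."""
    joined = "".join(attrs[orth_key] for attrs in token_attrs)
    return joined == string

# String keys are stand-ins for the ORTH and LEMMA attribute IDs.
case = [{"ORTH": "do"}, {"ORTH": "n't", "LEMMA": "not"}]
ok = check_special_case("don't", case)
bad = check_special_case("dont", case)
```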
Attributes
Name | Type | Description |
---|---|---|
vocab | Vocab | The vocab object of the parent Doc. |
prefix_search | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or None. |
suffix_search | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or None. |
infix_finditer | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of re.MatchObject objects. |