During processing, spaCy first **tokenizes** the text, i.e. segments it into
words, punctuation and so on. This is done by applying rules specific to each
language. For example, punctuation at the end of a sentence should be split off
whereas "U.K." should remain one token. Each `Doc` consists of individual
tokens, and we can iterate over them:
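
```python {executable="true"}
import spacy

# Load the small English pipeline and process a sample sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Iterating over a Doc yields its tokens in order
for token in doc:
    print(token.text)
```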

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| :---: | :-: | :-----: | :-: | :----: | :--: | :-----: | :-: | :-: | :-: | :-----: |
| Apple | is | looking | at | buying | U.K. | startup | for | \$ | 1 | billion |

First, the raw text is split on whitespace characters, similar to
`text.split(' ')`. Then, the tokenizer processes the text from left to right. On
each substring, it performs two checks:

1. **Does the substring match a tokenizer exception rule?** For example, "don't"
   does not contain whitespace, but should be split into two tokens, "do" and
   "n't", while "U.K." should always remain one token.
2. **Can a prefix, suffix or infix be split off?** For example, punctuation like
   commas, periods, hyphens or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop,
starting with the newly split substrings. This way, spaCy can split **complex,
nested tokens** like combinations of abbreviations and multiple punctuation
marks.
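
The loop described above can be sketched in plain Python. This is a simplified
illustration, not spaCy's actual implementation: the `EXCEPTIONS` table, the
three regular expressions and the `tokenize` function are all hypothetical and
cover only a handful of characters, whereas spaCy's real rules are far more
extensive and language-specific.

```python
import re

# Hypothetical, heavily reduced rule tables (assumptions, not spaCy's data).
EXCEPTIONS = {
    "don't": ["do", "n't"],
    "Let's": ["Let", "'s"],
    "U.K.": ["U.K."],
    "N.Y.": ["N.Y."],
}
PREFIX_RE = re.compile(r'^[$("“¿]')
SUFFIX_RE = re.compile(r'[)"”!.,?]$')
INFIX_RE = re.compile(r'--|[-/…]')


def tokenize(text):
    tokens = []
    # Step 1: split the raw text on whitespace.
    for substring in text.split():
        suffixes = []  # suffixes are collected outermost-first
        while substring:
            # Check 1: does the substring match an exception rule?
            if substring in EXCEPTIONS:
                tokens.extend(EXCEPTIONS[substring])
                substring = ""
                break
            # Check 2: can a prefix or suffix be split off?
            prefix = PREFIX_RE.search(substring)
            if prefix:
                tokens.append(prefix.group())
                substring = substring[prefix.end():]
                continue
            suffix = SUFFIX_RE.search(substring)
            if suffix:
                suffixes.append(suffix.group())
                substring = substring[:suffix.start()]
                continue
            break
        # Whatever remains is split on infixes, keeping each infix as a token.
        start = 0
        for match in INFIX_RE.finditer(substring):
            if match.start() > start:
                tokens.append(substring[start:match.start()])
            tokens.append(match.group())
            start = match.end()
        if start < len(substring):
            tokens.append(substring[start:])
        # Re-attach the collected suffixes in their original order.
        tokens.extend(reversed(suffixes))
    return tokens


print(tokenize('"Let\'s go to N.Y.!"'))
# ['"', 'Let', "'s", 'go', 'to', 'N.Y.', '!', '"']
```

Note how `N.Y.!"` only matches the exception rule after the two trailing
punctuation marks have been split off. Re-checking the remaining substring
after every split is what lets nested tokens resolve correctly.
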

> - **Tokenizer exception:** Special-case rule to split a string into several
>   tokens or prevent a token from being split when punctuation rules are
>   applied.
> - **Prefix:** Character(s) at the beginning, e.g. `$`, `(`, `“`, `¿`.
> - **Suffix:** Character(s) at the end, e.g. `km`, `)`, `”`, `!`.
> - **Infix:** Character(s) in between, e.g. `-`, `--`, `/`, `…`.

![Example of the tokenization process](/images/tokenization.svg)

While punctuation rules are usually pretty general, tokenizer exceptions
strongly depend on the specifics of the individual language. This is why each
[available language](/usage/models#languages) has its own subclass, like
`English` or `German`, that loads in lists of hard-coded data and exception
rules.