mirror of https://github.com/explosion/spaCy.git
Add usage docs for aligning tokenization
parent f97a555445
commit 1ea472468a
@@ -963,6 +963,53 @@ Once you have a [`Doc`](/api/doc) object, you can write to its attributes to set
the part-of-speech tags, syntactic dependencies, named entities and other
attributes. For details, see the respective usage pages.

### Aligning tokenization {#aligning-tokenization}

spaCy's tokenization is non-destructive and uses language-specific rules
optimized for compatibility with treebank annotations. Other tools and resources
can sometimes tokenize things differently – for example, `"I'm"` → `["I", "am"]`
instead of `["I", "'m"]`, or `"Obama's"` → `["Obama", "'", "s"]` instead of
`["Obama", "'s"]`.

In cases like that, you often want to align the tokenization so that you can
merge annotations from different sources together, or take vectors predicted by
a [pre-trained BERT model](https://github.com/huggingface/pytorch-transformers)
and apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align)
helper returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the
number of misaligned tokens, the one-to-one mappings of token indices in both
directions and the indices where multiple tokens align to one single token.

```python
### {executable="true"}
from spacy.gold import align

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
print("Misaligned tokens:", cost)  # 2
print("One-to-one mappings a -> b", a2b)  # array([0, 1, 2, 3, -1, -1, 5, 6])
print("One-to-one mappings b -> a", b2a)  # array([0, 1, 2, 3, -1, 6, 7])
print("Many-to-one mappings a -> b", a2b_multi)  # {4: 4, 5: 4}
print("Many-to-one mappings b -> a", b2a_multi)  # {}
```

Here are some insights from the alignment information generated in the example
above:

- Two tokens are misaligned.
- The one-to-one mappings for the first four tokens are identical, which means
  they map to each other. This makes sense because they're also identical in the
  input: `"i"`, `"listened"`, `"to"` and `"obama"`.
- The index mapped to `a2b[6]` is `5`, which means that `other_tokens[6]`
  (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
- `a2b[4]` is `-1`, which means that there is no one-to-one alignment for the
  token at `other_tokens[4]`. The token `"'"` doesn't exist on its own in
  `spacy_tokens`. The same goes for `a2b[5]` and `other_tokens[5]`, i.e. `"s"`.
- The dictionary `a2b_multi` shows that both tokens 4 and 5 of `other_tokens`
  (`"'"` and `"s"`) align to token 4 of `spacy_tokens` (`"'s"`).
- The dictionary `b2a_multi` shows that there are no tokens in `spacy_tokens`
  that map to multiple tokens in `other_tokens`.
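
To apply annotations or vectors from `other_tokens` to the spaCy tokens, the
one-to-one and many-to-one mappings can be combined into a single lookup. The
following is a minimal sketch and not part of spaCy: `group_by_target` and
`mean_pool` are hypothetical helpers, the `a2b` and `a2b_multi` values are
copied by hand from the example output, and the two-dimensional vectors are
dummy data standing in for real model output.

```python
def group_by_target(a2b, a2b_multi, n_b):
    """For each token in b, collect the indices of the tokens in a aligned to it.

    Hypothetical helper for illustration, not a spaCy function.
    """
    groups = [[] for _ in range(n_b)]
    # One-to-one alignments: -1 marks "no one-to-one counterpart"
    for i, j in enumerate(a2b):
        if j >= 0:
            groups[j].append(i)
    # Many-to-one alignments: several tokens in a map to one token in b
    for i, j in a2b_multi.items():
        groups[j].append(i)
    return [sorted(group) for group in groups]


def mean_pool(vectors, groups):
    """Average the vectors of each group (hypothetical helper).

    A group with no aligned tokens produces an empty list.
    """
    pooled = []
    for group in groups:
        rows = [vectors[i] for i in group]
        pooled.append([sum(column) / len(rows) for column in zip(*rows)])
    return pooled


other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]

# Assumption: values copied from the alignment example, not computed here
a2b = [0, 1, 2, 3, -1, -1, 5, 6]
a2b_multi = {4: 4, 5: 4}

groups = group_by_target(a2b, a2b_multi, len(spacy_tokens))
print(groups)  # [[0], [1], [2], [3], [4, 5], [6], [7]]

# Sanity check: joining the grouped tokens reproduces the spaCy tokens
merged = ["".join(other_tokens[i] for i in group) for group in groups]
print(merged == spacy_tokens)  # True

# Dummy per-token vectors for other_tokens (stand-ins for model output)
vectors = [[float(i), 1.0] for i in range(len(other_tokens))]
pooled = mean_pool(vectors, groups)
print(pooled[4])  # [4.5, 1.0], the average of the vectors for "'" and "s"
```

Averaging is only one possible choice here. Taking the first or last
sub-token's vector would work with the same grouping.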

## Merging and splitting {#retokenization new="2.1"}

The [`Doc.retokenize`](/api/doc#retokenize) context manager lets you merge and