From 1ea472468a12c3ff2abe92bbfa7a7f890b68d7ba Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Wed, 17 Jul 2019 15:08:33 +0200
Subject: [PATCH] Add usage docs for aligning tokenization

---
 website/docs/usage/linguistic-features.md | 47 +++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index 538a9f205..cc4bbed6d 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -963,6 +963,53 @@ Once you have a [`Doc`](/api/doc) object, you can write to its attributes to set
 the part-of-speech tags, syntactic dependencies, named entities and other
 attributes. For details, see the respective usage pages.
 
+### Aligning tokenization {#aligning-tokenization}
+
+spaCy's tokenization is non-destructive and uses language-specific rules
+optimized for compatibility with treebank annotations. Other tools and resources
+can sometimes tokenize things differently – for example, `"I'm"` → `["I", "am"]`
+instead of `["I", "'m"]`, or `"Obama's"` → `["Obama", "'", "s"]` instead of
+`["Obama", "'s"]`.
+
+In cases like that, you often want to align the tokenization so that you can
+merge annotations from different sources together, or take vectors predicted by
+a [pre-trained BERT model](https://github.com/huggingface/pytorch-transformers)
+and apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align)
+helper returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the
+number of misaligned tokens, the one-to-one mappings of token indices in both
+directions and the indices where multiple tokens align to one single token.
+
+```python
+### {executable="true"}
+from spacy.gold import align
+
+other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
+spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
+cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
+print("Misaligned tokens:", cost)  # 2
+print("One-to-one mappings a -> b", a2b)  # array([0, 1, 2, 3, -1, -1, 5, 6])
+print("One-to-one mappings b -> a", b2a)  # array([0, 1, 2, 3, -1, 6, 7])
+print("Many-to-one mappings a -> b", a2b_multi)  # {4: 4, 5: 4}
+print("Many-to-one mappings b -> a", b2a_multi)  # {}
+```
+
+Here are some insights from the alignment information generated in the example
+above:
+
+- Two tokens are misaligned.
+- The one-to-one mappings for the first four tokens are identical, which means
+  they map to each other. This makes sense because they're also identical in the
+  input: `"i"`, `"listened"`, `"to"` and `"obama"`.
+- The index mapped to `a2b[6]` is `5`, which means that `other_tokens[6]`
+  (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
+- `a2b[4]` is `-1`, which means that there is no one-to-one alignment for the
+  token at `other_tokens[4]`. The token `"'"` doesn't exist on its own in
+  `spacy_tokens`. The same goes for `a2b[5]` and `other_tokens[5]`, i.e. `"s"`.
+- The dictionary `a2b_multi` shows that both tokens 4 and 5 of `other_tokens`
+  (`"'"` and `"s"`) align to token 4 of `spacy_tokens` (`"'s"`).
+- The dictionary `b2a_multi` shows that there are no tokens in `spacy_tokens`
+  that map to multiple tokens in `other_tokens`.
+
 ## Merging and splitting {#retokenization new="2.1"}
 
 The [`Doc.retokenize`](/api/doc#retokenize) context manager lets you merge and