From 1d5ff3e4554b16b4a24a18be4a31b4b73e16602c Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Wed, 17 Jul 2019 15:29:36 +0200
Subject: [PATCH] Add infobox

---
 website/docs/usage/linguistic-features.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index 09f81c7c0..2ef30576e 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -1019,6 +1019,15 @@ above:
 - The dictionary `b2a_multi` shows that there are no tokens in `spacy_tokens`
   that map to multiple tokens in `other_tokens`.
 
+<Infobox title="Important note" variant="warning">
+
+The current implementation of the alignment algorithm assumes that both
+tokenizations add up to the same string. For example, you'll be able to align
+`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
+`["I", "'m"]` and `["I", "am"]`.
+
+</Infobox>
+
 ## Merging and splitting {#retokenization new="2.1"}
 
 The [`Doc.retokenize`](/api/doc#retokenize) context manager lets you merge and
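
A minimal sketch of the constraint the new infobox describes, using the `spacy.gold.align` helper that this section of the docs covers (the spaCy v2.1-era API; the concrete output values in the comments are illustrative, not guaranteed):

```python
from spacy.gold import align  # spaCy v2.x location of the alignment helper

# Both tokenizations add up to the same string ("i'm"), so they can be
# aligned. "'" and "m" both map into "'m", which shows up in the
# many-to-one mapping from the first tokenization to the second.
cost, a2b, b2a, a2b_multi, b2a_multi = align(["i", "'", "m"], ["i", "'m"])
print("Misaligned tokens:", cost)
print("Many-to-one mappings a -> b:", a2b_multi)  # e.g. {1: 1, 2: 1}

# ["i", "'m"] and ["i", "am"] add up to different strings ("i'm" vs. "iam"),
# so the algorithm has no common underlying string to anchor on and cannot
# produce a meaningful alignment for them.
```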