spaCy/spacy/tests/training
Adriane Boyd 4448680750
Fix alignment for 1-to-1 tokens and lowercasing (#6476)
* When checking for token alignments, check not only that the tokens are
identical but also that the character position is at the start of a token
in both tokenizations.

  Identical tokens are not necessarily aligned one-to-one. In a case like
  `["a'", "''"]` vs. `["a", "''", "'"]`, the `''` tokens are identical but
  should not be aligned on the token level at character position 2, since
  that position is the start of one token but the middle of another.

* Use the lowercased version of the token texts to create the
character-to-token alignment because lowercasing can change the string
length (e.g., for `İ`; see the bug report that was closed as not a bug:
https://bugs.python.org/issue34723).
2020-12-08 14:25:16 +08:00
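
The two points in the commit message above can be illustrated with a minimal sketch. This is not spaCy's actual alignment code; the helper names `char_to_token` and `one_to_one_pairs` are hypothetical, and the sketch assumes both tokenizations cover the same text once whitespace is ignored and the texts are lowercased.

```python
from typing import List, Tuple


def char_to_token(tokens: List[str]) -> Tuple[List[int], List[bool]]:
    """Map every character of the lowercased tokens to its token index and
    record whether that character is the first one of its token.

    Hypothetical helper for illustration only, not part of spaCy's API.
    """
    owners: List[int] = []
    starts: List[bool] = []
    for i, token in enumerate(tokens):
        lowered = token.lower()  # lowercasing can change the string length
        for j in range(len(lowered)):
            owners.append(i)
            starts.append(j == 0)
    return owners, starts


def one_to_one_pairs(
    tokens_a: List[str], tokens_b: List[str]
) -> List[Tuple[int, int]]:
    """Return (i, j) pairs of tokens that align one-to-one.

    A pair is accepted only if the lowercased texts are identical AND the
    shared character position is the start of a token on both sides.
    """
    owners_a, starts_a = char_to_token(tokens_a)
    owners_b, starts_b = char_to_token(tokens_b)
    if len(owners_a) != len(owners_b):
        raise ValueError("token sequences do not cover the same text")
    pairs: List[Tuple[int, int]] = []
    for pos in range(len(owners_a)):
        # Skip positions that fall in the middle of a token on either side.
        if not (starts_a[pos] and starts_b[pos]):
            continue
        i, j = owners_a[pos], owners_b[pos]
        if tokens_a[i].lower() == tokens_b[j].lower():
            pairs.append((i, j))
    return pairs


# The identical "''" tokens meet at character position 2, which starts a
# token on the left but lies inside a token on the right: no alignment.
print(one_to_one_pairs(["a'", "''"], ["a", "''", "'"]))

# "İ".lower() is "i" followed by U+0307 (two code points), so the raw texts
# differ in length while the lowercased texts line up.
print(len("İ"), len("İ".lower()))
print(one_to_one_pairs(["İ", "x"], ["i\u0307", "x"]))
```

Building the character map from the lowercased texts keeps both sides the same length when casing differs, and the start-of-token check rejects identical tokens that only overlap partway through the text.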
..
__init__.py
test_augmenters.py Update data augmenters (#6196) 2020-10-04 17:46:29 +02:00
test_new_example.py
test_readers.py Fixes in test suite (#6457) 2020-12-02 12:57:08 +01:00
test_training.py Fix alignment for 1-to-1 tokens and lowercasing (#6476) 2020-12-08 14:25:16 +08:00