spaCy/spacy/training
Adriane Boyd 4448680750
Fix alignment for 1-to-1 tokens and lowercasing (#6476)
* When checking for token alignments, check not only that the tokens are
identical but also that the character position is at the start of a
token in both tokenizations.

  It's possible for the tokens to be identical even though the two
tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs.
`["a", "''", "'"]`, where the middle tokens are identical but should not
be aligned on the token level at character position 2 since it's the
start of one token but the middle of another.

* Use the lowercased version of the token texts to create the
character-to-token alignment because lowercasing can change the string
length (e.g., for `İ`, see the not-a-bug bug report:
https://bugs.python.org/issue34723)
2020-12-08 14:25:16 +08:00
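
The two fixes described in the commit message can be illustrated with a small pure-Python sketch. This is not the code in `align.pyx`. The helper names `token_starts` and `char_to_token` are hypothetical, and character offsets are assumed to be 0-indexed over the concatenated token texts. It only shows why comparing token texts is not enough, and why string lengths should be measured on the lowercased text.

```python
def token_starts(tokens):
    """Return the set of character offsets at which a token begins (hypothetical helper)."""
    starts = set()
    offset = 0
    for token in tokens:
        starts.add(offset)
        offset += len(token)
    return starts


def char_to_token(tokens):
    """Map each character offset to the index of the token covering it (hypothetical helper)."""
    mapping = []
    for i, token in enumerate(tokens):
        mapping.extend([i] * len(token))
    return mapping


# Example from the commit message: both tokenizations spell out "a'''".
tokens_a = ["a'", "''"]
tokens_b = ["a", "''", "'"]

starts_a = token_starts(tokens_a)  # {0, 2}
starts_b = token_starts(tokens_b)  # {0, 1, 3}
c2t_a = char_to_token(tokens_a)    # [0, 0, 1, 1]
c2t_b = char_to_token(tokens_b)    # [0, 1, 1, 2]

pos = 2
# The tokens covering position 2 have identical text ...
same_text = tokens_a[c2t_a[pos]] == tokens_b[c2t_b[pos]]
# ... but position 2 starts a token only in tokens_a, so they are not aligned 1-to-1.
both_start = pos in starts_a and pos in starts_b

# Lowercasing can change the string length: U+0130 lowercases to "i" + U+0307.
word = "\u0130stanbul"
lowered = word.lower()
# Building the map from the lowercased text keeps the offsets consistent with it.
c2t_lower = char_to_token([lowered])
```

Here `same_text` is true while `both_start` is false, which is the case the first bullet guards against. For the second bullet, `len(word)` is 8 but `len(lowered)` is 9, so a character-to-token map built from the original text would be one entry short of the lowercased string.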
converters fix E902 and E903 numbering 2020-10-05 13:43:32 +02:00
__init__.pxd
__init__.py Replace pytokenizations with internal alignment (#6293) 2020-11-03 16:24:38 +01:00
align.pyx Fix alignment for 1-to-1 tokens and lowercasing (#6476) 2020-12-08 14:25:16 +08:00
alignment.py Replace pytokenizations with internal alignment (#6293) 2020-11-03 16:24:38 +01:00
augment.py Auto-format [ci skip] 2020-10-05 21:58:18 +02:00
batchers.py
corpus.py Integrate file readers 2020-10-02 01:36:06 +02:00
example.pxd Make a pre-check to speed up alignment cache (#6139) 2020-09-24 18:13:39 +02:00
example.pyx Replace pytokenizations with internal alignment (#6293) 2020-11-03 16:24:38 +01:00
gold_io.pyx Use null raw for has_unknown_spaces in docs_to_json 2020-10-15 09:57:54 +02:00
initialize.py TextCat updates and fixes (#6263) 2020-10-18 14:50:41 +02:00
iob_utils.py Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2 2020-09-24 14:44:42 +02:00
loggers.py Make console logger table more compact 2020-10-11 12:55:46 +02:00
loop.py Fix success message [ci skip] 2020-10-15 14:42:08 +02:00
pretrain.py avoid resolving the full config (#6465) 2020-11-30 09:34:29 +08:00