spaCy

Adriane Boyd 4448680750 Fix alignment for 1-to-1 tokens and lowercasing (#6476)	2020-12-08 14:25:16 +08:00

* When checking for token alignments, check not only that the tokens are identical but also that the character position is the start of a token in both tokenizations. Tokens can be identical without being aligned one-to-one, as in `["a'", "''"]` vs. `["a", "''", "'"]`: the middle tokens are identical but should not be aligned on the token level at character position 2, since that position is the start of one token but the middle of another.
* Use the lowercased version of the token texts to create the character-to-token alignment, because lowercasing can change the string length (e.g., for `İ`; see the not-a-bug bug report: https://bugs.python.org/issue34723).
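The two points in this commit message can be illustrated with a minimal sketch. This is not the code in `align.pyx`; the function name `one_to_one_pairs` is hypothetical, and the sketch assumes both tokenizations cover the same text with whitespace already removed. It pairs tokens only when they are identical and begin at the same character offset, and it measures offsets on the lowercased texts.

```python
def one_to_one_pairs(tokens_a, tokens_b):
    """Return (i, j) index pairs for tokens that are identical AND start
    at the same character offset in both tokenizations.

    Hypothetical helper for illustration; not spaCy's implementation.
    """
    # Lowercase first, so offsets are computed on the same strings that are
    # compared. Lowercasing can change the length: "İ".lower() is 2 chars.
    a = [t.lower() for t in tokens_a]
    b = [t.lower() for t in tokens_b]
    if "".join(a) != "".join(b):
        raise ValueError("tokenizations do not cover the same text")

    # Map each token-start offset in b to the index of the token starting there.
    starts_b = {}
    offset = 0
    for j, tok in enumerate(b):
        starts_b[offset] = j
        offset += len(tok)

    pairs = []
    offset = 0
    for i, tok in enumerate(a):
        j = starts_b.get(offset)  # None if this offset is mid-token in b
        if j is not None and b[j] == tok:
            pairs.append((i, j))
        offset += len(tok)
    return pairs


# The example from the commit message: the "''" tokens are identical, but
# offset 2 is a token start only in the first tokenization.
print(one_to_one_pairs(["a'", "''"], ["a", "''", "'"]))
print(len("İ"), len("İ".lower()))
```

Comparing token texts alone would pair the two `''` tokens here; checking the start offsets as well is what rules that pairing out.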
converters	fix E902 and E903 numbering	2020-10-05 13:43:32 +02:00
__init__.pxd	…
__init__.py	Replace pytokenizations with internal alignment (#6293)	2020-11-03 16:24:38 +01:00
align.pyx	Fix alignment for 1-to-1 tokens and lowercasing (#6476)	2020-12-08 14:25:16 +08:00
alignment.py	Replace pytokenizations with internal alignment (#6293)	2020-11-03 16:24:38 +01:00
augment.py	Auto-format [ci skip]	2020-10-05 21:58:18 +02:00
batchers.py	…
corpus.py	Integrate file readers	2020-10-02 01:36:06 +02:00
example.pxd	Make a pre-check to speed up alignment cache (#6139)	2020-09-24 18:13:39 +02:00
example.pyx	Replace pytokenizations with internal alignment (#6293)	2020-11-03 16:24:38 +01:00
gold_io.pyx	Use null raw for has_unknown_spaces in docs_to_json	2020-10-15 09:57:54 +02:00
initialize.py	TextCat updates and fixes (#6263)	2020-10-18 14:50:41 +02:00
iob_utils.py	Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2	2020-09-24 14:44:42 +02:00
loggers.py	Make console logger table more compact	2020-10-11 12:55:46 +02:00
loop.py	Fix success message [ci skip]	2020-10-15 14:42:08 +02:00
pretrain.py	avoid resolving the full config (#6465)	2020-11-30 09:34:29 +08:00