spaCy/spacy/lang/hu
Adriane Boyd 2ea9b58006
Ignore prefix in suffix matches (#9155)
* Ignore prefix in suffix matches

Ignore the currently matched prefix when looking for suffix matches in
the tokenizer. Otherwise a lookbehind in the suffix pattern may match
incorrectly due the presence of the prefix in the token string.

* Move °[cfkCFK]. to a tokenizer exception

* Adjust exceptions for same tokenization as v3.1

* Also update test accordingly

* Continue to split . after °CFK if ° is not a prefix

* Exclude new ° exceptions for pl

* Switch back to default tokenization of "° C ."

* Revert "Exclude new ° exceptions for pl"

This reverts commit 952013a5b4.

* Add exceptions for °C for hu
2021-10-27 13:02:25 +02:00
..
__init__.py 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
examples.py
punctuation.py Fix Hungarian % tokenization (#6013) 2020-09-02 13:06:16 +02:00
stop_words.py
tokenizer_exceptions.py Ignore prefix in suffix matches (#9155) 2021-10-27 13:02:25 +02:00