mirror of https://github.com/explosion/spaCy.git
2ea9b58006
* Ignore prefix in suffix matches
Ignore the currently matched prefix when looking for suffix matches in
the tokenizer. Otherwise a lookbehind in the suffix pattern may match
incorrectly due to the presence of the prefix in the token string.
* Move °[cfkCFK]. to a tokenizer exception
* Adjust exceptions for same tokenization as v3.1
* Also update test accordingly
* Continue to split . after °CFK if ° is not a prefix
* Exclude new ° exceptions for pl
* Switch back to default tokenization of "° C ."
* Revert "Exclude new ° exceptions for pl"
This reverts commit
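
The prefix/suffix interaction described in the first item can be illustrated with a minimal sketch that uses only Python's `re` module. The patterns `PREFIX_RE` and `SUFFIX_RE` and the helper `find_suffix` are hypothetical names invented for this illustration; they are not spaCy's actual rules or tokenizer code. The sketch only shows why a lookbehind in a suffix pattern behaves differently depending on whether the already-matched prefix is still part of the searched string.

```python
import re

# Hypothetical patterns for illustration only; not spaCy's real rules.
PREFIX_RE = re.compile(r"^°")
# Suffix: a final "." that is preceded by "°" plus a temperature unit letter.
SUFFIX_RE = re.compile(r"(?<=°[cfkCFK])\.$")


def find_suffix(token, ignore_prefix):
    """Return the matched suffix text, or None if no suffix matches.

    With ignore_prefix=True, the suffix search only sees the part of the
    token that follows the matched prefix.
    """
    prefix = PREFIX_RE.search(token)
    start = prefix.end() if (prefix and ignore_prefix) else 0
    match = SUFFIX_RE.search(token[start:])
    return match.group(0) if match else None


# The lookbehind sees the "°" prefix, so the "." suffix matches.
with_prefix = find_suffix("°C.", ignore_prefix=False)
# The prefix is excluded, so the lookbehind has no "°" to match against.
without_prefix = find_suffix("°C.", ignore_prefix=True)
print(with_prefix, without_prefix)
```

In this sketch, searching the full string lets the lookbehind match on the "°" that has already been claimed as a prefix, while searching only the remainder does not. This is presumably also why the `°[cfkCFK].` case is handled as a tokenizer exception in the later items rather than through the suffix pattern.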
| Name |
|---|
| `__init__.py` |
| `examples.py` |
| `punctuation.py` |
| `stop_words.py` |
| `tokenizer_exceptions.py` |