Load exceptions last in Tokenizer.from_bytes (#12553)

In `Tokenizer.from_bytes`, the exceptions should be loaded last so that
they are only processed once as part of loading the model.

In the background, the exceptions are tokenized as phrase matcher patterns,
and this internal tokenization has to stay in sync with all the remaining
tokenizer settings. If the exceptions are not loaded last,
`Tokenizer.from_bytes/disk` shows speed regressions vs.
`Tokenizer.add_special_case`, because the caches are reloaded more often
than necessary during deserialization.
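
To illustrate why the order matters, here is a minimal sketch of the pattern (a hypothetical `MiniTokenizer` class with `set_prefix`/`set_rules` helpers, not spaCy's actual implementation): every settings change triggers a reload of the special cases, so the rules should be applied only after all other settings are in place.

```python
class MiniTokenizer:
    """Toy model of a tokenizer whose settings changes invalidate a cache.

    Hypothetical example, not spaCy's actual code.
    """

    def __init__(self):
        self.prefix = ""
        self._rules = {}
        self.retokenized = 0  # total exception strings re-analyzed

    def _reload_special_cases(self):
        # Stand-in for the expensive step: every exception is retokenized
        # against the current prefix/suffix/infix settings.
        self.retokenized += len(self._rules)

    def set_prefix(self, value):
        self.prefix = value
        self._reload_special_cases()

    def set_rules(self, value):
        self._rules = value
        self._reload_special_cases()

    def from_dict(self, data):
        # Rules last: the reload triggered by set_prefix runs over an
        # empty rule set, so each exception is analyzed exactly once.
        self.set_prefix(data.get("prefix", ""))
        self.set_rules(data.get("rules", {}))
        return self


data = {"prefix": "(", "rules": {f"ex{i}": [] for i in range(1000)}}

tok = MiniTokenizer().from_dict(data)
print(tok.retokenized)  # 1000: each exception processed once

tok2 = MiniTokenizer()
tok2.set_rules(data["rules"])    # 1000 retokenizations
tok2.set_prefix(data["prefix"])  # another 1000: the cache is rebuilt again
print(tok2.retokenized)  # 2000
```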
Adriane Boyd, 2023-04-20 11:30:34 +02:00 (committed by GitHub)
1 changed file with 4 additions and 2 deletions


```diff
@@ -834,10 +834,12 @@ cdef class Tokenizer:
             self.token_match = re.compile(data["token_match"]).match
         if "url_match" in data and isinstance(data["url_match"], str):
             self.url_match = re.compile(data["url_match"]).match
-        if "rules" in data and isinstance(data["rules"], dict):
-            self.rules = data["rules"]
         if "faster_heuristics" in data:
             self.faster_heuristics = data["faster_heuristics"]
+        # always load rules last so that all other settings are set before the
+        # internal tokenization for the phrase matcher
+        if "rules" in data and isinstance(data["rules"], dict):
+            self.rules = data["rules"]
         return self
```
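
For context, the affected round trip exercised through spaCy's public API (a blank English pipeline is assumed here):

```python
import spacy

nlp = spacy.blank("en")
# Special cases added here end up in tokenizer.rules and are serialized
# together with the other tokenizer settings.
nlp.tokenizer.add_special_case("gimme", [{"ORTH": "gim"}, {"ORTH": "me"}])
data = nlp.tokenizer.to_bytes()

nlp2 = spacy.blank("en")
# With this change, from_bytes applies the rules after every other
# setting, so the special-case cache is built exactly once.
nlp2.tokenizer.from_bytes(data)
print([t.text for t in nlp2("gimme")])  # ['gim', 'me']
```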