* Add hacky solution to Issue #220. Currently specials.json only supports literal patterns, which doesn't allow us to pre-tag whitespace with the correct token, SP, as a rule. The data-driven approach should be easy but for some reason fails here. Adding a hard code in Morphology isn't a good solution, but we do want to fix the behaviour right away, and don't want to wait for an architecturally better solution.

2016-01-19 03:35:20 +01:00 · 2016-01-19 03:35:20 +01:00 · 590f38bdb2
parent 445164d5b4
commit 590f38bdb2
1 changed files with 7 additions and 0 deletions
--- a/spacy/morphology.pyx
+++ b/spacy/morphology.pyx
@ -40,6 +40,13 @@ cdef class Morphology:
            tag_id = tag
        if tag_id >= self.n_tags:
            raise ValueError("Unknown tag: %s" % tag)
        # TODO: It's pretty arbitrary to put this logic here. I guess the justification
        # is that this is where the specific word and the tag interact. Still,
        # we should have a better way to enforce this rule, or figure out why
        # the statistical model fails.
        # Related to Issue #220
        if Lexeme.c_check_flag(token.lex, IS_SPACE):
            tag_id = self.reverse_index[self.strings['SP']]
        analysis = <MorphAnalysisC*>self._cache.get(tag_id, token.lex.orth)
        if analysis is NULL:
            analysis = <MorphAnalysisC*>self.mem.alloc(1, sizeof(MorphAnalysisC))