spaCy/spacy/lang/tl/tokenizer_exceptions.py

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...symbols import ORTH, LEMMA
from ...util import update_exc


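# Tagalog contractions, split into their component tokens. Each piece carries
# its surface form (ORTH) and lemma (LEMMA); the ORTH values of an entry
# concatenate back to the entry's key.
_exc = {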
    "tayo'y": [{ORTH: "tayo", LEMMA: "tayo"}, {ORTH: "'y", LEMMA: "ay"}],
    "isa'y": [{ORTH: "isa", LEMMA: "isa"}, {ORTH: "'y", LEMMA: "ay"}],
    "baya'y": [{ORTH: "baya", LEMMA: "bayan"}, {ORTH: "'y", LEMMA: "ay"}],
    "sa'yo": [{ORTH: "sa", LEMMA: "sa"}, {ORTH: "'yo", LEMMA: "iyo"}],
    "ano'ng": [{ORTH: "ano", LEMMA: "ano"}, {ORTH: "'ng", LEMMA: "ang"}],
    "siya'y": [{ORTH: "siya", LEMMA: "siya"}, {ORTH: "'y", LEMMA: "ay"}],
    "nawa'y": [{ORTH: "nawa", LEMMA: "nawa"}, {ORTH: "'y", LEMMA: "ay"}],
    "papa'no": [{ORTH: "papa'no", LEMMA: "papaano"}],
    "'di": [{ORTH: "'di", LEMMA: "hindi"}],
}


TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
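
Every entry in `_exc` above has the property that its ORTH pieces, joined in order, reproduce the key exactly (for example `"tayo" + "'y" == "tayo'y"`). The sketch below checks that property on a small sample. It is an illustration only: `check_exceptions` is a hypothetical helper, not part of spaCy, and the string keys `"ORTH"` and `"LEMMA"` stand in for the symbols imported from `spacy.symbols` so that the snippet runs without spaCy installed.

```python
# Stand-in attribute keys (assumption: plain strings instead of spaCy symbols).
ORTH = "ORTH"
LEMMA = "LEMMA"


def check_exceptions(exceptions):
    """Return the keys whose ORTH pieces do not join back into the key."""
    bad = []
    for text, pieces in exceptions.items():
        joined = "".join(piece[ORTH] for piece in pieces)
        if joined != text:
            bad.append(text)
    return bad


sample = {
    "tayo'y": [{ORTH: "tayo", LEMMA: "tayo"}, {ORTH: "'y", LEMMA: "ay"}],
    "'di": [{ORTH: "'di", LEMMA: "hindi"}],
    # Deliberately broken: the apostrophe is missing from the second piece.
    "sa'yo": [{ORTH: "sa", LEMMA: "sa"}, {ORTH: "yo", LEMMA: "iyo"}],
}

print(check_exceptions(sample))
```

A check like this is cheap to run over a new exception table before adding it, since a piece that drops or duplicates a character would otherwise change the text the tokenizer reconstructs.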

Commit history:

2018-12-18 12:08:38 +00:00  Added alpha support for Tagalog language (#3062)
    I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format to the EN and ES languages. I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to its Tagalog counterpart, added some tokenizer exceptions, and kept the tag map the same as the English language. While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases.
    * Added alpha support for Tagalog language
    * Edited contributor template
    * Included SCA; Reverted templates
    * Fixed SCA template
    * Fixed changes in SCA template

2019-02-08 13:14:49 +00:00  Tidy up and fix small bugs and typos

2020-07-22 20:18:46 +00:00  Tidy up and move noun_chunks, token_match, url_match