13 Commits

Author SHA1 Message Date
Andrew Ongko 81564cc4e8 Update Indonesian model (#2752)
* adding e-KTP in tokenizer exceptions list

* add exception token

* removing lines containing spaces, as they won't matter since we use the .split() method in the end; added new tokens in exceptions

* add tokenizer exceptions list

* combining base_norms with norm_exceptions

* adding norm_exception

* fix double key in lemmatizer

* remove unused import on punctuation.py

* reformat stop_words to reduce number of lines, improve readability

* updating tokenizer exception

* implement is_currency for lang/id

* adding orth_first_upper in tokenizer_exceptions

* update the norm_exception list

* remove bunch of abbreviations

* adding contributors file
2018-09-14 12:30:32 +02:00
ines 8ce6f96180 Don't make copies of language data components 2017-10-11 15:34:55 +02:00
ines 417d45f5d0 Add lemmatizer data as variable on language data
Don't create lookup lemmatizer within Language class and just pass in
the data so it can be set on Token creation
2017-10-11 02:24:58 +02:00
ines 0c2343d73a Tidy up language data 2017-10-11 02:22:49 +02:00
Jim Geovedi f77443ab68 reworked 2017-08-20 13:43:21 +07:00
Jim Geovedi bbc75da38d enable syntax iterator and lemma lookup 2017-07-27 10:51:15 +07:00
Jim Geovedi 62443d495a enable token match 2017-07-26 19:13:14 +07:00
Jim Geovedi 082e94ac1c added infix rules 2017-07-24 06:17:07 +07:00
Jim Geovedi 0e590c711f added prefix & suffix rules 2017-07-23 23:46:40 +07:00
Jim Geovedi b80c35bc9a added norm_exceptions 2017-07-23 22:54:49 +07:00
Jim Geovedi 082e9ade46 fixed typo 2017-07-23 21:30:34 +07:00
Jim Geovedi da98676839 use template 2017-07-23 20:51:31 +07:00
Jim Geovedi c2b4dd7809 start working on Indonesian language 2017-07-23 20:50:56 +07:00