13 Commits

Author SHA1 Message Date
Ioannis Daras 6ed18412d0 Greek language optimizations (#2558)
* Greek language optimizations

* Add encoding on files containing greek words

* Add encoding on files containing greek words
2018-07-18 18:51:38 +02:00
Aliia E 428bae66b5 Add Tatar Language Support (#2444)
* add Tatar lang support

* add Tatar letters

* add Tatar tests

* sign contributor agreement

* sign contributor agreement [x]

* remove comments from Language class

* remove all template comments
2018-06-19 10:17:53 +02:00
Tahar Zanouda 00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Ali Zarezade 42349471bc add ٪ as punctuation 2018-01-23 18:11:33 +03:30
Ali Zarezade 2bda582135 Add Persian character and symbols
Add Persian characters and the following:
- ٪ used instead of %
- ؟ used instead of ?
- ﷼ used instead of $
- ، used instead of ,
- ؛ used instead of ;
2018-01-23 13:20:36 +03:30
Vadim Mazaev 81314f8659 Fixed tokenizer: added char classes; added first lemmatizer and tokenizer tests
2017-11-21 22:23:59 +03:00
ines e85e1d571b Update base punctuation 2017-10-14 14:59:23 +02:00
ines 09aed58140 Port over changes from #1333 and add comments 2017-10-14 12:52:59 +02:00
ines 5ee10379db Port over changes from #1340 2017-09-26 16:38:08 +02:00
ines 10d291f129 Port over change from #1351 2017-09-26 16:11:41 +02:00
Matthew Honnibal cfc055734e Split % in units, for compatibility with corpus 2017-08-25 20:03:37 -05:00
ines a8e58e04ef Add symbols class to punctuation rules to handle emoji (see #1088)
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽‍💻 into account.
2017-05-27 17:57:10 +02:00
ines 604f299cf6 Add char classes to global language data 2017-05-08 23:59:33 +02:00