Commit Graph

470 Commits

Author SHA1 Message Date
Wannaphong Phatthiyaphaibun 5a14a13f64 fix thai bug (#3693)
fix tokenize for pythainlp
2019-05-10 14:21:34 +02:00
Ines Montani 78cb807a9a Auto-format [ci skip] 2019-05-06 16:58:29 +02:00
Dobita21 f95ecedd83 Add Thai lex_attrs (#3655)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words

* add lex_attrs

* Add lexical attribute getters into the language defaults

* fix LEX_ATTRS


Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-05-01 12:03:14 +02:00
BreakBB 8952004dfc Update French example sents and add two German stop words (#3662)
* Update french example sentences

* Add 'anderem' and 'ihren' to German stop words
2019-05-01 12:01:35 +02:00
Dobita21 721e1fc86c update norm_exceptions (#3627)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words
2019-04-23 12:48:03 +02:00
Dobita21 189c90743c Add Thai norm_exceptions (#3612)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults
2019-04-20 12:16:03 +02:00
Omer Celik 531c0869b2 Added Turkish Lira symbol(₺) (#3576)
Added Turkish Lira symbol(₺) 
https://en.wikipedia.org/wiki/Turkish_lira
2019-04-11 11:32:28 +02:00
Ines Montani 145c0b7e88 Tidy up and auto-format 2019-04-09 11:40:19 +02:00
Dobita21 8bf6967eb7 Update Thai stop words (#3545)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability
2019-04-05 12:06:38 +02:00
jeannefukumaru f67d881b30 fix typos in tag_map flagged by `python -m debug-data` (#3542)

Co-authored-by: Ines Montani <ines@ines.io>
2019-04-05 12:06:09 +02:00
Jeanne Choo b6c9807431 Merge remote-tracking branch 'upstream/master' 2019-04-04 14:21:50 +08:00
Jeanne Choo 80e15af76c fixed tag_map.py merge conflict 2019-04-04 14:18:27 +08:00
jeannefukumaru 876ce01567 updated tag map with missing tags 2019-04-03 23:09:11 +08:00
Ines Montani 4faf62d515 Merge pull request #3530 from svlandeg/fix/issue_3521
Allow English stopwords with any type of apostrophe
2019-04-03 14:14:03 +02:00
Yves Peirsman 951825532c Improved Dutch language resources and Dutch lemmatization (#3409)
* Improved Dutch language resources and Dutch lemmatization

* Fix conftest

* Update punctuation.py

* Auto-format

* Format and fix tests

* Remove unused test file

* Re-add deleted test

* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains

* Cleaner lemmatization files
2019-04-03 14:13:26 +02:00
svlandeg 4ff786e113 addressed all comments by Ines 2019-04-03 13:50:33 +02:00
Kamolsit Mongkolsrisawat dcc67f3f51 Update Thai tokenizer_exception list (#3529)
* add tokenizer_exceptions word (ก-น) from https://goo.gl/JpJ2qq

* update tokenizer_exceptions word list

* add contributor file
2019-04-03 09:13:36 +02:00
svlandeg 673c81bbb4 unicode string for python 2.7 2019-04-02 13:52:07 +02:00
svlandeg eca9cc5417 fixing Issue #3521 by adding all hyphen variants for each stopword 2019-04-02 13:24:59 +02:00
jeannefukumaru 6cdb7b2e04 added tag_map for indonesian (#3515)
* added tag_map for indonesian

* changed tag map from .py to .txt to see if tests pass

* added symbols import

* added utf8 encoding flag

* added missing SCONJ symbol

* Auto-format

* Remove unused imports

* Make tag map available in Indonesian defaults
2019-04-01 12:27:48 +02:00
Ines Montani c23e234d65 Auto-format 2019-04-01 12:11:27 +02:00
Ines Montani 0a0b1087b0 Make tag map available in Indonesian defaults 2019-04-01 11:46:51 +02:00
Ines Montani 5d9212c44c Remove unused imports 2019-04-01 11:46:25 +02:00
Ines Montani 8d6b544632 Auto-format 2019-04-01 11:45:43 +02:00
jeannefukumaru 6567f27849 added missing SCONJ symbol 2019-04-01 17:02:53 +08:00
jeannefukumaru 082a0a2232 added utf8 encoding flag 2019-04-01 16:37:11 +08:00
jeannefukumaru a741bed7a7 added symbols import 2019-04-01 16:21:06 +08:00
jeannefukumaru 745cf0c914 changed tag map from .py to .txt to see if tests pass 2019-04-01 07:04:50 +08:00
jeannefukumaru 3cc897102f added tag_map for indonesian 2019-04-01 00:00:08 +08:00
Duygu Altinok 5a7bc6b39d Fix/irreg adverbs extension (#3499)
* extended list of irreg adverbs

* added test to exceptions

* fixed typo
2019-03-28 13:23:33 +01:00
Wannaphong Phatthiyaphaibun 297a051992 Update Thai tag map (#3480)
* Update Thai tag map

Update Thai tag map

* Create wannaphongcom.md
2019-03-25 16:53:26 +01:00
Matthew Honnibal c66bd61e88 Fix lemmas 2019-03-21 14:22:12 +01:00
Matthew Honnibal 04395ffa49 Bring English tag_map in line with UD Treebank
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it has been done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
2019-03-21 13:53:44 +01:00
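Note on 04395ffa49: the script mentioned in that commit message is not part of this log. The following is only a rough sketch of what such a check might look like, not the author's code. It assumes the training data is in CoNLL-U format (UPOS in column 4, XPOS in column 5) and that the tag map can be reduced to a plain dict from fine-grained tag to coarse POS string. The names `read_conllu_tags` and `find_disagreements` are hypothetical, and morph rules are not handled.

```python
from collections import Counter, defaultdict


def read_conllu_tags(lines):
    """Yield (xpos, upos) pairs from CoNLL-U formatted lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        token_id, upos, xpos = cols[0], cols[3], cols[4]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        yield xpos, upos


def find_disagreements(lines, tag_map):
    """Return {xpos: (mapped_pos, most_frequent_upos)} where the two differ.

    `tag_map` is assumed to be a simplified dict such as {"VB": "VERB"};
    this is an illustrative stand-in, not spaCy's actual tag map structure.
    """
    counts = defaultdict(Counter)
    for xpos, upos in read_conllu_tags(lines):
        counts[xpos][upos] += 1
    report = {}
    for xpos, upos_counts in counts.items():
        best_upos, _ = upos_counts.most_common(1)[0]
        mapped = tag_map.get(xpos)  # None if the tag is missing from the map
        if mapped != best_upos:
            report[xpos] = (mapped, best_upos)
    return report
```

With a treebank file opened as `f`, calling `find_disagreements(f, tag_map)` would list every fine-grained tag whose mapped POS differs from the UPOS it most often carries in the data, which is one way to spot entries worth revisiting.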
Mehdi Hamoumi 9211f30ee3 Tiny correction in french lookup dictionary (#3427) 2019-03-19 13:00:19 +01:00
Ines Montani 2912ddc9a6 Don't set extension attribute in Japanese (closes #3398) 2019-03-12 13:30:33 +01:00
Ines Montani cdd418b93e Auto-format [ci skip] 2019-03-11 17:10:50 +01:00
Matthew Honnibal 39a4741e26 Add support for vocab.writing_system property (#3390)
* Add xfail test for vocab.writing_system

* Add vocab.writing_system property

* Set Language.Defaults.writing_system

* Set default writing system

* Remove xfail on test_vocab_writing_system
2019-03-11 15:23:20 +01:00
Ines Montani ee4f312e89 Add writing_system to ArabicDefaults (experimental) 2019-03-11 14:22:23 +01:00
Ines Montani ef80cfde6f Fix pickling of Japanese (closes #3191) 2019-03-11 13:34:23 +01:00
Matthew Honnibal 5d25ee52fb Fix English tag map 2019-03-11 01:06:02 +01:00
Matthew Honnibal 7503e1e505 Improve English tag map. Re #593, #3311 2019-03-10 23:50:00 +01:00
Ines Montani 610fb306bd Revert hyphens 2019-03-09 12:51:53 +01:00
Ines Montani bbabb6aaae Escape more hyphens 2019-03-09 12:41:05 +01:00
Ines Montani b8db219850 Auto-format 2019-03-09 12:40:58 +01:00
Ines Montani a145bfe627 Try escaping hyphens again 2019-03-09 03:06:50 +01:00
Ines Montani b9c71fc0f0 Fix flags 2019-03-09 02:46:04 +01:00
Ines Montani ae09b6a6cf Try fixing unicode inconsistencies on Python 2 2019-03-09 02:37:50 +01:00
Ines Montani d957d7a697 Auto-format 2019-03-09 02:37:41 +01:00
Ines Montani 65402c3d02 Revert "Experiment with escaping hyphens"
This reverts commit 9b42e2d5dd.
2019-03-09 02:13:00 +01:00
Ines Montani 9b42e2d5dd Experiment with escaping hyphens 2019-03-09 02:05:26 +01:00