830 Commits

Author SHA1 Message Date
Adriane Boyd 5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Adriane Boyd 02bac8f269
Fix non-deterministic deduplication in Greek lemmatizer (#8421) 2021-06-17 09:11:01 +02:00
Giovanni Toffoli 19521d525b
Added Italian POS-aware lemmatizer. (#8079)
* Added Italian POS-aware lemmatizer.

Also added the code used to build the lookup tables by POS.

* Create gtoffoli.md

* Add imports and format

* Remove helper script

* Use lemma_lookup instead of lemma_lookup_legacy

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-16 11:14:45 +02:00
Antti Ajanki 5a6125c227
[Finnish tokenizer] Handle conjunction contractions (#8105) 2021-06-16 10:56:47 +02:00
Adriane Boyd 5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Adriane Boyd b98d216205
Update Catalan language data (#8308)
* Update Catalan language data

Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:

https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data

* Update tokenizer settings for UD Catalan AnCora

Update for UD Catalan AnCora v2.7 with merged multi-word tokens.

* Update test

* Move prefix pattern to more generic infix pattern

* Clean up
2021-06-11 10:21:22 +02:00
Adriane Boyd f4008bdb13
Restrict pymorphy2 requirement to pymorphy2 mode (#8299)
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
2021-06-11 10:19:22 +02:00
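
The restriction described in the entry above (#8299) can be illustrated with a deferred import. This is a minimal sketch, not spaCy's actual code: the class name `OptionalBackendLemmatizer` and its attributes are hypothetical, and only the mode name `pymorphy2` and the package name come from the commit message.

```python
import importlib


class OptionalBackendLemmatizer:
    """Hypothetical stand-in for a lemmatizer with several modes."""

    def __init__(self, mode="lookup"):
        self.mode = mode
        self.backend = None
        # Import the optional package only when the mode that needs it is
        # requested, so other modes load without it being installed.
        if mode == "pymorphy2":
            try:
                self.backend = importlib.import_module("pymorphy2")
            except ImportError as err:
                raise ImportError(
                    "The 'pymorphy2' mode requires the pymorphy2 package."
                ) from err


# The lookup mode never touches the optional dependency.
lookup_lemmatizer = OptionalBackendLemmatizer(mode="lookup")
```

With this layout, a missing `pymorphy2` installation only raises an error when the `pymorphy2` mode is actually selected.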
Jean-Hugues Roy ff5cf3606c
Improvements to French stopwords list (#7941)
* "y" etc.

Many changes described in pull request

* Update spacy/lang/fr/stop_words.py

* Update spacy/lang/fr/stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-02 11:50:49 +02:00
Paul O'Leary McCann d1a221a374
Add all symbols in Unicode Currency Symbols block (#8212)
* Add all symbols in Unicode Currency Symbols block

In #8102 it came up that the rupee symbol was treated differently from
dollar / euro / yen symbols. This adds many symbols not already
included.

* Fix test

* Fix training test
2021-05-31 18:03:40 +10:00
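
As a rough illustration of the entry above (#8212): the Unicode Currency Symbols block spans U+20A0 to U+20CF, which can be covered with a single character class. The pattern name `CURRENCY_BLOCK` and the helper are assumptions made for this sketch and do not reflect how spaCy's character classes are defined. Note that the dollar and yen signs sit outside this block, so real language data would list them separately.

```python
import re

# Character class covering the whole Currency Symbols block (U+20A0..U+20CF).
# Some code points at the end of the block may be unassigned in a given
# Unicode version; including them in the range is harmless for matching.
CURRENCY_BLOCK = re.compile("[\u20A0-\u20CF]")


def in_currency_block(char):
    """Return True if the single character lies in the Currency Symbols block."""
    return bool(CURRENCY_BLOCK.fullmatch(char))


rupee_in_block = in_currency_block("\u20B9")  # INDIAN RUPEE SIGN
euro_in_block = in_currency_block("\u20AC")   # EURO SIGN
dollar_in_block = in_currency_block("$")      # U+0024, Basic Latin
```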
Adriane Boyd 1d59fdbd39
Update Vietnamese tokenizer (#8099)
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese

Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
2021-05-17 18:16:20 +10:00
Paul O'Leary McCann bdeaf3a18b
Fix/fix en ordinals (#8028)
* Fix #8019

"th" is not the only ordinal ending.

* Add some more ordinal tests
2021-05-07 10:26:42 +02:00
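
The point made in the entry above (#8028), that "th" is not the only ordinal ending, can be sketched as a check that accepts all four English suffixes. This is an illustrative helper, not the code from the pull request; the function name `looks_like_ordinal` is hypothetical, and the check deliberately does not verify that the suffix agrees with the digits (it would accept "1th").

```python
import re

# Digits followed by any of the four English ordinal suffixes.
_ORDINAL = re.compile(r"\d+(st|nd|rd|th)")


def looks_like_ordinal(text):
    """Return True for strings such as '1st', '22nd', '3rd' or '10th'."""
    return bool(_ORDINAL.fullmatch(text.lower()))


ordinal_results = [looks_like_ordinal(t) for t in ("1st", "22nd", "3rd", "10th")]
```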
Adriane Boyd 31528f62ed
Add / to nb infixes (#7991) 2021-05-04 11:00:10 +02:00
Sevdimali 49aed683cc
Azerbaijani language added (#7911) 2021-04-28 14:42:02 +02:00
Jacopo Farina c105ed10fd
Remove torino from stop words (#7634)
Torino is the proper name of a city and the token has no other meaning
2021-04-26 16:53:43 +02:00
m0canu1 921feee092
Added more exception to the italian language from https://forum.wordr… (#7246)
* Added more exception to the italian language from https://forum.wordreference.com/threads/le-abbreviazioni-nella-lingua-italiana-abbreviations-in-italian.2464189/

* Remove unnecessary exception

Co-authored-by: Alexandru Mocanu <alexandru.mocanu@augeos.it>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-03-30 10:23:32 +02:00
Adriane Boyd 3bcf74aca7 Rename and update ru pymorphy2 lookup lemmatize
* To allow default lookup lemmatization with a blank Russian model,
rename pymorphy2 lookup mode to `pymorphy2_lookup`

* Bug fix: update pymorphy2 lookup lemmatize to return list rather than
string
2021-03-15 11:11:06 +01:00
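
A small sketch of why the return type in the bug fix above matters, assuming the caller indexes into the result to pick the first candidate lemma. The function names and the lookup table are hypothetical and only illustrate the string-versus-list pitfall.

```python
def lookup_as_string(word, table):
    """Buggy variant: returns a bare string."""
    return table.get(word, word)


def lookup_as_list(word, table):
    """Fixed variant: returns a list of candidate lemmas."""
    return [table.get(word, word)]


table = {"cats": "cat"}

# Indexing a string yields its first character, not the lemma.
first_from_string = lookup_as_string("cats", table)[0]
# Indexing a list yields the whole lemma.
first_from_list = lookup_as_list("cats", table)[0]
```

Because both calls succeed without raising, a string return value fails silently, which is what makes this kind of bug easy to miss.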
Adriane Boyd 264862c67a
Fix Ukrainian lemmatizer init (#7127)
Fix class variable and init for `UkrainianLemmatizer` so that it loads
the `uk` dictionaries rather than having the parent `RussianLemmatizer`
override with the `ru` settings.
2021-02-22 11:05:08 +11:00
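
One way a subclass can end up with its parent's settings, as in the entry above (#7127), is when the parent's `__init__` assigns an instance attribute that shadows the subclass's class variable. The classes below are hypothetical stand-ins, not the real `RussianLemmatizer` and `UkrainianLemmatizer`, and the actual fix in the pull request may differ.

```python
class ParentLemmatizer:
    def __init__(self):
        # The parent hard-codes its own language on the instance.
        self.dict_lang = "ru"


class BuggyChildLemmatizer(ParentLemmatizer):
    # This class variable is shadowed by the instance attribute that the
    # inherited __init__ sets, so it never takes effect.
    dict_lang = "uk"


class FixedChildLemmatizer(ParentLemmatizer):
    def __init__(self):
        super().__init__()
        # Set the child's language after the parent's init has run.
        self.dict_lang = "uk"


buggy_lang = BuggyChildLemmatizer().dict_lang
fixed_lang = FixedChildLemmatizer().dict_lang
```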
Boian Tzonev cca8651fc8
Bulgarian tokenizer exceptions (#7114)
* [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian

* [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian
2021-02-19 19:19:19 +01:00
Ines Montani 9ba715ed16 Tidy up and auto-format 2021-02-13 12:55:56 +11:00
Ines Montani 6c450decfc Fix punctuation settings and add to initialize tests 2021-02-13 11:51:21 +11:00
Shumi 4e514f1ea8
Update stop_words.py
I have deleted line 1 to 5 and the statement print(STOP_WORDS)
2021-02-11 21:30:34 +02:00
Shumi 0d57e84b7b
Update lex_attrs.py
I have removed line 1 to 4
2021-02-11 21:28:23 +02:00
Shumi 37ec67f868
Update examples.py
I have removed two lines:
# coding: utf8
from __future__ import unicode_literals

And updated: >>> from spacy.lang.tn.examples import sentences
2021-02-11 21:25:58 +02:00
Shumi 39eeba6760
Update __init__.py
Added infixes = TOKENIZER_INFIXES
2021-02-11 21:20:46 +02:00
Shumi ed3397727e
Delete tag_map.py
Tag map file is deleted. I will add it later because it was failing validations
2021-02-10 20:41:18 +02:00
Shumi 7c8721b1bd
Update tag_map.py
Updated tag_map
2021-02-10 20:21:22 +02:00
Shumi f6be28cfb2
Added files to Setswana Language
Add South African Setswana Language
2021-02-10 20:15:13 +02:00
Shumi 24046fef17
South African Setswana language
Please accept the addition of Setswana language
2021-02-10 20:12:33 +02:00
svlandeg 91e72c031e reformatting 2021-01-30 17:29:33 +01:00
svlandeg a8d84188f0 add stop words
Co-authored-by: tewodrosm <tedmaam2006@gmail.com>
2021-01-30 17:26:49 +01:00
Ines Montani e6accb3a9e Tidy up and auto-format 2021-01-30 12:52:33 +11:00
Ines Montani 817b0db521 Fix escape sequence 2021-01-30 12:39:58 +11:00
Ines Montani bbf080dfe5
Merge pull request #6645 from bittlingmayer/patch-3 2021-01-30 01:26:28 +11:00
Adriane Boyd bced6309e5
Add full exceptions with spaces 2021-01-29 14:27:22 +01:00
Ines Montani 5ed51c9dd2
Merge pull request #6828 from explosion/master-tmp 2021-01-27 23:05:46 +11:00
Adriane Boyd d17afb4826
Add Spanish rule-based lemmatizer (#6833)
* Initial Spanish lemmatizer

* Handle merged verb+pron(s) multi-word tokens

* Use VERB for AUX rule lookup

* Add morph to lemma cache key

* Fix aux lookups, minor refactoring

* Improve verb+pron handling

* Move verb+pron handling into its own method
* Check for exceptions (primarily for se)
* Collect pronouns in the same (not reversed) order

* Only add modified possible lemmas
2021-01-27 19:21:35 +08:00
Ines Montani 615dba9d99 Fix tokenizer exceptions 2021-01-27 22:11:42 +11:00
Ines Montani e3f8be9a94 Update language data 2021-01-27 13:29:22 +11:00
Ines Montani 230e651ad6 Merge branch 'develop' into master-tmp 2021-01-27 13:26:29 +11:00
Adriane Boyd 71a6350744
Implement overwrite param for all custom lemmatizers (#6794) 2021-01-26 14:53:43 +11:00
muratjumashev 2b19ebad59 Remove Kyrgyz chars fr. char_classes since Tatar ones already cover 2021-01-25 00:46:45 +06:00
muratjumashev 53abf759ad Fix punctuation 2021-01-24 20:54:22 +06:00
muratjumashev 2a2646362b Fix language subclass 2021-01-23 22:00:50 +06:00
muratjumashev fe3b5b8ff5 Add kyrgyz to char_classes 2021-01-23 21:53:41 +06:00
muratjumashev e30bbf5432 Add examples 2021-01-23 21:49:08 +06:00
muratjumashev 2f385385a9 Remove comment 2021-01-23 21:36:28 +06:00
muratjumashev d53724ba1d Add lex_attrs 2021-01-23 21:35:25 +06:00
muratjumashev 4418ec2eee Add punctuation 2021-01-23 21:31:31 +06:00
muratjumashev 101d265778 Add stopwords 2021-01-23 21:25:28 +06:00
muratjumashev 28d06ab860 Add tokenizer_exceptions 2021-01-22 23:08:41 +06:00