11649 Commits

Author SHA1 Message Date
Vu Ha 6d465ec52c
add oprd to the list of accepted deps for noun chunking (#6302)
* add oprd to the list of accepted deps for noun chunking

* add SCA
2020-11-05 09:17:35 +01:00
Adriane Boyd 31de700b0f
Fix on_match callback and remove empty patterns (#6312)
For the `DependencyMatcher`:

* Fix on_match callback so that it is called once per matched pattern
* Fix results so that patterns with empty match lists are not returned
2020-11-05 09:16:26 +01:00
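The two `DependencyMatcher` fixes in 31de700b0f can be illustrated with a minimal sketch in plain Python. It does not use spaCy: `collect_matches`, the `candidates` dict, and the simplified two-argument callback are all hypothetical stand-ins, not the library's API.

```python
def collect_matches(candidates, on_match=None):
    """Return (key, match_lists) pairs, skipping patterns that matched nothing."""
    # Patterns with empty match lists are not returned.
    results = [(key, hits) for key, hits in candidates.items() if hits]
    if on_match is not None:
        # The callback fires once per matched pattern, not once per match list.
        for i in range(len(results)):
            on_match(i, results)
    return results


calls = []
# Hypothetical pattern keys mapped to lists of matched token indices.
candidates = {"founded": [[1, 0, 2]], "acquired": [], "hired": [[4, 3], [7, 6]]}
results = collect_matches(
    candidates, on_match=lambda i, matches: calls.append(matches[i][0])
)
```

Here `"acquired"` is dropped because it has no matches, and `"hired"` triggers the callback once even though it has two match lists.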
Adriane Boyd 45c9a68828
Identify final Matcher pattern node by quantifier (#6317)
Modify the internal pattern representation in `Matcher` patterns to
identify the final ID state using a unique quantifier rather than a
combination of other attributes.

It was insufficient to identify the final ID node based on an
uninitialized `quantifier` (which coincidentally has the same value as
`ZERO`) combined with `nr_attr` being 0. (In addition, it was potentially
bug-prone that `nr_attr` was set to 0 even though attrs were allocated.)

In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr`
is 0 and the quantifier is ZERO, so the previous methods for
incrementing to the ID node at the end of the pattern weren't able to
distinguish the final ID node from the `{"OP": "!"}` pattern.
2020-10-31 12:18:48 +01:00
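The ambiguity described in 45c9a68828 can be sketched in plain Python. This is an illustration only, assuming a simplified node layout: the `Node` class, the `FINAL_ID` member name, and both lookup functions are hypothetical and are not taken from the `Matcher` source.

```python
from dataclasses import dataclass
from enum import IntEnum


class Quantifier(IntEnum):
    ZERO = 0       # also the value an uninitialized field would hold
    ZERO_ONE = 1
    ZERO_PLUS = 2
    ONE = 3
    ONE_PLUS = 4
    FINAL_ID = 5   # hypothetical dedicated marker for the final ID node


@dataclass
class Node:
    quantifier: Quantifier
    nr_attr: int


def find_id_node_old(nodes):
    """Old heuristic: the ID node is the first ZERO node with no attrs."""
    for i, node in enumerate(nodes):
        if node.quantifier == Quantifier.ZERO and node.nr_attr == 0:
            return i
    raise ValueError("no ID node found")


def find_id_node_new(nodes):
    """New rule: the ID node is the one carrying the unique quantifier."""
    for i, node in enumerate(nodes):
        if node.quantifier == Quantifier.FINAL_ID:
            return i
    raise ValueError("no ID node found")


# Pattern [{"OP": "!"}]: one token node (ZERO, no attrs) followed by the ID node.
old_layout = [Node(Quantifier.ZERO, 0), Node(Quantifier.ZERO, 0)]
new_layout = [Node(Quantifier.ZERO, 0), Node(Quantifier.FINAL_ID, 0)]

old_index = find_id_node_old(old_layout)  # stops at the token node by mistake
new_index = find_id_node_new(new_layout)  # reaches the real ID node
```

With the old heuristic the `{"OP": "!"}` token node is indistinguishable from the ID node, so the search stops one node early; the dedicated quantifier removes the ambiguity.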
Duygu Altinok 0e55f806dd
Turkish tokenization improvements (#6268)
* added single and paired orth variants

* added token match

* added long text tokenization test

* inverted init

* normalized lemmas to lowercase

* more abbrevs

* tests for ordinals and abbrevs

* separated period abbrevs to another list

* fix typo

* added ordinal and abbrev tests

* added number tests for dates

* minor refinement

* added inflected abbrevs regex

* added percentage and inflection

* cosmetics

* added token match

* added url inflection tests

* excluded url tokens from custom pattern

* removed url match import
2020-10-29 09:43:17 +01:00
Adriane Boyd 8cc5ed6771 Add Macedonian to website languages 2020-10-29 08:49:56 +01:00
Ines Montani 1e4d7e059f Revert "Test FUNDING.yml [ci skip]"
This reverts commit 287be48ad0.
2020-10-28 17:42:23 +01:00
Ines Montani 287be48ad0 Test FUNDING.yml [ci skip] 2020-10-28 17:36:25 +01:00
Adriane Boyd 4dd86306e9
Add Nepali to supported languages on website (#6315) 2020-10-28 16:32:07 +01:00
Robert Šípek 260c29794a
Fill contributor agreement by robertsipek (#6285)
* Fill contributor agreement by robertsipek

* Fill contributor agreement by robertsipek
2020-10-22 22:13:17 +02:00
Kunal Sharma 01aec7a313
Adding MindMeld to Universe JSON (#6275)
* Adding Mindmeld to Universe JSON

Mindmeld is a conversational AI platform for deep-domain voice interfaces and chatbots. https://www.mindmeld.com/

* Signing contribution agreement.

Co-authored-by: kunshar2 <kunshar2@cisco.com>
2020-10-21 18:42:11 +02:00
Ines Montani d7a4e8454b
Merge pull request #6274 from walterhenry/master
User contributor agreement
2020-10-19 16:30:58 +02:00
walterhenry ff82644746 User contributor agreement
Here it is!
2020-10-19 16:25:09 +02:00
Ines Montani 3851300e80 Update landing [ci skip] 2020-10-16 11:46:33 +02:00
Borijan Georgievski 2311192ba1
Include Macedonian language (#6230)
* Include Macedonian language

* Fix indentation at char_classes.py

* Fix indentation at char_classes.py

* Add Macedonian tests, update lex_attrs and char_classes

* Import unicode literals for python 2
2020-10-15 15:55:01 +02:00
Ines Montani bc027dc35c Update .gitignore [ci skip] 2020-10-15 12:43:35 +02:00
Ines Montani a3b84c7656 Update netlify.toml [ci skip] 2020-10-15 12:42:30 +02:00
Ines Montani 07a976b036
Merge pull request #6221 from baranitharan2020/master 2020-10-13 11:03:49 +02:00
Ines Montani 7f92a5ee6a
Update spacy/lang/ta/examples.py 2020-10-13 11:03:35 +02:00
Baranitharan d6037c1860
added sentence 2020-10-08 08:22:58 +05:30
Baranitharan 169857e0ec
Merge pull request #1 from baranitharan2020/baranitharan2020-patch-1
Update examples.py
2020-10-08 08:17:57 +05:30
Baranitharan 81afe9b19d
Update examples.py 2020-10-08 08:17:25 +05:30
Sofie Van Landeghem 241cd112f5
add reenabled pipe names back to the meta before serializing (#6219) 2020-10-08 00:44:16 +02:00
Sofie Van Landeghem 2998131416
Reproducibility for TextCat and Tok2Vec (#6218)
* ensure fixed seed in HashEmbed layers

* forgot about the joys of python 2
2020-10-08 00:43:46 +02:00
Wannaphong Phatthiyaphaibun 9fc8392b38
Add Thai tag map (LST20 Corpus) (#6163)
* Add Thai tag map (LST20 Corpus)

By @korakot

* Update tag_map.py

* Update tag_map.py

* Update tag_map.py
2020-10-07 11:12:01 +02:00
Duygu Altinok 7e821c2776
Turkish language syntax iterators (#6191)
* added tr_vocab to config

* basic test

* added syntax iterator to Turkish lang class

* first version for Turkish syntax iter, without flat

* added simple tests with nmod, amod, det

* more tests to amod and nmod

* separated noun chunks and parser test

* rearrangement after nchunk parser separation

* added recursive NPs

* tests with complicated recursive NPs

* tests with conjed NPs

* additional tests for conj NP

* small modification for shaving off conj from NP

* added tests with flat

* more tests with flat

* added examples with flats conjed

* added inner func for flat trick

* corrected parse

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-07 11:07:52 +02:00
Duygu Altinok 2ce6fc2611
Turkish tag map and morph rules addition (#6141)
* feat: added turkish tag map

* feat: morph rules cconj and sconj

* feat: more conjuncts

* feat: added popular postpositions

* feat: added adverbs

* feat: added personal pronouns

* feat: added reflexive pronouns

* minor: corrected case capital

* minor: fixed comma typo

* feat: added indef pronouns

* feat: added dict iter

* fixed comma typo

* updated language class with tag map and morph

* use default tag map instead

* removed tag map
2020-10-07 10:27:36 +02:00
Duygu Altinok b95a11dd95
Ordinal numbers for Turkish (#6142)
* minor ordinal number addition

* fixed typo

* added corresponding lexical test
2020-10-07 10:25:37 +02:00
Rahul Gupta 1a00bff06d
Hindi: Adds tests for lexical attributes (norm and like_num) (#5829)
* Hindi: Adds tests for lexical attributes (norm and like_num)

* Signs and adds the contributor agreement

* Add ordinal numbers to be tagged as like_num

* Adds alternate pronunciation for 31 and 39
2020-10-07 10:23:32 +02:00
Nuccy90 c809b2c8e7
Update morph_rules.py (#6102)
* Update morph_rules.py

Added "dig" and "dej" ("you" in accusative form)

* Create Nuccy90.md

* Update Nuccy90.md
2020-10-06 15:14:47 +02:00
delzac 15ea401b39
Reflect in usage doc that IS_SENT_START attribute exists (#6114)
* Reflect in usage doc that IS_SENT_START attribute exists

* Create delzac.md
2020-10-06 15:11:01 +02:00
Šarūnas Navickas 047fb9f8b8
Website (Universe): An entry for rita-dsl (#6138)
* Create zaibacu.md

* Add RITA-DSL entry

* Update agreement

* Fix formatting
2020-10-06 11:19:36 +02:00
Florijan Stamenković 9db670b996
Fix Issue 6207 (#6208)
* Regression test for issue 6207

* Fix issue 6207

* Sign contributor agreement

* Minor adjustments to test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-06 11:17:37 +02:00
Stanislav Schmidt 3589a64d44
Change type of texts argument in pipe to iterable (#6186)
* Change type of texts argument in pipe to iterable

* Add contributor agreement
2020-10-02 21:00:11 +02:00
Yohei Tamura 3243ddac8f
Fix/span.sent (#6083)
* add fail test

* fix test

* fix span.sent

* Remove incorrect implicit check

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-01 14:01:52 +02:00
Elijah Rippeth 4cbb954281
reorder so tagmap is replaced only if a custom file is provided. (#6164)
* reorder so tagmap is replaced only if a custom file is provided.

* Remove unneeded variable initialization

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-09-30 13:26:06 +02:00
Ines Montani 27c5795ea5 Fix version check in models directory [ci skip] 2020-09-25 09:23:29 +02:00
Muhammad Fahmi Rasyid 7489d02dea
Update Indonesian Example Phrases (#6124)
* create contributor agreement

* Update Indonesian example. (see #1107)

Update the Indonesian examples with more appropriate phrases. The current phrases contain sensitive and violent words.
2020-09-23 14:02:26 +02:00
Adriane Boyd e4acb28658
Fix norm in retokenizer split (#6111)
Parallel to behavior in merge, reset norm on original token in
retokenizer split.
2020-09-22 21:53:33 +02:00
Adriane Boyd 9b4979407d
Fix overlapping German noun chunks (#6112)
Add a similar fix as in #5470 to prevent the German noun chunks iterator
from producing overlapping spans.
2020-09-22 21:52:42 +02:00
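One common way to keep an iterator from producing overlapping spans, as 9b4979407d sets out to do, is to remember where the previous span ended. The sketch below is a generic illustration on `(start, end)` tuples with an exclusive end; `drop_overlaps` is a hypothetical helper and is not the code from the German syntax iterator.

```python
def drop_overlaps(spans):
    """Yield spans in order, skipping any that start before the previous one ended."""
    prev_end = -1
    for start, end in spans:
        if start < prev_end:
            # This candidate overlaps the span that was just yielded.
            continue
        prev_end = end
        yield start, end


candidates = [(0, 3), (2, 5), (5, 7), (6, 9)]
kept = list(drop_overlaps(candidates))
```

The spans `(2, 5)` and `(6, 9)` are skipped because each starts inside the span kept before it.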
Adriane Boyd 4625029370
Add pin for pyrsistent<0.17.0 (#6116)
Add pin for pyrsistent<0.17.0 since pyrsistent>=0.17.1 is only
compatible with python3.5+.
2020-09-22 19:04:49 +02:00
Marek Grzenkowicz a26f864ed3
Clarify how to choose pretrained weights files (closes #6027) [ci skip] (#6039) 2020-09-08 21:13:50 +02:00
Ines Montani 33d9c64977 Fix outbound link and update package lock [ci skip] 2020-09-04 14:44:38 +02:00
Ines Montani ba6cf9821f Replace docs analytics [ci skip] 2020-09-04 14:28:28 +02:00
holubvl3 0a27fca557
Create examples.py (#5985)
* Create examples.py

* Create tag_map.py

* Delete tag_map.py

* Update examples.py

formatting: add empty line

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-09-04 11:00:14 +02:00
Brad Jascob 2160aafec6
Updates spaCy Universe for amrlib (#6020)
* Updates spaCy Universe for amrlib

* Updates to doc based on feedback
2020-09-04 10:03:35 +02:00
Marek Grzenkowicz 92d7832a86
Fix off-by-one error for best iteration calculation (closes #6014) (#6016) 2020-09-02 15:15:45 +02:00
Sofie Van Landeghem f7a25d69f7
Bugfix in merge_entities (#6005)
* failing test

* bugfix
2020-09-01 21:57:52 +02:00
Juan Gutiérrez 9002bea29f
Update suffixes example (#5989)
* Update suffixes example

The current example will throw `TypeError: can only concatenate list (not "tuple") to list`

* Signing Contributor Agreement
2020-08-31 12:44:56 +02:00
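The `TypeError` quoted in 9002bea29f is what Python raises when a tuple is added to a list. A minimal reproduction follows, assuming the default suffixes are stored as a list; `default_suffixes` is a hypothetical stand-in and not the library's actual defaults.

```python
# Hypothetical stand-in for a language's default suffix patterns (a list).
default_suffixes = [r"\)", r"\.", r","]

try:
    # Adding a tuple to a list fails.
    broken = default_suffixes + (r"-+$",)
except TypeError as err:
    error_message = str(err)

# Adding a list to a list works.
fixed = default_suffixes + [r"-+$"]

# Converting first also works if the defaults might be a tuple instead.
also_fixed = list(default_suffixes) + [r"-+$"]
```

Wrapping the defaults in `list(...)` keeps the example working regardless of whether the defaults are a list or a tuple.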
Adriane Boyd caf23462eb
Add 3rd party licenses (#5959) 2020-08-26 15:23:59 +02:00
Adriane Boyd 7d7b65ffd4
Fix raw strings in URL pattern (#5972)
Add missing raw string specifiers.
2020-08-26 04:00:49 +02:00
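Why a missing raw string specifier matters in a regular expression can be shown with a small example. The commit does not say which escapes were affected in the URL pattern; `\b` is used here only because its effect is easy to observe.

```python
import re

plain = "\bcat\b"   # without r"", "\b" is a backspace character
raw = r"\bcat\b"    # with r"", "\b" reaches the regex engine as a word boundary

plain_length = len(plain)  # backspace, c, a, t, backspace
raw_length = len(raw)      # backslash, b, c, a, t, backslash, b

plain_match = re.search(plain, "a cat sat")  # looks for literal backspaces
raw_match = re.search(raw, "a cat sat")      # looks for the whole word "cat"
```

The non-raw pattern silently searches for backspace characters and finds nothing, which is why such bugs are easy to miss.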