spaCy/spacy/lang/fa/lemmatizer/_nouns_exc.py

782 lines
25 KiB
Python

💫 Port master changes over to develop (#2979) * Create aryaprabhudesai.md (#2681) * Update _install.jade (#2688) Typo fix: "models" -> "model" * Add FAC to spacy.explain (resolves #2706) * Remove docstrings for deprecated arguments (see #2703) * When calling getoption() in conftest.py, pass a default option (#2709) * When calling getoption() in conftest.py, pass a default option This is necessary to allow testing an installed spacy by running: pytest --pyargs spacy * Add contributor agreement * update bengali token rules for hyphen and digits (#2731) * Less norm computations in token similarity (#2730) * Less norm computations in token similarity * Contributor agreement * Remove ')' for clarity (#2737) Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intention than please let me know. * added contributor agreement for mbkupfer (#2738) * Basic support for Telugu language (#2751) * Lex _attrs for polish language (#2750) * Signed spaCy contributor agreement * Added polish version of english lex_attrs * Introduces a bulk merge function, in order to solve issue #653 (#2696) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions * Describe converters more explicitly (see #2643) * Add multi-threading note to Language.pipe (resolves #2582) [ci skip] * Fix formatting * Fix dependency scheme docs (closes #2705) [ci skip] * Don't set stop word in example (closes #2657) [ci skip] * Add words to portuguese language _num_words (#2759) * Add words to portuguese language _num_words * Add words to portuguese language _num_words * Update Indonesian model (#2752) * adding e-KTP in tokenizer exceptions list * add exception token * removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception * add tokenizer exceptions list * combining base_norms 
with norm_exceptions * adding norm_exception * fix double key in lemmatizer * remove unused import on punctuation.py * reformat stop_words to reduce number of lines, improve readibility * updating tokenizer exception * implement is_currency for lang/id * adding orth_first_upper in tokenizer_exceptions * update the norm_exception list * remove bunch of abbreviations * adding contributors file * Fixed spaCy+Keras example (#2763) * bug fixes in keras example * created contributor agreement * Adding French hyphenated first name (#2786) * Fix typo (closes #2784) * Fix typo (#2795) [ci skip] Fixed typo on line 6 "regcognizer --> recognizer" * Adding basic support for Sinhala language. (#2788) * adding Sinhala language package, stop words, examples and lex_attrs. * Adding contributor agreement * Updating contributor agreement * Also include lowercase norm exceptions * Fix error (#2802) * Fix error ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function * added spaCy Contributor Agreement * Add charlax's contributor agreement (#2805) * agreement of contributor, may I introduce a tiny pl languge contribution (#2799) * Contributors agreement * Contributors agreement * Contributors agreement * Add jupyter=True to displacy.render in documentation (#2806) * Revert "Also include lowercase norm exceptions" This reverts commit 70f4e8adf37cfcfab60be2b97d6deae949b30e9e. * Remove deprecated encoding argument to msgpack * Set up dependency tree pattern matching skeleton (#2732) * Fix bug when too many entity types. 
Fixes #2800 * Fix Python 2 test failure * Require older msgpack-numpy * Restore encoding arg on msgpack-numpy * Try to fix version pin for msgpack-numpy * Update Portuguese Language (#2790) * Add words to portuguese language _num_words * Add words to portuguese language _num_words * Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols * Extended punctuation and norm_exceptions in the Portuguese language * Correct error in spacy universe docs concerning spacy-lookup (#2814) * Update Keras Example for (Parikh et al, 2016) implementation (#2803) * bug fixes in keras example * created contributor agreement * baseline for Parikh model * initial version of parikh 2016 implemented * tested asymmetric models * fixed grevious error in normalization * use standard SNLI test file * begin to rework parikh example * initial version of running example * start to document the new version * start to document the new version * Update Decompositional Attention.ipynb * fixed calls to similarity * updated the README * import sys package duh * simplified indexing on mapping word to IDs * stupid python indent error * added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround * Fix typo (closes #2815) [ci skip] * Update regex version dependency * Set version to 2.0.13.dev3 * Skip seemingly problematic test * Remove problematic test * Try previous version of regex * Revert "Remove problematic test" This reverts commit bdebbef45552d698d390aa430b527ee27830f11b. * Unskip test * Try older version of regex * 💫 Update training examples and use minibatching (#2830) <!--- Provide a general summary of your changes in the title. --> ## Description Update the training examples in `/examples/training` to show usage of spaCy's `minibatch` and `compounding` helpers ([see here](https://spacy.io/usage/training#tips-batch-size) for details). 
The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experienced slow and unsatisfying results. ### Types of change enhancements ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Visual C++ link updated (#2842) (closes #2841) [ci skip] * New landing page * Add contribution agreement * Correcting lang/ru/examples.py (#2845) * Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement * Correct some grammatical inaccuracies in lang\ru\examples.py * Move contributor agreement to separate file * Set version to 2.0.13.dev4 * Add Persian(Farsi) language support (#2797) * Also include lowercase norm exceptions * Remove in favour of https://github.com/explosion/spaCy/graphs/contributors * Rule-based French Lemmatizer (#2818) <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class. ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? 
--> - Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version. - Add several files containing exhaustive list of words for each part of speech - Add some lemma rules - Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX - Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned - Modify the lemmatize function to check in lookup table as a last resort - Init files are updated so the model can support all the functionalities mentioned above - Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [X] I have submitted the spaCy Contributor Agreement. - [X] I ran the tests, and all new and existing tests passed. - [X] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Set version to 2.0.13 * Fix formatting and consistency * Update docs for new version [ci skip] * Increment version [ci skip] * Add info on wheels [ci skip] * Adding "This is a sentence" example to Sinhala (#2846) * Add wheels badge * Update badge [ci skip] * Update README.rst [ci skip] * Update murmurhash pin * Increment version to 2.0.14.dev0 * Update GPU docs for v2.0.14 * Add wheel to setup_requires * Import prefer_gpu and require_gpu functions from Thinc * Add tests for prefer_gpu() and require_gpu() * Update requirements and setup.py * Workaround bug in thinc require_gpu * Set version to v2.0.14 * Update push-tag script * Unhack prefer_gpu * Require thinc 6.10.6 * Update prefer_gpu and require_gpu docs [ci skip] * Fix specifiers for GPU * Set version to 2.0.14.dev1 * Set version to 2.0.14 * Update Thinc version pin * Increment version * Fix msgpack-numpy version pin * Increment version * Update version to 2.0.16 * Update version [ci skip] * Redundant ')' in the Stop words' example (#2856) <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so much be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit 62358dd867d15bc6a475942dff34effba69dd70a. 
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor indo (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.) ### Types of change bug fix ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Ddding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests
2018-11-29 15:30:29 +00:00
# coding: utf8
from __future__ import unicode_literals
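# Exception table for the Persian noun lemmatizer: each key is a plural
# (or otherwise inflected) surface form, and each value is a tuple of one
# or more candidate lemmas.
NOUNS_EXC = {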
"آثار": ("اثر",),
"آرا": ("رأی",),
"آراء": ("رأی",),
"آفات": ("آفت",),
"اباطیل": ("باطل",),
"ائمه": ("امام",),
"ابرار": ("بر",),
"ابعاد": ("بعد",),
"ابنیه": ("بنا",),
"ابواب": ("باب",),
"ابیات": ("بیت",),
"اجداد": ("جد",),
"اجساد": ("جسد",),
"اجناس": ("جنس",),
"اثمار": ("ثمر",),
"اجرام": ("جرم",),
"اجسام": ("جسم",),
"اجنه": ("جن",),
"احادیث": ("حدیث",),
"احجام": ("حجم",),
"احرار": ("حر",),
"احزاب": ("حزب",),
"احکام": ("حکم",),
"اخبار": ("خبر",),
"اخیار": ("خیر",),
"ادبا": ("ادیب",),
"ادعیه": ("دعا",),
"ادله": ("دلیل",),
"ادوار": ("دوره",),
"ادیان": ("دین",),
"اذهان": ("ذهن",),
"اذکار": ("ذکر",),
"اراضی": ("ارض",),
"ارزاق": ("رزق",),
"ارقام": ("رقم",),
"ارواح": ("روح",),
"ارکان": ("رکن",),
"ازمنه": ("زمان",),
"اساتید": ("استاد",),
"اساطیر": ("اسطوره",),
"اسامی": ("اسم",),
"اسرار": ("سر",),
"اسما": ("اسم",),
"اسناد": ("سند",),
"اسیله": ("سوال",),
"اشجار": ("شجره",),
"اشخاص": ("شخص",),
"اشرار": ("شر",),
"اشربه": ("شراب",),
"اشعار": ("شعر",),
"اشقیا": ("شقی",),
"اشیا": ("شی",),
"اشباح": ("شبح",),
"اصدقا": ("صدیق",),
"اصناف": ("صنف",),
"اصنام": ("صنم",),
"اصوات": ("صوت",),
"اصول": ("اصل",),
"اضداد": ("ضد",),
"اطبا": ("طبیب",),
"اطعمه": ("طعام",),
"اطفال": ("طفل",),
"الطاف": ("لطف",),
"اعدا": ("عدو",),
"اعزا": ("عزیز",),
"اعضا": ("عضو",),
"اعماق": ("عمق",),
"الفاظ": ("لفظ",),
"اعناب": ("عنب",),
"اغذیه": ("غذا",),
"اغراض": ("غرض",),
"افراد": ("فرد",),
"افعال": ("فعل",),
"افلاک": ("فلک",),
"افکار": ("فکر",),
"اقالیم": ("اقلیم",),
"اقربا": ("قریب",),
"اقسام": ("قسم",),
"اقشار": ("قشر",),
"اقفال": ("قفل",),
"اقلام": ("قلم",),
"اقوال": ("قول",),
"اقوام": ("قوم",),
"البسه": ("لباس",),
"الحام": ("لحم",),
"الحکام": ("الحاکم",),
"القاب": ("لقب",),
"الواح": ("لوح",),
"الکبار": ("الکبیر",),
"اماکن": ("مکان",),
"امثال": ("مثل",),
"امراض": ("مرض",),
"امم": ("امت",),
"امواج": ("موج",),
"اموال": ("مال",),
"امور": ("امر",),
"امیال": ("میل",),
"انبیا": ("نبی",),
"انجم": ("نجم",),
"انظار": ("نظر",),
"انفس": ("نفس",),
"انهار": ("نهر",),
"انواع": ("نوع",),
"اهالی": ("اهل",),
"اهداف": ("هدف",),
"اواخر": ("آخر",),
"اواسط": ("وسط",),
"اوایل": ("اول",),
"اوراد": ("ورد",),
"اوراق": ("ورق",),
"اوزان": ("وزن",),
"اوصاف": ("وصف",),
"اوضاع": ("وضع",),
"اوقات": ("وقت",),
"اولاد": ("ولد",),
"اولیا": ("ولی",),
"اولیاء": ("ولی",),
"اوهام": ("وهم",),
"اکاذیب": ("اکذوبه",),
"اکفان": ("کفن",),
"ایالات": ("ایالت",),
"ایام": ("یوم",),
"ایتام": ("یتیم",),
"بشایر": ("بشارت",),
"بصایر": ("بصیرت",),
"بطون": ("بطن",),
"بنادر": ("بندر",),
"بیوت": ("بیت",),
"تجار": ("تاجر",),
"تجارب": ("تجربه",),
"تدابیر": ("تدبیر",),
"تعاریف": ("تعریف",),
"تلامیذ": ("تلمیذ",),
"تهم": ("تهمت",),
"توابیت": ("تابوت",),
"تواریخ": ("تاریخ",),
"جبال": ("جبل",),
"جداول": ("جدول",),
"جدود": ("جد",),
"جراثیم": ("جرثوم",),
"جرایم": ("جرم",),
"جرائم": ("جرم",),
"جزئیات": ("جزء",),
"جزایر": ("جزیره",),
"جزییات": ("جزء",),
"جنایات": ("جنایت",),
"جهات": ("جهت",),
"جوامع": ("جامعه",),
"حدود": ("حد",),
"حروف": ("حرف",),
"حقایق": ("حقیقت",),
"حقوق": ("حق",),
"حوادث": ("حادثه",),
"حواشی": ("حاشیه",),
"حوایج": ("حاجت",),
"حوائج": ("حاجت",),
"حکما": ("حکیم",),
"خدمات": ("خدمت",),
"خدمه": ("خادم",),
"خدم": ("خادم",),
"خزاین": ("خزینه",),
"خصایص": ("خصیصه",),
"خطوط": ("خط",),
"دراهم": ("درهم",),
"دروس": ("درس",),
"دفاتر": ("دفتر",),
"دلایل": ("دلیل",),
"دلائل": ("دلیل",),
"ذخایر": ("ذخیره",),
"ذنوب": ("ذنب",),
"ربوع": ("ربع",),
"رجال": ("رجل",),
"رسایل": ("رساله",),
"رسوم": ("رسم",),
"روابط": ("رابطه",),
"روسا": ("رئیس",),
"رئوس": ("راس",),
"ریوس": ("راس",),
"زوار": ("زائر",),
"ساعات": ("ساعت",),
"سبل": ("سبیل",),
"سطوح": ("سطح",),
"سطور": ("سطر",),
"سعدا": ("سعید",),
"سفن": ("سفینه",),
"سقاط": ("ساقی",),
"سلاطین": ("سلطان",),
"سلایق": ("سلیقه",),
"سموم": ("سم",),
"سنن": ("سنت",),
"سنین": ("سن",),
"سهام": ("سهم",),
"سوابق": ("سابقه",),
"سواحل": ("ساحل",),
"سوانح": ("سانحه",),
"شباب": ("شاب",),
"شرایط": ("شرط",),
"شروط": ("شرط",),
"شرکا": ("شریک",),
"شعب": ("شعبه",),
"شعوب": ("شعب",),
"شموس": ("شمس",),
"شهدا": ("شهید",),
"شهور": ("شهر",),
"شواهد": ("شاهد",),
"شوون": ("شان",),
"شکات": ("شاکی",),
"شیاطین": ("شیطان",),
"صبیان": ("صبی",),
"صحف": ("صحیفه",),
"صغار": ("صغیر",),
"صفوف": ("صف",),
"صنادیق": ("صندوق",),
"ضعفا": ("ضعیف",),
"ضمایر": ("ضمیر",),
"ضوابط": ("ضابطه",),
"طرق": ("طریق",),
"طلاب": ("طلبه",),
"طواغیت": ("طاغوت",),
"طیور": ("طیر",),
"عادات": ("عادت",),
"عباد": ("عبد",),
"عبارات": ("عبارت",),
"عجایب": ("عجیب",),
"عزایم": ("عزیمت",),
"عشایر": ("عشیره",),
"عطور": ("عطر",),
"عظما": ("عظیم",),
"عقاید": ("عقیده",),
"عقائد": ("عقیده",),
"علائم": ("علامت",),
"علایم": ("علامت",),
"علما": ("عالم",),
"علوم": ("علم",),
"عمال": ("عمله",),
"عناصر": ("عنصر",),
"عناوین": ("عنوان",),
"عواطف": ("عاطفه",),
"عواقب": ("عاقبت",),
"عوالم": ("عالم",),
"عوامل": ("عامل",),
"عیوب": ("عیب",),
"عیون": ("عین",),
"غدد": ("غده",),
"غرف": ("غرفه",),
"غیوب": ("غیب",),
"غیوم": ("غیم",),
"فرایض": ("فریضه",),
"فضایل": ("فضیلت",),
"فضلا": ("فاضل",),
"فواصل": ("فاصله",),
"فواید": ("فایده",),
"قبایل": ("قبیله",),
"قرون": ("قرن",),
"قصص": ("قصه",),
"قضات": ("قاضی",),
"قضایا": ("قضیه",),
"قلل": ("قله",),
"قلوب": ("قلب",),
"قواعد": ("قاعده",),
"قوانین": ("قانون",),
"قیود": ("قید",),
"لطایف": ("لطیفه",),
"لیالی": ("لیل",),
"مباحث": ("مبحث",),
"مبالغ": ("مبلغ",),
"متون": ("متن",),
"مجالس": ("مجلس",),
"محاصیل": ("محصول",),
"محافل": ("محفل",),
"محاکم": ("محکمه",),
"مخارج": ("خرج",),
"مدارس": ("مدرسه",),
"مدارک": ("مدرک",),
"مداین": ("مدینه",),
"مدن": ("مدینه",),
"مراتب": ("مرتبه",),
"مراتع": ("مرتع",),
"مراجع": ("مرجع",),
"مراحل": ("مرحله",),
"مسائل": ("مسئله",),
"مساجد": ("مسجد",),
"مساعی": ("سعی",),
"مسالک": ("مسلک",),
"مساکین": ("مسکین",),
"مسایل": ("مسئله",),
"مشاعر": ("مشعر",),
"مشاغل": ("شغل",),
"مشایخ": ("شیخ",),
"مصادر": ("مصدر",),
"مصادق": ("مصداق",),
"مصادیق": ("مصداق",),
"مصاعب": ("مصعب",),
"مضار": ("ضرر",),
"مضامین": ("مضمون",),
"مطالب": ("مطلب",),
"مظالم": ("مظلمه",),
"مظاهر": ("مظهر",),
"اهرام": ("هرم",),
"معابد": ("معبد",),
"معابر": ("معبر",),
"معاجم": ("معجم",),
"معادن": ("معدن",),
"معاذیر": ("عذر",),
"معارج": ("معراج",),
"معاصی": ("معصیت",),
"معالم": ("معلم",),
"معایب": ("عیب",),
"مفاسد": ("مفسده",),
"مفاصل": ("مفصل",),
"مفاهیم": ("مفهوم",),
"مقابر": ("مقبره",),
"مقاتل": ("مقتل",),
"مقادیر": ("مقدار",),
"مقاصد": ("مقصد",),
"مقاطع": ("مقطع",),
"ملابس": ("ملبس",),
"ملوک": ("ملک",),
"ممالک": ("مملکت",),
"منابع": ("منبع",),
"منازل": ("منزل",),
"مناسبات": ("مناسبت",),
"مناسک": ("منسک",),
"مناطق": ("منطقه",),
"مناظر": ("منظره",),
"منافع": ("منفعت",),
"موارد": ("مورد",),
"مواضع": ("موضع",),
"مواضیع": ("موضوع",),
"مواطن": ("موطن",),
"مواقع": ("موقع",),
"موانع": ("مانع",),
"مکاتب": ("مکتب",),
"مکاتیب": ("مکتوب",),
"مکارم": ("مکرمه",),
"میادین": ("میدان",),
"نتایج": ("نتیجه",),
"نعم": ("نعمت",),
"نفوس": ("نفس",),
"نقاط": ("نقطه",),
"نواحی": ("ناحیه",),
"نوافذ": ("نافذه",),
"نواقص": ("نقص",),
"نوامیس": ("ناموس",),
"نکات": ("نکته",),
"نیات": ("نیت",),
"هدایا": ("هدیه",),
"واقعیات": ("واقعیت",),
"وجوه": ("وجه",),
"وحوش": ("وحش",),
"وزرا": ("وزیر",),
"وسایل": ("وسیله",),
"وصایا": ("وصیت",),
"وظایف": ("وظیفه",),
"وعاظ": ("واعظ",),
"وقایع": ("واقعه",),
"کتب": ("کتاب",),
"کسبه": ("کاسب",),
"کفار": ("کافر",),
"کواکب": ("کوکب",),
"تصاویر": ("تصویر",),
"صنوف": ("صنف",),
"اجزا": ("جزء",),
"اجزاء": ("جزء",),
"ذخائر": ("ذخیره",),
"خسارات": ("خسارت",),
"عشاق": ("عاشق",),
"تصانیف": ("تصنیف",),
"دﻻیل": ("دلیل",),
"قوا": ("قوه",),
"ملل": ("ملت",),
"جوایز": ("جایزه",),
"جوائز": ("جایزه",),
"ابعاض": ("بعض",),
"اتباع": ("تبعه",),
"اجلاس": ("جلسه",),
"احشام": ("حشم",),
"اخلاف": ("خلف",),
"ارامنه": ("ارمنی",),
"ازواج": ("زوج",),
"اسباط": ("سبط",),
"اعداد": ("عدد",),
"اعصار": ("عصر",),
"اعقاب": ("عقبه",),
"اعیاد": ("عید",),
"اعیان": ("عین",),
"اغیار": ("غیر",),
"اقارب": ("اقرب",),
"اقران": ("قرن",),
"اقساط": ("قسط",),
"امنای": ("امین",),
"امنا": ("امین",),
"اموات": ("میت",),
"اناجیل": ("انجیل",),
"انحا": ("نحو",),
"انساب": ("نسب",),
"انوار": ("نور",),
"اوامر": ("امر",),
"اوائل": ("اول",),
"اوصیا": ("وصی",),
"آحاد": ("احد",),
"براهین": ("برهان",),
"تعابیر": ("تعبیر",),
"تعالیم": ("تعلیم",),
"تفاسیر": ("تفسیر",),
"تکالیف": ("تکلیف",),
"تماثیل": ("تمثال",),
"جنود": ("جند",),
"جوانب": ("جانب",),
"حاجات": ("حاجت",),
"حرکات": ("حرکت",),
"حضرات": ("حضرت",),
"حکایات": ("حکایت",),
"حوالی": ("حول",),
"خصایل": ("خصلت",),
"خلایق": ("خلق",),
"خلفا": ("خلیفه",),
"دعاوی": ("دعوا",),
"دیون": ("دین",),
"ذراع": ("ذرع",),
"رعایا": ("رعیت",),
"روایات": ("روایت",),
"شعرا": ("شاعر",),
"شکایات": ("شکایت",),
"شهوات": ("شهوت",),
"شیوخ": ("شیخ",),
"شئون": ("شأن",),
"طبایع": ("طبع",),
"ظروف": ("ظرف",),
"ظواهر": ("ظاهر",),
"عبادات": ("عبادت",),
"عرایض": ("عریضه",),
"عرفا": ("عارف",),
"عروق": ("عرق",),
"عساکر": ("عسکر",),
"علماء": ("عالم",),
"فتاوا": ("فتوا",),
"فراعنه": ("فرعون",),
"فرامین": ("فرمان",),
"فروض": ("فرض",),
"فروع": ("فرع",),
"فصول": ("فصل",),
"فقها": ("فقیه",),
"قبور": ("قبر",),
"قبوض": ("قبض",),
"قدوم": ("قدم",),
"قرائات": ("قرائت",),
"قرائن": ("قرینه",),
"لغات": ("لغت",),
"مجامع": ("مجمع",),
"مخازن": ("مخزن",),
"مدارج": ("درجه",),
"مذاهب": ("مذهب",),
"مراکز": ("مرکز",),
"مصارف": ("مصرف",),
"مطامع": ("طمع",),
"معانی": ("معنی",),
"مناصب": ("منصب",),
"منافذ": ("منفذ",),
"مواریث": ("میراث",),
"موازین": ("میزان",),
"موالی": ("مولی",),
"مواهب": ("موهبت",),
"نسوان": ("نسا",),
"نصوص": ("نص",),
"نظایر": ("نظیر",),
"نقایص": ("نقص",),
"نقوش": ("نقش",),
"ولایات": ("ولایت",),
"هیئات": ("هیأت",),
"جماهیر": ("جمهوری",),
"خصائص": ("خصیصه",),
"دقایق": ("دقیقه",),
"رذایل": ("رذیلت",),
"طوایف": ("طایفه",),
"علامات": ("علامت",),
"علایق": ("علاقه",),
"علل": ("علت",),
"غرایز": ("غریزه",),
"غرائز": ("غریزه",),
"غنایم": ("غنیمت",),
"فرائض": ("فریضه",),
"فضائل": ("فضیلت",),
"فقرا": ("فقیر",),
"فلاسفه": ("فیلسوف",),
"فواحش": ("فاحشه",),
"قصائد": ("قصیده",),
"قصاید": ("قصیده",),
"قوائد": ("قائده",),
"مزارع": ("مزرعه",),
"مصائب": ("مصیبت",),
"معارف": ("معرفت",),
"نصایح": ("نصیحت",),
"وثایق": ("وثیقه",),
"وظائف": ("وظیفه",),
"توابین": ("تواب",),
"رفقا": ("رفیق",),
"رقبا": ("رقیب",),
"زحمات": ("زحمت",),
"زعما": ("زعیم",),
"زوایا": ("زاویه",),
"سماوات": ("سما",),
"علوفه": ("علف",),
"غایات": ("غایت",),
"فنون": ("فن",),
"لذات": ("لذت",),
"نعمات": ("نعمت",),
"امراء": ("امیر",),
"امرا": ("امیر",),
"دهاقین": ("دهقان",),
"سنوات": ("سنه",),
"عمارات": ("عمارت",),
"فتوح": ("فتح",),
"لذائذ": ("لذیذ",),
"لذایذ": ("لذیذ", "لذت"),
"تکایا": ("تکیه",),
"صفات": ("صفت",),
"خصوصیات": ("خصوصیت",),
"کیفیات": ("کیفیت",),
"حملات": ("حمله",),
"شایعات": ("شایعه",),
"صدمات": ("صدمه",),
"غلات": ("غله",),
"کلمات": ("کلمه",),
"مبارزات": ("مبارزه",),
"مراجعات": ("مراجعه",),
"مطالبات": ("مطالبه",),
"مکاتبات": ("مکاتبه",),
"نشریات": ("نشریه",),
"بحور": ("بحر",),
"تحقیقات": ("تحقیق",),
"مکالمات": ("مکالمه",),
"ریزمکالمات": ("ریزمکالمه",),
"تجربیات": ("تجربه",),
"جملات": ("جمله",),
"حالات": ("حالت",),
"حجاج": ("حاجی",),
"حسنات": ("حسنه",),
"حشرات": ("حشره",),
"خاطرات": ("خاطره",),
"درجات": ("درجه",),
"دفعات": ("دفعه",),
"سیارات": ("سیاره",),
"شبهات": ("شبهه",),
"ضایعات": ("ضایعه",),
"ضربات": ("ضربه",),
"طبقات": ("طبقه",),
"فرضیات": ("فرضیه",),
"قطرات": ("قطره",),
"قطعات": ("قطعه",),
"قلاع": ("قلعه",),
"کشیشان": ("کشیش",),
"مادیات": ("مادی",),
"مباحثات": ("مباحثه",),
"مجاهدات": ("مجاهدت",),
"محلات": ("محله",),
"مداخلات": ("مداخله",),
"مشقات": ("مشقت",),
"معادلات": ("معادله",),
"معوقات": ("معوقه",),
"منویات": ("منویه",),
"موقوفات": ("موقوفه",),
"موسسات": ("موسسه",),
"حلقات": ("حلقه",),
"ایات": ("ایه",),
"اصلح": ("صالح",),
"اظهر": ("ظاهر",),
"آیات": ("آیه",),
"برکات": ("برکت",),
"جزوات": ("جزوه",),
"خطابات": ("خطابه",),
"دوایر": ("دایره",),
"روحیات": ("روحیه",),
"متهمان": ("متهم",),
"مجاری": ("مجرا",),
"مشترکات": ("مشترک",),
"ورثه": ("وارث",),
"وکلا": ("وکیل",),
"نقبا": ("نقیب",),
"سفرا": ("سفیر",),
"مآخذ": ("مأخذ",),
"احوال": ("حال",),
"آلام": ("الم",),
"مزایا": ("مزیت",),
"عقلا": ("عاقل",),
"مشاهد": ("مشهد",),
"ظلمات": ("ظلمت",),
"خفایا": ("خفیه",),
"مشاهدات": ("مشاهده",),
"امامان": ("امام",),
"سگان": ("سگ",),
"نظریات": ("نظریه",),
"آفاق": ("افق",),
"آمال": ("امل",),
"دکاکین": ("دکان",),
"قصبات": ("قصبه",),
"مضرات": ("مضرت",),
"قبائل": ("قبیله",),
"مجانین": ("مجنون",),
"سيئات": ("سیئه",),
"صدقات": ("صدقه",),
"کثافات": ("کثافت",),
"کسورات": ("کسر",),
"معالجات": ("معالجه",),
"مقابلات": ("مقابله",),
"مناظرات": ("مناظره",),
"ناملايمات": ("ناملایمت",),
"وجوهات": ("وجه",),
"مصادرات": ("مصادره",),
"ملمعات": ("ملمع",),
"اولویات": ("اولویت",),
"جمرات": ("جمره",),
"زیارات": ("زیارت",),
"عقبات": ("عقبه",),
"کرامات": ("کرامت",),
"مراقبات": ("مراقبه",),
"نجاسات": ("نجاست",),
"هجویات": ("هجو",),
"تبدلات": ("تبدل",),
"روات": ("راوی",),
"فیوضات": ("فیض",),
"کفارات": ("کفاره",),
"نذورات": ("نذر",),
"حفریات": ("حفر",),
"عنایات": ("عنایت",),
"جراحات": ("جراحت",),
"ثمرات": ("ثمره",),
"حکام": ("حاکم",),
"مرسولات": ("مرسوله",),
"درایات": ("درایت",),
"سیئات": ("سیئه",),
"عدوات": ("عداوت",),
"عشرات": ("عشره",),
"عقوبات": ("عقوبه",),
"عقودات": ("عقود",),
"کثرات": ("کثرت",),
"مواجهات": ("مواجهه",),
"مواصلات": ("مواصله",),
"اجوبه": ("جواب",),
"اضلاع": ("ضلع",),
"السنه": ("لسان",),
"اشتات": ("شت",),
"دعوات": ("دعوت",),
"صعوبات": ("صعوبت",),
"عفونات": ("عفونت",),
"علوفات": ("علوفه",),
"غرامات": ("غرامت",),
"فارقات": ("فارقت",),
"لزوجات": ("لزوجت",),
"محللات": ("محلله",),
"مسافات": ("مسافت",),
"مسافحات": ("مسافحه",),
"مسامرات": ("مسامره",),
"مستلذات": ("مستلذ",),
"مسرات": ("مسرت",),
"مشافهات": ("مشافهه",),
"مشاهرات": ("مشاهره",),
"معروشات": ("معروشه",),
"مجادلات": ("مجادله",),
"ابغاض": ("بغض",),
"اجداث": ("جدث",),
"اجواز": ("جوز",),
"اجواد": ("جواد",),
"ازاهیر": ("ازهار",),
"عوائد": ("عائده",),
"احافیر": ("احفار",),
"احزان": ("حزن",),
"آنام": ("انام",),
"احباب": ("حبیب",),
"نوابغ": ("نابغه",),
"بینات": ("بینه",),
"حوالات": ("حواله",),
"حوالجات": ("حواله",),
"دستجات": ("دسته",),
"شمومات": ("شموم",),
"طاقات": ("طاقه",),
"علاقات": ("علاقه",),
"مراسلات": ("مراسله",),
"موجهات": ("موجه",),
"اقویا": ("قوی",),
"اغنیا": ("غنی",),
"بلایا": ("بلا",),
"خطایا": ("خطا",),
"ثنایا": ("ثنا",),
"لوایح": ("لایحه",),
"غزلیات": ("غزل",),
"اشارات": ("اشاره",),
"رکعات": ("رکعت",),
"امثالهم": ("مثل",),
"تشنجات": ("تشنج",),
"امانات": ("امانت",),
"بریات": ("بریت",),
"توست": ("تو",),
"حبست": ("حبس",),
"حیثیات": ("حیثیت",),
"شامات": ("شامه",),
"قبالات": ("قباله",),
"قرابات": ("قرابت",),
"مطلقات": ("مطلقه",),
"نزلات": ("نزله",),
"بکمان": ("بکیم",),
"روشان": ("روشن",),
"مسانید": ("مسند",),
"ناحیت": ("ناحیه",),
"رسوله": ("رسول",),
"دانشجویان": ("دانشجو",),
"روحانیون": ("روحانی",),
"انقلابیون": ("انقلابی",),
"مجاهدین": ("مجاهد",),
"محققین": ("محقق",),
"متهمین": ("متهم",),
"مهندسین": ("مهندس",),
"مؤمنین": ("مؤمن",),
"مسئولین": ("مسئول",),
"مشرکین": ("مشرک",),
"مخاطبین": ("مخاطب",),
"مأمورین": ("مأمور",),
"منتخبین": ("منتخب",),
"متحدین": ("متحد",),
"متخصصین": ("متخصص",),
"مسوولین": ("مسوول",),
"مباشرین": ("مباشر",),
"منتقدین": ("منتقد",),
"موسسین": ("موسس",),
"مسؤلین": ("مسؤل",),
"متحجرین": ("متحجر",),
"مهاجرین": ("مهاجر",),
"مترجمین": ("مترجم",),
"مدعوین": ("مدعو",),
"مشترکین": ("مشترک",),
"معصومین": ("معصوم",),
"مسابقات": ("مسابقه",),
"مطالعات": ("مطالعه",),
"بزرگان": ("بزرگ",),
"شیعیان": ("شیعه",),
"مذاکرات": ("مذاکره",),
"مقالات": ("مقاله",),
"عباراتی": ("عبارت",),
"سالیان": ("سال",),
"خلقیات": ("خلق",),
"ابیاتی": ("بیت",),
"محرمات": ("حرام",),
"اخلاقیات": ("اخلاق",),
"سبزیجات": ("سبزی",),
"اضافات": ("اضافه",),
The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experienced slow and unsatisfying results. ### Types of change enhancements ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Visual C++ link updated (#2842) (closes #2841) [ci skip] * New landing page * Add contribution agreement * Correcting lang/ru/examples.py (#2845) * Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement * Correct some grammatical inaccuracies in lang\ru\examples.py * Move contributor agreement to separate file * Set version to 2.0.13.dev4 * Add Persian(Farsi) language support (#2797) * Also include lowercase norm exceptions * Remove in favour of https://github.com/explosion/spaCy/graphs/contributors * Rule-based French Lemmatizer (#2818) <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class. ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? 
--> - Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version. - Add several files containing exhaustive list of words for each part of speech - Add some lemma rules - Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX - Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned - Modify the lemmatize function to check in lookup table as a last resort - Init files are updated so the model can support all the functionalities mentioned above - Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [X] I have submitted the spaCy Contributor Agreement. - [X] I ran the tests, and all new and existing tests passed. - [X] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Set version to 2.0.13 * Fix formatting and consistency * Update docs for new version [ci skip] * Increment version [ci skip] * Add info on wheels [ci skip] * Adding "This is a sentence" example to Sinhala (#2846) * Add wheels badge * Update badge [ci skip] * Update README.rst [ci skip] * Update murmurhash pin * Increment version to 2.0.14.dev0 * Update GPU docs for v2.0.14 * Add wheel to setup_requires * Import prefer_gpu and require_gpu functions from Thinc * Add tests for prefer_gpu() and require_gpu() * Update requirements and setup.py * Workaround bug in thinc require_gpu * Set version to v2.0.14 * Update push-tag script * Unhack prefer_gpu * Require thinc 6.10.6 * Update prefer_gpu and require_gpu docs [ci skip] * Fix specifiers for GPU * Set version to 2.0.14.dev1 * Set version to 2.0.14 * Update Thinc version pin * Increment version * Fix msgpack-numpy version pin * Increment version * Update version to 2.0.16 * Update version [ci skip] * Redundant ')' in the Stop words' example (#2856) <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so much be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit 62358dd867d15bc6a475942dff34effba69dd70a. 
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor indo (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.) ### Types of change bug fix ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Ddding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests
}