spaCy/spacy/lang/id/_tokenizer_exceptions_list.py

# coding: utf8
from __future__ import unicode_literals
ID_BASE_EXCEPTIONS = set(
"""
aba-aba
abah-abah
abal-abal
abang-abang
abar-abar
abong-abong
abrit-abrit
abrit-abritan
abu-abu
abuh-abuhan
abuk-abuk
abun-abun
acak-acak
acak-acakan
acang-acang
acap-acap
aci-aci
aci-acian
aci-acinya
aco-acoan
ad-blocker
ad-interim
ada-ada
ada-adanya
ada-adanyakah
adang-adang
adap-adapan
add-on
add-ons
adik-adik
adik-beradik
aduk-adukan
after-sales
agak-agak
agak-agih
agama-agama
agar-agar
age-related
agut-agut
air-air
air-cooled
air-to-air
ajak-ajak
ajar-ajar
aji-aji
akal-akal
akal-akalan
akan-akan
akar-akar
akar-akaran
akhir-akhir
akhir-akhirnya
aki-aki
aksi-aksi
alah-mengalahi
alai-belai
alan-alan
alang-alang
alang-alangan
alap-alap
alat-alat
ali-ali
alif-alifan
alih-alih
aling-aling
aling-alingan
alip-alipan
all-electric
all-in-one
all-out
all-time
alon-alon
alt-right
alt-text
alu-alu
alu-aluan
alun-alun
alur-alur
alur-aluran
always-on
amai-amai
amatir-amatiran
ambah-ambah
ambai-ambai
ambil-mengambil
ambreng-ambrengan
ambring-ambringan
ambu-ambu
ambung-ambung
amin-amin
amit-amit
ampai-ampai
amprung-amprungan
amung-amung
anai-anai
anak-anak
anak-anakan
anak-beranak
anak-cucu
anak-istri
ancak-ancak
ancang-ancang
ancar-ancar
andang-andang
andeng-andeng
aneh-aneh
angan-angan
anggar-anggar
anggaran-red
anggota-anggota
anggung-anggip
angin-angin
angin-anginan
angkal-angkal
angkul-angkul
angkup-angkup
angkut-angkut
ani-ani
aning-aning
anjang-anjang
anjing-anjing
anjung-anjung
anjung-anjungan
antah-berantah
antar-antar
antar-mengantar
ante-mortem
antek-antek
anter-anter
antihuru-hara
anting-anting
antung-antung
anyam-menganyam
anyang-anyang
apa-apa
apa-apaan
apel-apel
api-api
apit-apit
aplikasi-aplikasi
apotek-apotek
aprit-apritan
apu-apu
apung-apung
arah-arah
arak-arak
arak-arakan
aram-aram
arek-arek
arem-arem
ari-ari
artis-artis
aru-aru
arung-arungan
asa-asaan
asal-asalan
asal-muasal
asal-usul
asam-asaman
asas-asas
aset-aset
asmaul-husna
asosiasi-asosiasi
asuh-asuh
asyik-asyiknya
atas-mengatasi
ati-ati
atung-atung
aturan-aturan
audio-video
audio-visual
auto-brightness
auto-complete
auto-focus
auto-play
auto-update
avant-garde
awan-awan
awan-berawan
awang-awang
awang-gemawang
awar-awar
awat-awat
awik-awik
awut-awutan
ayah-anak
ayak-ayak
ayam-ayam
ayam-ayaman
ayang-ayang
ayat-ayat
ayeng-ayengan
ayun-temayun
ayut-ayutan
ba-bi-bu
back-to-back
back-up
badan-badan
bade-bade
badut-badut
bagi-bagi
bahan-bahan
bahu-membahu
baik-baik
bail-out
bajang-bajang
baji-baji
balai-balai
balam-balam
balas-berbalas
balas-membalas
bale-bale
baling-baling
ball-playing
balon-balon
balut-balut
band-band
bandara-bandara
bangsa-bangsa
bangun-bangun
bangunan-bangunan
bank-bank
bantah-bantah
bantahan-bantahan
bantal-bantal
banyak-banyak
bapak-anak
bapak-bapak
bapak-ibu
bapak-ibunya
barang-barang
barat-barat
barat-daya
barat-laut
barau-barau
bare-bare
bareng-bareng
bari-bari
barik-barik
baris-berbaris
baru-baru
baru-batu
barung-barung
basa-basi
bata-bata
batalyon-batalyon
batang-batang
batas-batas
batir-batir
batu-batu
batuk-batuk
batung-batung
bau-bauan
bawa-bawa
bayan-bayan
bayang-bayang
bayi-bayi
bea-cukai
bedeng-bedeng
bedil-bedal
bedil-bedilan
begana-begini
bek-bek
bekal-bekalan
bekerdom-kerdom
bekertak-kertak
belang-belang
belat-belit
beliau-beliau
belu-belai
belum-belum
benar-benar
benda-benda
bengang-bengut
benggal-benggil
bengkal-bengkil
bengkang-bengkok
bengkang-bengkong
bengkang-bengkung
benteng-benteng
bentuk-bentuk
benua-benua
ber-selfie
berabad-abad
berabun-rabun
beracah-acah
berada-ada
beradik-berkakak
beragah-agah
beragak-agak
beragam-ragam
beraja-raja
berakit-rakit
beraku-akuan
beralu-aluan
beralun-alun
beramah-ramah
beramah-ramahan
beramah-tamah
beramai-ramai
berambai-ambai
berambal-ambalan
berambil-ambil
beramuk-amuk
beramuk-amukan
berandai-andai
berandai-randai
beraneh-aneh
berang-berang
berangan-angan
beranggap-anggapan
berangguk-angguk
berangin-angin
berangka-angka
berangka-angkaan
berangkai-rangkai
berangkap-rangkapan
berani-berani
beranja-anja
berantai-rantai
berapi-api
berapung-apung
berarak-arakan
beras-beras
berasak-asak
berasak-asakan
berasap-asap
berasing-asingan
beratus-ratus
berawa-rawa
berawas-awas
berayal-ayalan
berayun-ayun
berbagai-bagai
berbahas-bahasan
berbahasa-bahasa
berbaik-baikan
berbait-bait
berbala-bala
berbalas-balasan
berbalik-balik
berbalun-balun
berbanjar-banjar
berbantah-bantah
berbanyak-banyak
berbarik-barik
berbasa-basi
berbasah-basah
berbatu-batu
berbayang-bayang
berbecak-becak
berbeda-beda
berbedil-bedilan
berbega-bega
berbeka-beka
berbelah-belah
berbelakang-belakangan
berbelang-belang
berbelau-belauan
berbeli-beli
berbeli-belian
berbelit-belit
berbelok-belok
berbenang-benang
berbenar-benar
berbencah-bencah
berbencol-bencol
berbenggil-benggil
berbentol-bentol
berbentong-bentong
berberani-berani
berbesar-besar
berbidai-bidai
berbiduk-biduk
berbiku-biku
berbilik-bilik
berbinar-binar
berbincang-bincang
berbingkah-bingkah
berbintang-bintang
berbintik-bintik
berbintil-bintil
berbisik-bisik
berbolak-balik
berbolong-bolong
berbondong-bondong
berbongkah-bongkah
berbuai-buai
berbual-bual
berbudak-budak
berbukit-bukit
berbulan-bulan
berbunga-bunga
berbuntut-buntut
berbunuh-bunuhan
berburu-buru
berburuk-buruk
berbutir-butir
bercabang-cabang
bercaci-cacian
bercakap-cakap
bercakar-cakaran
bercamping-camping
bercantik-cantik
bercari-cari
bercari-carian
bercarik-carik
bercarut-carut
bercebar-cebur
bercepat-cepat
bercerai-berai
bercerai-cerai
bercetai-cetai
berciap-ciap
bercikun-cikun
bercinta-cintaan
bercita-cita
berciut-ciut
bercompang-camping
berconteng-conteng
bercoreng-coreng
bercoreng-moreng
bercuang-caing
bercuit-cuit
bercumbu-cumbu
bercumbu-cumbuan
bercura-bura
bercura-cura
berdada-dadaan
berdahulu-dahuluan
berdalam-dalam
berdalih-dalih
berdampung-dampung
berdebar-debar
berdecak-decak
berdecap-decap
berdecup-decup
berdecut-decut
berdedai-dedai
berdegap-degap
berdegar-degar
berdeham-deham
berdekah-dekah
berdekak-dekak
berdekap-dekapan
berdekat-dekat
berdelat-delat
berdembai-dembai
berdembun-dembun
berdempang-dempang
berdempet-dempet
berdencing-dencing
berdendam-dendaman
berdengkang-dengkang
berdengut-dengut
berdentang-dentang
berdentum-dentum
berdentung-dentung
berdenyar-denyar
berdenyut-denyut
berdepak-depak
berdepan-depan
berderai-derai
berderak-derak
berderam-deram
berderau-derau
berderik-derik
berdering-dering
berderung-derung
berderus-derus
berdesak-desakan
berdesik-desik
berdesing-desing
berdesus-desus
berdikit-dikit
berdingkit-dingkit
berdua-dua
berduri-duri
berduru-duru
berduyun-duyun
berebut-rebut
berebut-rebutan
beregang-regang
berek-berek
berembut-rembut
berempat-empat
berenak-enak
berencel-encel
bereng-bereng
berenggan-enggan
berenteng-renteng
beresa-esaan
beresah-resah
berfoya-foya
bergagah-gagahan
bergagap-gagap
bergagau-gagau
bergalur-galur
berganda-ganda
berganjur-ganjur
berganti-ganti
bergarah-garah
bergaruk-garuk
bergaya-gaya
bergegas-gegas
bergelang-gelang
bergelap-gelap
bergelas-gelasan
bergeleng-geleng
bergemal-gemal
bergembar-gembor
bergembut-gembut
bergepok-gepok
bergerek-gerek
bergesa-gesa
bergilir-gilir
bergolak-golak
bergolek-golek
bergolong-golong
bergores-gores
bergotong-royong
bergoyang-goyang
bergugus-gugus
bergulung-gulung
bergulut-gulut
bergumpal-gumpal
bergunduk-gunduk
bergunung-gunung
berhadap-hadapan
berhamun-hamun
berhandai-handai
berhanyut-hanyut
berhari-hari
berhati-hati
berhati-hatilah
berhektare-hektare
berhilau-hilau
berhormat-hormat
berhujan-hujan
berhura-hura
beri-beri
beri-memberi
beria-ia
beria-ria
beriak-riak
beriba-iba
beribu-ribu
berigi-rigi
berimpit-impit
berindap-indap
bering-bering
beringat-ingat
beringgit-ringgit
berintik-rintik
beriring-iring
beriring-iringan
berita-berita
berjabir-jabir
berjaga-jaga
berjagung-jagung
berjalan-jalan
berjalar-jalar
berjalin-jalin
berjalur-jalur
berjam-jam
berjari-jari
berjauh-jauhan
berjegal-jegalan
berjejal-jejal
berjela-jela
berjengkek-jengkek
berjenis-jenis
berjenjang-jenjang
berjilid-jilid
berjinak-jinak
berjingkat-jingkat
berjingkik-jingkik
berjingkrak-jingkrak
berjongkok-jongkok
berjubel-jubel
berjujut-jujutan
berjulai-julai
berjumbai-jumbai
berjumbul-jumbul
berjuntai-juntai
berjurai-jurai
berjurus-jurus
berjuta-juta
berka-li-kali
berkabu-kabu
berkaca-kaca
berkaing-kaing
berkait-kaitan
berkala-kala
berkali-kali
berkamit-kamit
berkanjar-kanjar
berkaok-kaok
berkarung-karung
berkasak-kusuk
berkasih-kasihan
berkata-kata
berkatak-katak
berkecai-kecai
berkecek-kecek
berkecil-kecil
berkecil-kecilan
berkedip-kedip
berkejang-kejang
berkejap-kejap
berkejar-kejaran
berkelar-kelar
berkelepai-kelepai
berkelip-kelip
berkelit-kelit
berkelok-kelok
berkelompok-kelompok
berkelun-kelun
berkembur-kembur
berkempul-kempul
berkena-kenaan
berkenal-kenalan
berkendur-kendur
berkeok-keok
berkepak-kepak
berkepal-kepal
berkeping-keping
berkepul-kepul
berkeras-kerasan
berkering-kering
berkeritik-keritik
berkeruit-keruit
berkerut-kerut
berketai-ketai
berketak-ketak
berketak-ketik
berketap-ketap
berketap-ketip
berketar-ketar
berketi-keti
berketil-ketil
berketuk-ketak
berketul-ketul
berkial-kial
berkian-kian
berkias-kias
berkias-kiasan
berkibar-kibar
berkilah-kilah
berkilap-kilap
berkilat-kilat
berkilau-kilauan
berkilo-kilo
berkimbang-kimbang
berkinja-kinja
berkipas-kipas
berkira-kira
berkirim-kiriman
berkisar-kisar
berkoak-koak
berkoar-koar
berkobar-kobar
berkobok-kobok
berkocak-kocak
berkodi-kodi
berkolek-kolek
berkomat-kamit
berkopah-kopah
berkoper-koper
berkotak-kotak
berkuat-kuat
berkuat-kuatan
berkumur-kumur
berkunang-kunang
berkunar-kunar
berkunjung-kunjungan
berkurik-kurik
berkurun-kurun
berkusau-kusau
berkusu-kusu
berkusut-kusut
berkuting-kuting
berkutu-kutuan
berlabun-labun
berlain-lainan
berlaju-laju
berlalai-lalai
berlama-lama
berlambai-lambai
berlambak-lambak
berlampang-lampang
berlanggar-langgar
berlapang-lapang
berlapis-lapis
berlapuk-lapuk
berlarah-larah
berlarat-larat
berlari-lari
berlari-larian
berlarih-larih
berlarik-larik
berlarut-larut
berlawak-lawak
berlayap-layapan
berlebih-lebih
berlebih-lebihan
berleha-leha
berlekas-lekas
berlekas-lekasan
berlekat-lekat
berlekuk-lekuk
berlempar-lemparan
berlena-lena
berlengah-lengah
berlenggak-lenggok
berlenggek-lenggek
berlenggok-lenggok
berleret-leret
berletih-letih
berliang-liuk
berlibat-libat
berligar-ligar
berliku-liku
berlikur-likur
berlimbak-limbak
berlimpah-limpah
berlimpap-limpap
berlimpit-limpit
berlinang-linang
berlindak-lindak
berlipat-lipat
berlomba-lomba
berlompok-lompok
berloncat-loncatan
berlopak-lopak
berlubang-lubang
berlusin-lusin
bermaaf-maafan
bermabuk-mabukan
bermacam-macam
bermain-main
bermalam-malam
bermalas-malas
bermalas-malasan
bermanik-manik
bermanis-manis
bermanja-manja
bermasak-masak
bermati-mati
bermegah-megah
bermemek-memek
bermenung-menung
bermesra-mesraan
bermewah-mewah
bermewah-mewahan
berminggu-minggu
berminta-minta
berminyak-minyak
bermuda-muda
bermudah-mudah
bermuka-muka
bermula-mula
bermuluk-muluk
bermulut-mulut
bernafsi-nafsi
bernaka-naka
bernala-nala
bernanti-nanti
berniat-niat
bernyala-nyala
berogak-ogak
beroleng-oleng
berolok-olok
beromong-omong
beroncet-roncet
beronggok-onggok
berorang-orang
beroyal-royal
berpada-pada
berpadu-padu
berpahit-pahit
berpair-pair
berpal-pal
berpalu-palu
berpalu-paluan
berpalun-palun
berpanas-panas
berpandai-pandai
berpandang-pandangan
berpangkat-pangkat
berpanjang-panjang
berpantun-pantun
berpasang-pasang
berpasang-pasangan
berpasuk-pasuk
berpayah-payah
berpeluh-peluh
berpeluk-pelukan
berpenat-penat
berpencar-pencar
berpendar-pendar
berpenggal-penggal
berperai-perai
berperang-perangan
berpesai-pesai
berpesta-pesta
berpesuk-pesuk
berpetak-petak
berpeti-peti
berpihak-pihak
berpijar-pijar
berpikir-pikir
berpikul-pikul
berpilih-pilih
berpilin-pilin
berpindah-pindah
berpintal-pintal
berpirau-pirau
berpisah-pisah
berpolah-polah
berpolok-polok
berpongah-pongah
berpontang-panting
berporah-porah
berpotong-potong
berpotong-potongan
berpuak-puak
berpual-pual
berpugak-pugak
berpuing-puing
berpukas-pukas
berpuluh-puluh
berpulun-pulun
berpuntal-puntal
berpura-pura
berpusar-pusar
berpusing-pusing
berpusu-pusu
berputar-putar
berrumpun-rumpun
bersaf-saf
bersahut-sahutan
bersakit-sakit
bersalah-salahan
bersalam-salaman
bersalin-salin
bersalip-salipan
bersama-sama
bersambar-sambaran
bersambut-sambutan
bersampan-sampan
bersantai-santai
bersapa-sapaan
bersarang-sarang
bersedan-sedan
bersedia-sedia
bersedu-sedu
bersejuk-sejuk
bersekat-sekat
berselang-selang
berselang-seli
berselang-seling
berselang-tenggang
berselit-selit
berseluk-beluk
bersembunyi-sembunyi
bersembunyi-sembunyian
bersembur-semburan
bersempit-sempit
bersenang-senang
bersenang-senangkan
bersenda-senda
bersendi-sendi
bersenggang-senggang
bersenggau-senggau
bersepah-sepah
bersepak-sepakan
bersepi-sepi
berserak-serak
berseri-seri
berseru-seru
bersesak-sesak
bersetai-setai
bersia-sia
bersiap-siap
bersiar-siar
bersih-bersih
bersikut-sikutan
bersilir-silir
bersimbur-simburan
bersinau-sinau
bersopan-sopan
bersorak-sorai
bersuap-suapan
bersudah-sudah
bersuka-suka
bersuka-sukaan
bersuku-suku
bersulang-sulang
bersumpah-sumpahan
bersungguh-sungguh
bersungut-sungut
bersunyi-sunyi
bersuruk-surukan
bersusah-susah
bersusuk-susuk
bersusuk-susukan
bersutan-sutan
bertabur-tabur
bertahan-tahan
bertahu-tahu
bertahun-tahun
bertajuk-tajuk
bertakik-takik
bertala-tala
bertalah-talah
bertali-tali
bertalu-talu
bertalun-talun
bertambah-tambah
bertanda-tandaan
bertangis-tangisan
bertangkil-tangkil
bertanya-tanya
bertarik-tarikan
bertatai-tatai
bertatap-tatapan
bertatih-tatih
bertawan-tawan
bertawar-tawaran
bertebu-tebu
bertebu-tebukan
berteguh-teguh
berteguh-teguhan
berteka-teki
bertelang-telang
bertelau-telau
bertele-tele
bertembuk-tembuk
bertempat-tempat
bertempuh-tempuh
bertenang-tenang
bertenggang-tenggangan
bertentu-tentu
bertepek-tepek
berterang-terang
berterang-terangan
berteriak-teriak
bertikam-tikaman
bertimbal-timbalan
bertimbun-timbun
bertimpa-timpa
bertimpas-timpas
bertingkah-tingkah
bertingkat-tingkat
bertinjau-tinjauan
bertiras-tiras
bertitar-titar
bertitik-titik
bertoboh-toboh
bertolak-tolak
bertolak-tolakan
bertolong-tolongan
bertonjol-tonjol
bertruk-truk
bertua-tua
bertua-tuaan
bertual-tual
bertubi-tubi
bertukar-tukar
bertukar-tukaran
bertukas-tukas
bertumpak-tumpak
bertumpang-tindih
bertumpuk-tumpuk
bertunda-tunda
bertunjuk-tunjukan
bertura-tura
berturut-turut
bertutur-tutur
beruas-ruas
berubah-ubah
berulang-alik
berulang-ulang
berumbai-rumbai
berundak-undak
berundan-undan
berundung-undung
berunggas-runggas
berunggun-unggun
berunggut-unggut
berungkur-ungkuran
beruntai-untai
beruntun-runtun
beruntung-untung
berunyai-unyai
berupa-rupa
berura-ura
beruris-uris
berurut-urutan
berwarna-warna
berwarna-warni
berwindu-windu
berwiru-wiru
beryang-yang
besar-besar
besar-besaran
betak-betak
beti-beti
betik-betik
betul-betul
biang-biang
biar-biar
biaya-biaya
bicu-bicu
bidadari-bidadari
bidang-bidang
bijak-bijaklah
biji-bijian
bila-bila
bilang-bilang
bincang-bincang
bincang-bincut
bingkah-bingkah
bini-binian
bintang-bintang
bintik-bintik
bio-oil
biri-biri
biru-biru
biru-hitam
biru-kuning
bisik-bisik
biti-biti
blak-blakan
blok-blok
bocah-bocah
bohong-bohong
bohong-bohongan
bola-bola
bolak-balik
bolang-baling
boleh-boleh
bom-bom
bomber-bomber
bonek-bonek
bongkar-bangkir
bongkar-membongkar
bongkar-pasang
boro-boro
bos-bos
bottom-up
box-to-box
boyo-boyo
buah-buahan
buang-buang
buat-buatan
buaya-buaya
bubun-bubun
bugi-bugi
build-up
built-in
built-up
buka-buka
buka-bukaan
buka-tutup
bukan-bukan
bukti-bukti
buku-buku
bulan-bulan
bulan-bulanan
bulang-baling
bulang-bulang
bulat-bulat
buli-buli
bulu-bulu
buluh-buluh
bulus-bulus
bunga-bunga
bunga-bungaan
bunuh-membunuh
bunyi-bunyian
bupati-bupati
bupati-wakil
buru-buru
burung-burung
burung-burungan
bus-bus
business-to-business
busur-busur
butir-butir
by-pass
bye-bye
cabang-cabang
cabik-cabik
cabik-mencabik
cabup-cawabup
caci-maki
cagub-cawagub
caing-caing
cakar-mencakar
cakup-mencakup
calak-calak
calar-balar
caleg-caleg
calo-calo
calon-calon
campang-camping
campur-campur
capres-cawapres
cara-cara
cari-cari
cari-carian
carut-marut
catch-up
cawali-cawawali
cawe-cawe
cawi-cawi
cebar-cebur
celah-celah
celam-celum
celangak-celinguk
celas-celus
celedang-celedok
celengkak-celengkok
celingak-celinguk
celung-celung
cemas-cemas
cenal-cenil
cengar-cengir
cengir-cengir
cengis-cengis
cengking-mengking
centang-perenang
cepat-cepat
ceplas-ceplos
cerai-berai
cerita-cerita
ceruk-menceruk
ceruk-meruk
cetak-biru
cetak-mencetak
cetar-ceter
check-in
check-ins
check-up
chit-chat
choki-choki
cingak-cinguk
cipika-cipiki
ciri-ciri
ciri-cirinya
cirit-birit
cita-cita
cita-citaku
close-up
closed-circuit
coba-coba
cobak-cabik
cobar-cabir
cola-cala
colang-caling
comat-comot
comot-comot
compang-camping
computer-aided
computer-generated
condong-mondong
congak-cangit
conggah-canggih
congkah-cangkih
congkah-mangkih
copak-capik
copy-paste
corak-carik
corat-coret
coreng-moreng
coret-coret
crat-crit
cross-border
cross-dressing
crypto-ransomware
cuang-caing
cublak-cublak
cubung-cubung
culik-culik
cuma-cuma
cumi-cumi
cungap-cangip
cupu-cupu
dabu-dabu
daerah-daerah
dag-dag
dag-dig-dug
daging-dagingan
dahulu-mendahului
dalam-dalam
dali-dali
dam-dam
danau-danau
dansa-dansi
dapil-dapil
dapur-dapur
dari-dari
daru-daru
dasar-dasar
datang-datang
datang-mendatangi
daun-daun
daun-daunan
dawai-dawai
dayang-dayang
dayung-mayung
debak-debuk
debu-debu
deca-core
decision-making
deep-lying
deg-degan
degap-degap
dekak-dekak
dekat-dekat
dengar-dengaran
dengking-mendengking
departemen-departemen
depo-depo
deputi-deputi
desa-desa
desa-kota
desas-desus
detik-detik
dewa-dewa
dewa-dewi
dewan-dewan
dewi-dewi
dial-up
diam-diam
dibayang-bayangi
dibuat-buat
diiming-imingi
dilebih-lebihkan
dimana-mana
dimata-matai
dinas-dinas
dinul-Islam
diobok-obok
diolok-olok
direksi-direksi
direktorat-direktorat
dirjen-dirjen
dirut-dirut
ditunggu-tunggu
divisi-divisi
do-it-yourself
doa-doa
dog-dog
doggy-style
dokok-dokok
dolak-dalik
dor-doran
dorong-mendorong
dosa-dosa
dress-up
drive-in
dua-dua
dua-duaan
dua-duanya
dubes-dubes
duduk-duduk
dugaan-dugaan
dulang-dulang
duri-duri
duta-duta
dwi-kewarganegaraan
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so much be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit 62358dd867d15bc6a475942dff34effba69dd70a. 
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor indo (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.) ### Types of change bug fix ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Ddding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests
2018-11-29 15:30:29 +00:00
e-arena
e-billing
e-budgeting
e-cctv
e-class
e-commerce
e-counting
e-elektronik
e-entertainment
e-evolution
e-faktur
e-filing
e-fin
e-form
e-government
e-govt
e-hakcipta
e-id
e-info
e-katalog
e-ktp
e-leadership
e-lhkpn
e-library
e-loket
e-m1
e-money
e-news
e-nisn
e-npwp
e-paspor
e-paten
e-pay
e-perda
e-perizinan
e-planning
e-polisi
e-power
e-punten
e-retribusi
e-samsat
e-sport
e-store
e-tax
e-ticketing
e-tilang
e-toll
e-visa
e-voting
e-wallet
e-warong
ecek-ecek
eco-friendly
eco-park
edan-edanan
editor-editor
editor-in-chief
efek-efek
ekonomi-ekonomi
eksekutif-legislatif
ekspor-impor
elang-elang
elemen-elemen
emak-emak
embuh-embuhan
empat-empat
empek-empek
empet-empetan
empok-empok
empot-empotan
enak-enak
encal-encal
end-to-end
end-user
endap-endap
endut-endut
endut-endutan
engah-engah
engap-engap
enggan-enggan
engkah-engkah
engket-engket
entah-berentah
enten-enten
entry-level
equity-linked
erang-erot
erat-erat
erek-erek
ereng-ereng
erong-erong
esek-esek
ex-officio
exchange-traded
exercise-induced
extra-time
face-down
face-to-face
fair-play
fakta-fakta
faktor-faktor
fakultas-fakultas
fase-fase
fast-food
feed-in
fifty-fifty
file-file
first-leg
first-team
fitur-fitur
fitur-fiturnya
fixed-income
flip-flop
flip-plop
fly-in
follow-up
foto-foto
foya-foya
fraksi-fraksi
free-to-play
front-end
fungsi-fungsi
gaba-gaba
gabai-gabai
gada-gada
gading-gading
gadis-gadis
gado-gado
gail-gail
gajah-gajah
gajah-gajahan
gala-gala
galeri-galeri
gali-gali
gali-galian
galing-galing
galu-galu
gamak-gamak
gambar-gambar
gambar-menggambar
gamit-gamitan
gampang-gampangan
gana-gini
ganal-ganal
ganda-berganda
ganjal-mengganjal
ganjil-genap
ganteng-ganteng
gantung-gantung
gapah-gopoh
gara-gara
garah-garah
garis-garis
gasak-gasakan
gatal-gatal
gaun-gaun
gawar-gawar
gaya-gayanya
gayang-gayang
ge-er
gebyah-uyah
gebyar-gebyar
gedana-gedini
gedebak-gedebuk
gedebar-gedebur
gedung-gedung
gelang-gelang
gelap-gelapan
gelar-gelar
gelas-gelas
gelembung-gelembungan
geleng-geleng
geli-geli
geliang-geliut
geliat-geliut
gembar-gembor
gembrang-gembreng
gempul-gempul
gempur-menggempur
gendang-gendang
gengsi-gengsian
genjang-genjot
genjot-genjotan
genjrang-genjreng
genome-wide
geo-politik
gerabak-gerubuk
gerak-gerik
gerak-geriknya
gerakan-gerakan
gerbas-gerbus
gereja-gereja
gereng-gereng
geriak-geriuk
gerit-gerit
gerot-gerot
geruh-gerah
getak-getuk
getem-getem
geti-geti
gial-gial
gial-giul
gila-gila
gila-gilaan
gilang-gemilang
gilap-gemilap
gili-gili
giling-giling
gilir-bergilir
ginang-ginang
girap-girap
girik-girik
giring-giring
go-auto
go-bills
go-bluebird
go-box
go-car
go-clean
go-food
go-glam
go-jek
go-kart
* Set version to 2.0.13 * Fix formatting and consistency * Update docs for new version [ci skip] * Increment version [ci skip] * Add info on wheels [ci skip] * Adding "This is a sentence" example to Sinhala (#2846) * Add wheels badge * Update badge [ci skip] * Update README.rst [ci skip] * Update murmurhash pin * Increment version to 2.0.14.dev0 * Update GPU docs for v2.0.14 * Add wheel to setup_requires * Import prefer_gpu and require_gpu functions from Thinc * Add tests for prefer_gpu() and require_gpu() * Update requirements and setup.py * Workaround bug in thinc require_gpu * Set version to v2.0.14 * Update push-tag script * Unhack prefer_gpu * Require thinc 6.10.6 * Update prefer_gpu and require_gpu docs [ci skip] * Fix specifiers for GPU * Set version to 2.0.14.dev1 * Set version to 2.0.14 * Update Thinc version pin * Increment version * Fix msgpack-numpy version pin * Increment version * Update version to 2.0.16 * Update version [ci skip] * Redundant ')' in the Stop words' example (#2856) <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so much be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit 62358dd867d15bc6a475942dff34effba69dd70a. 
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor indo (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.) ### Types of change bug fix ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Ddding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests
2018-11-29 15:30:29 +00:00
go-mart
go-massage
go-med
go-points
go-pulsa
go-ride
go-send
go-shop
go-tix
go-to-market
goak-goak
goal-line
gol-gol
golak-galik
gondas-gandes
gonjang-ganjing
gonjlang-ganjling
gonta-ganti
gontok-gontokan
gorap-gorap
gorong-gorong
gotong-royong
gresek-gresek
gua-gua
gual-gail
gubernur-gubernur
gudu-gudu
gula-gula
gulang-gulang
gulung-menggulung
guna-ganah
guna-guna
gundala-gundala
guntang-guntang
gunung-ganang
gunung-gemunung
gunung-gunungan
guru-guru
habis-habis
habis-habisan
hak-hak
hak-hal
hakim-hakim
hal-hal
halai-balai
half-time
hama-hama
hampir-hampir
hancur-hancuran
hancur-menghancurkan
hands-free
hands-on
hang-out
hantu-hantu
happy-happy
harap-harap
harap-harapan
hard-disk
harga-harga
hari-hari
harimau-harimau
harum-haruman
hasil-hasil
hasta-wara
hat-trick
hati-hati
hati-hatilah
head-mounted
head-to-head
head-up
heads-up
heavy-duty
hebat-hebatan
hewan-hewan
hexa-core
hidup-hidup
hidup-mati
hila-hila
hilang-hilang
hina-menghinakan
hip-hop
hiru-biru
hiru-hara
hiruk-pikuk
hitam-putih
hitung-hitung
hitung-hitungan
hormat-menghormati
hot-swappable
hotel-hotel
how-to
hubar-habir
hubaya-hubaya
hukum-red
hukuman-hukuman
hula-hoop
hula-hula
hulu-hilir
humas-humas
hura-hura
huru-hara
ibar-ibar
ibu-anak
ibu-ibu
icak-icak
icip-icip
idam-idam
ide-ide
igau-igauan
ikan-ikan
ikut-ikut
ikut-ikutan
ilam-ilam
ilat-ilatan
ilmu-ilmu
imbang-imbangan
iming-iming
imut-imut
inang-inang
inca-binca
incang-incut
industri-industri
ingar-bingar
ingar-ingar
ingat-ingat
ingat-ingatan
ingau-ingauan
inggang-inggung
injak-injak
input-output
instansi-instansi
instant-on
instrumen-instrumen
inter-governmental
ira-ira
irah-irahan
iras-iras
iring-iringan
iris-irisan
isak-isak
isat-bb
iseng-iseng
istana-istana
istri-istri
isu-isu
iya-iya
jabatan-jabatan
jadi-jadian
jagoan-jagoan
jaja-jajaan
jaksa-jaksa
jala-jala
jalan-jalan
jali-jali
jalin-berjalin
jalin-menjalin
jam-jam
jamah-jamahan
jambak-jambakan
jambu-jambu
jampi-jampi
janda-janda
jangan-jangan
janji-janji
jarang-jarang
jari-jari
jaring-jaring
jarum-jarum
jasa-jasa
jatuh-bangun
jauh-dekat
jauh-jauh
jawi-jawi
jebar-jebur
jebat-jebatan
jegal-jegalan
jejak-jejak
jelang-menjelang
jelas-jelas
jelur-jelir
jembatan-jembatan
jenazah-jenazah
jendal-jendul
jenderal-jenderal
jenggar-jenggur
jenis-jenis
jenis-jenisnya
jentik-jentik
jerah-jerih
jinak-jinak
jiwa-jiwa
joli-joli
jolong-jolong
jongkang-jangking
jongkar-jangkir
jongkat-jangkit
jor-joran
jotos-jotosan
juak-juak
jual-beli
juang-juang
julo-julo
julung-julung
julur-julur
jumbai-jumbai
jungkang-jungkit
jungkat-jungkit
jurai-jurai
kabang-kabang
kabar-kabari
kabir-kabiran
kabruk-kabrukan
kabu-kabu
kabupaten-kabupaten
kabupaten-kota
kaca-kaca
kacang-kacang
kacang-kacangan
kacau-balau
kadang-kadang
kader-kader
kades-kades
kadis-kadis
kail-kail
kain-kain
kait-kait
kakak-adik
kakak-beradik
kakak-kakak
kakek-kakek
kakek-nenek
kaki-kaki
kala-kala
kalau-kalau
kaleng-kalengan
kali-kalian
kalimat-kalimat
kalung-kalung
kalut-malut
kambing-kambing
kamit-kamit
kampung-kampung
kampus-kampus
kanak-kanak
kanak-kanan
kanan-kanak
kanan-kiri
kangen-kangenan
kanwil-kanwil
kapa-kapa
kapal-kapal
kapan-kapan
kapolda-kapolda
kapolres-kapolres
kapolsek-kapolsek
kapu-kapu
karang-karangan
karang-mengarang
kareseh-peseh
karut-marut
karya-karya
kasak-kusuk
kasus-kasus
kata-kata
katang-katang
kava-kava
kawa-kawa
kawan-kawan
kawin-cerai
kawin-mawin
kayu-kayu
kayu-kayuan
ke-Allah-an
keabu-abuan
kearab-araban
keasyik-asyikan
kebarat-baratan
kebasah-basahan
kebat-kebit
kebata-bataan
kebayi-bayian
kebelanda-belandaan
keberlarut-larutan
kebesar-hatian
kebiasaan-kebiasaan
kebijakan-kebijakan
kebiru-biruan
kebudak-budakan
kebun-kebun
kebut-kebutan
kecamatan-kecamatan
kecentang-perenangan
kecil-kecil
kecil-kecilan
kecil-mengecil
kecokelat-cokelatan
kecomak-kecimik
kecuh-kecah
kedek-kedek
kedekak-kedekik
kedesa-desaan
kedubes-kedubes
kedutaan-kedutaan
keempat-empatnya
kegadis-gadisan
kegelap-gelapan
kegiatan-kegiatan
kegila-gilaan
kegirang-girangan
kehati-hatian
keheran-heranan
kehijau-hijauan
kehitam-hitaman
keinggris-inggrisan
kejaga-jagaan
kejahatan-kejahatan
kejang-kejang
kejar-kejar
kejar-kejaran
kejar-mengejar
kejingga-jinggaan
kejut-kejut
kejutan-kejutan
kekabur-kaburan
kekanak-kanakan
kekoboi-koboian
kekota-kotaan
kekuasaan-kekuasaan
kekuning-kuningan
kelak-kelik
kelak-keluk
kelaki-lakian
kelang-kelok
kelap-kelip
kelasah-kelusuh
kelek-kelek
kelek-kelekan
kelemak-kelemek
kelik-kelik
kelip-kelip
kelompok-kelompok
kelontang-kelantung
keluar-masuk
kelurahan-kelurahan
kelusuh-kelasah
kelut-melut
kemak-kemik
kemalu-maluan
kemana-mana
kemanja-manjaan
kemarah-marahan
kemasam-masaman
kemati-matian
kembang-kembang
kemenpan-rb
kementerian-kementerian
kemerah-merahan
kempang-kempis
kempas-kempis
kemuda-mudaan
kena-mengena
kenal-mengenal
kenang-kenangan
kencang-kencung
kencing-mengencingi
kencrang-kencring
kendang-kendang
kendang-kendangan
keningrat-ningratan
kentung-kentung
kenyat-kenyit
kepala-kepala
kepala-kepalaan
kepandir-pandiran
kepang-kepot
keperak-perakan
kepetah-lidahan
kepilu-piluan
keping-keping
kepucat-pucatan
kepuh-kepuh
kepura-puraan
keputih-putihan
kerah-kerahan
kerancak-rancakan
kerang-kerangan
kerang-keroh
kerang-kerot
kerang-keruk
kerang-kerung
kerap-kerap
keras-mengerasi
kercap-kercip
kercap-kercup
keriang-keriut
kerja-kerja
kernyat-kernyut
kerobak-kerabit
kerobak-kerobek
kerobak-kerobik
kerobat-kerabit
kerong-kerong
keropas-kerapis
kertak-kertuk
kertap-kertap
keruntang-pungkang
kesalahan-kesalahan
kesap-kesip
kesemena-menaan
kesenak-senakan
kesewenang-wenangan
kesia-siaan
kesik-kesik
kesipu-sipuan
kesu-kesi
kesuh-kesih
kesuk-kesik
ketakar-keteker
ketakutan-ketakutan
ketap-ketap
ketap-ketip
ketar-ketir
ketentuan-ketentuan
ketergesa-gesaan
keti-keti
ketidur-tiduran
ketiga-tiganya
ketir-ketir
ketua-ketua
ketua-tuaan
ketuan-tuanan
keungu-unguan
kewangi-wangian
ki-ka
kia-kia
kiai-kiai
kiak-kiak
kial-kial
kiang-kiut
kiat-kiat
kibang-kibut
kicang-kecoh
kicang-kicu
kick-off
kida-kida
kijang-kijang
kilau-mengilau
kili-kili
kilik-kilik
kincir-kincir
kios-kios
kira-kira
kira-kiraan
kiri-kanan
kirim-berkirim
kisah-kisah
kisi-kisi
kitab-kitab
kitang-kitang
kiu-kiu
klaim-klaim
klik-klikan
klip-klip
klub-klub
kluntang-klantung
knock-knock
knock-on
knock-out
ko-as
ko-pilot
koak-koak
koboi-koboian
kocah-kacih
kocar-kacir
kodam-kodam
kode-kode
kodim-kodim
kodok-kodok
kolang-kaling
kole-kole
koleh-koleh
kolong-kolong
koma-koma
komat-kamit
komisaris-komisaris
komisi-komisi
komite-komite
komoditas-komoditas
kongko-kongko
konsulat-konsulat
konsultan-konsultan
kontal-kantil
kontang-kanting
kontra-terorisme
kontrak-kontrak
konvensi-konvensi
kopat-kapit
koperasi-koperasi
kopi-kopi
koran-koran
koreng-koreng
kos-kosan
kosak-kasik
kota-kota
kota-wakil
kotak-katik
kotak-kotak
koyak-koyak
kuas-kuas
kuat-kuat
kubu-kubuan
kucar-kacir
kucing-kucing
kucing-kucingan
kuda-kuda
kuda-kudaan
kudap-kudap
kue-kue
kulah-kulah
💫 Port master changes over to develop (#2979) * Create aryaprabhudesai.md (#2681) * Update _install.jade (#2688) Typo fix: "models" -> "model" * Add FAC to spacy.explain (resolves #2706) * Remove docstrings for deprecated arguments (see #2703) * When calling getoption() in conftest.py, pass a default option (#2709) * When calling getoption() in conftest.py, pass a default option This is necessary to allow testing an installed spacy by running: pytest --pyargs spacy * Add contributor agreement * update bengali token rules for hyphen and digits (#2731) * Less norm computations in token similarity (#2730) * Less norm computations in token similarity * Contributor agreement * Remove ')' for clarity (#2737) Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intention than please let me know. * added contributor agreement for mbkupfer (#2738) * Basic support for Telugu language (#2751) * Lex _attrs for polish language (#2750) * Signed spaCy contributor agreement * Added polish version of english lex_attrs * Introduces a bulk merge function, in order to solve issue #653 (#2696) * Fix comment * Introduce bulk merge to increase performance on many span merges * Sign contributor agreement * Implement pull request suggestions * Describe converters more explicitly (see #2643) * Add multi-threading note to Language.pipe (resolves #2582) [ci skip] * Fix formatting * Fix dependency scheme docs (closes #2705) [ci skip] * Don't set stop word in example (closes #2657) [ci skip] * Add words to portuguese language _num_words (#2759) * Add words to portuguese language _num_words * Add words to portuguese language _num_words * Update Indonesian model (#2752) * adding e-KTP in tokenizer exceptions list * add exception token * removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception * add tokenizer exceptions list * combining base_norms 
with norm_exceptions * adding norm_exception * fix double key in lemmatizer * remove unused import on punctuation.py * reformat stop_words to reduce number of lines, improve readibility * updating tokenizer exception * implement is_currency for lang/id * adding orth_first_upper in tokenizer_exceptions * update the norm_exception list * remove bunch of abbreviations * adding contributors file * Fixed spaCy+Keras example (#2763) * bug fixes in keras example * created contributor agreement * Adding French hyphenated first name (#2786) * Fix typo (closes #2784) * Fix typo (#2795) [ci skip] Fixed typo on line 6 "regcognizer --> recognizer" * Adding basic support for Sinhala language. (#2788) * adding Sinhala language package, stop words, examples and lex_attrs. * Adding contributor agreement * Updating contributor agreement * Also include lowercase norm exceptions * Fix error (#2802) * Fix error ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function * added spaCy Contributor Agreement * Add charlax's contributor agreement (#2805) * agreement of contributor, may I introduce a tiny pl languge contribution (#2799) * Contributors agreement * Contributors agreement * Contributors agreement * Add jupyter=True to displacy.render in documentation (#2806) * Revert "Also include lowercase norm exceptions" This reverts commit 70f4e8adf37cfcfab60be2b97d6deae949b30e9e. * Remove deprecated encoding argument to msgpack * Set up dependency tree pattern matching skeleton (#2732) * Fix bug when too many entity types. 
Fixes #2800 * Fix Python 2 test failure * Require older msgpack-numpy * Restore encoding arg on msgpack-numpy * Try to fix version pin for msgpack-numpy * Update Portuguese Language (#2790) * Add words to portuguese language _num_words * Add words to portuguese language _num_words * Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols * Extended punctuation and norm_exceptions in the Portuguese language * Correct error in spacy universe docs concerning spacy-lookup (#2814) * Update Keras Example for (Parikh et al, 2016) implementation (#2803) * bug fixes in keras example * created contributor agreement * baseline for Parikh model * initial version of parikh 2016 implemented * tested asymmetric models * fixed grevious error in normalization * use standard SNLI test file * begin to rework parikh example * initial version of running example * start to document the new version * start to document the new version * Update Decompositional Attention.ipynb * fixed calls to similarity * updated the README * import sys package duh * simplified indexing on mapping word to IDs * stupid python indent error * added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround * Fix typo (closes #2815) [ci skip] * Update regex version dependency * Set version to 2.0.13.dev3 * Skip seemingly problematic test * Remove problematic test * Try previous version of regex * Revert "Remove problematic test" This reverts commit bdebbef45552d698d390aa430b527ee27830f11b. * Unskip test * Try older version of regex * 💫 Update training examples and use minibatching (#2830) <!--- Provide a general summary of your changes in the title. --> ## Description Update the training examples in `/examples/training` to show usage of spaCy's `minibatch` and `compounding` helpers ([see here](https://spacy.io/usage/training#tips-batch-size) for details). 
The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experienced slow and unsatisfying results. ### Types of change enhancements ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Visual C++ link updated (#2842) (closes #2841) [ci skip] * New landing page * Add contribution agreement * Correcting lang/ru/examples.py (#2845) * Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement * Correct some grammatical inaccuracies in lang\ru\examples.py * Move contributor agreement to separate file * Set version to 2.0.13.dev4 * Add Persian(Farsi) language support (#2797) * Also include lowercase norm exceptions * Remove in favour of https://github.com/explosion/spaCy/graphs/contributors * Rule-based French Lemmatizer (#2818) <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class. ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? 
--> - Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version. - Add several files containing exhaustive list of words for each part of speech - Add some lemma rules - Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX - Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned - Modify the lemmatize function to check in lookup table as a last resort - Init files are updated so the model can support all the functionalities mentioned above - Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [X] I have submitted the spaCy Contributor Agreement. - [X] I ran the tests, and all new and existing tests passed. - [X] My changes don't require a change to the documentation, or if they do, I've added all required information. 
kulak-kulak
kulik-kulik
kulum-kulum
kumat-kamit
kumpul-kumpul
kunang-kunang
kunar-kunar
kung-fu
kuning-hitam
kupat-kapit
kupu-kupu
kura-kura
kurang-kurang
kusat-mesat
kutat-kutet
kuti-kuti
kuwung-kuwung
kyai-kyai
laba-laba
labi-labi
labu-labu
laga-laga
lagi-lagi
lagu-lagu
laguh-lagah
lain-lain
laki-laki
lalu-lalang
lalu-lintas
lama-kelamaan
lama-lama
lamat-lamat
lambat-lambat
lampion-lampion
lampu-lampu
lancang-lancang
lancar-lancar
langak-longok
langgar-melanggar
langit-langit
langkah-langka
langkah-langkah
lanja-lanjaan
lapas-lapas
lapat-lapat
laporan-laporan
laptop-tablet
large-scale
lari-lari
lari-larian
laskar-laskar
lauk-pauk
laun-laun
laut-timur
lawah-lawah
lawak-lawak
lawan-lawan
lawi-lawi
layang-layang
layu-layuan
lebih-lebih
lecet-lecet
legak-legok
legum-legum
legup-legup
leha-leha
lekak-lekuk
lekap-lekup
lekas-lekas
lekat-lekat
lekuh-lekih
lekum-lekum
lekup-lekap
lembaga-lembaga
lempar-lemparan
lenggak-lenggok
lenggok-lenggok
lenggut-lenggut
lengket-lengket
lentam-lentum
lentang-lentok
lentang-lentung
lepa-lepa
lerang-lerang
lereng-lereng
lese-majeste
letah-letai
lete-lete
letuk-letuk
letum-letum
letup-letup
leyeh-leyeh
liang-liuk
liang-liut
liar-liar
liat-liut
lidah-lidah
life-toxins
liga-liga
light-emitting
lika-liku
lil-alamin
lilin-lilin
line-up
lintas-selat
lipat-melipat
liquid-cooled
lithium-ion
lithium-polymer
liuk-liuk
liung-liung
lobi-lobi
lock-up
locked-in
lokasi-lokasi
long-term
longak-longok
lontang-lanting
lontang-lantung
lopak-lapik
lopak-lopak
low-cost
low-density
low-end
low-light
low-multi
low-pass
lucu-lucu
luka-luka
lukisan-lukisan
lumba-lumba
lumi-lumi
luntang-lantung
lupa-lupa
lupa-lupaan
lurah-camat
maaf-memaafkan
mabuk-mabukan
mabul-mabul
macam-macam
macan-macanan
machine-to-machine
mafia-mafia
mahasiswa-mahasiswi
mahasiswa/i
mahi-mahi
main-main
main-mainan
main-mainlah
majelis-majelis
maju-mundur
makam-makam
makan-makan
makan-makanan
makanan-red
make-up
maki-maki
maki-makian
mal-mal
malai-malai
malam-malam
malar-malar
malas-malasan
mali-mali
malu-malu
mama-mama
man-in-the-middle
mana-mana
manajer-manajer
manik-manik
manis-manis
manis-manisan
marah-marah
mark-up
mas-mas
masa-masa
masak-masak
masalah-masalah
mash-up
masing-masing
masjid-masjid
masuk-keluar
mat-matan
mata-mata
match-fixing
mati-mati
mati-matian
maya-maya
mayat-mayat
mayday-mayday
media-media
mega-bintang
mega-tsunami
megal-megol
megap-megap
meger-meger
megrek-megrek
melak-melak
melambai-lambai
melambai-lambaikan
melambat-lambatkan
melaun-laun
melawak-lawak
melayang-layang
melayap-layap
melayap-layapkan
melebih-lebihi
melebih-lebihkan
melejang-lejangkan
melek-melekan
meleleh-leleh
melengah-lengah
melihat-lihat
melimpah-limpah
melincah-lincah
meliuk-liuk
melolong-lolong
melompat-lompat
meloncat-loncat
melonco-lonco
melongak-longok
melonjak-lonjak
memacak-macak
memada-madai
memadan-madan
memaki-maki
memaksa-maksa
memanas-manasi
memancit-mancitkan
memandai-mandai
memanggil-manggil
memanis-manis
memanjut-manjut
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so much be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit 62358dd867d15bc6a475942dff34effba69dd70a. 
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor indo (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.) ### Types of change bug fix ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Ddding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests
2018-11-29 15:30:29 +00:00
memantas-mantas
memasak-masak
memata-matai
mematah-matah
mematuk-matuk
mematut-matut
memau-mau
memayah-mayahkan
membaca-baca
membacah-bacah
membagi-bagikan
membalik-balik
membangkit-bangkit
membarut-barut
membawa-bawa
membayang-bayangi
membayang-bayangkan
membeda-bedakan
membelai-belai
membeli-beli
membelit-belitkan
membelu-belai
membenar-benar
membenar-benari
memberai-beraikan
membesar-besar
membesar-besarkan
membikin-bikin
membilah-bilah
membolak-balikkan
membongkar-bangkir
membongkar-bongkar
membuang-buang
membuat-buat
membulan-bulani
membunga-bungai
membungkuk-bungkuk
memburu-buru
memburu-burukan
memburuk-burukkan
memelintir-melintir
memencak-mencak
memencar-mencar
memercik-mercik
memetak-metak
memetang-metangkan
memetir-metir
memijar-mijar
memikir-mikir
memikir-mikirkan
memilih-milih
memilin-milin
meminang-minang
meminta-minta
memisah-misahkan
memontang-mantingkan
memorak-perandakan
memorak-porandakan
memotong-motong
memperamat-amat
memperamat-amatkan
memperbagai-bagaikan
memperganda-gandakan
memperganduh-ganduhkan
memperimpit-impitkan
memperkuda-kudakan
memperlengah-lengah
memperlengah-lengahkan
mempermacam-macamkan
memperolok-olok
memperolok-olokkan
mempersama-samakan
mempertubi-tubi
mempertubi-tubikan
memperturut-turutkan
memuja-muja
memukang-mukang
memulun-mulun
memundi-mundi
memundi-mundikan
memutar-mutar
memuyu-muyu
men-tweet
menagak-nagak
menakut-nakuti
menang-kalah
menanjur-nanjur
menanti-nanti
menari-nari
mencabik-cabik
mencabik-cabikkan
mencacah-cacah
mencaing-caing
mencak-mencak
mencakup-cakup
mencapak-capak
mencari-cari
mencarik-carik
mencarik-carikkan
mencarut-carut
mencengis-cengis
mencepak-cepak
mencepuk-cepuk
mencerai-beraikan
mencetai-cetai
menciak-ciak
menciap-ciap
menciar-ciar
mencita-citakan
mencium-cium
menciut-ciut
mencla-mencle
mencoang-coang
mencoba-coba
mencocok-cocok
mencolek-colek
menconteng-conteng
mencubit-cubit
mencucuh-cucuh
mencucuh-cucuhkan
mencuri-curi
mendecap-decap
mendegam-degam
mendengar-dengar
mendengking-dengking
mendengus-dengus
mendengut-dengut
menderai-deraikan
menderak-derakkan
menderau-derau
menderu-deru
mendesas-desuskan
mendesus-desus
mendetap-detap
mendewa-dewakan
mendudu-dudu
menduga-duga
menebu-nebu
menegur-neguri
menepak-nepak
menepak-nepakkan
mengabung-ngabung
mengaci-acikan
mengacu-acu
mengada-ada
mengada-ngada
mengadang-adangi
mengaduk-aduk
mengagak-agak
mengagak-agihkan
mengagut-agut
mengais-ngais
mengalang-alangi
mengali-ali
mengalur-alur
mengamang-amang
mengamat-amati
mengambai-ambaikan
mengambang-ambang
mengambung-ambung
mengambung-ambungkan
mengamit-ngamitkan
mengancai-ancaikan
mengancak-ancak
mengancar-ancar
mengangan-angan
mengangan-angankan
mengangguk-angguk
menganggut-anggut
mengangin-anginkan
mengangkat-angkat
menganjung-anjung
menganjung-anjungkan
mengap-mengap
mengapa-apai
mengapi-apikan
mengarah-arahi
mengarang-ngarang
mengata-ngatai
mengatup-ngatupkan
mengaum-aum
mengaum-aumkan
mengejan-ejan
mengejar-ngejar
mengejut-ngejuti
mengelai-ngelai
mengelepik-ngelepik
mengelip-ngelip
mengelu-elukan
mengelus-elus
mengembut-embut
mengempas-empaskan
mengenap-enapkan
mengendap-endap
mengenjak-enjak
mengentak-entak
mengentak-entakkan
mengepak-ngepak
mengepak-ngepakkan
mengepal-ngepalkan
mengerjap-ngerjap
mengerling-ngerling
mengertak-ngertakkan
mengesot-esot
menggaba-gabai
menggali-gali
menggalur-galur
menggamak-gamak
menggamit-gamitkan
menggapai-gapai
menggapai-gapaikan
menggaruk-garuk
menggebu-gebu
menggebyah-uyah
menggeleng-gelengkan
menggelepar-gelepar
menggelepar-geleparkan
menggeliang-geliutkan
menggelinding-gelinding
menggemak-gemak
menggembar-gemborkan
menggerak-gerakkan
menggerecak-gerecak
menggesa-gesakan
menggili-gili
menggodot-godot
menggolak-galikkan
menggorek-gorek
menggoreng-goreng
menggosok-gosok
menggoyang-goyangkan
mengguit-guit
menghalai-balaikan
menghalang-halangi
menghambur-hamburkan
menghinap-hinap
menghitam-memutihkan
menghitung-hitung
menghubung-hubungkan
menghujan-hujankan
mengiang-ngiang
mengibar-ngibarkan
mengibas-ngibas
mengibas-ngibaskan
mengidam-idamkan
mengilah-ngilahkan
mengilai-ilai
mengilat-ngilatkan
mengilik-ngilik
mengimak-imak
mengimbak-imbak
mengiming-iming
mengincrit-incrit
mengingat-ingat
menginjak-injak
mengipas-ngipas
mengira-ngira
mengira-ngirakan
mengiras-iras
mengiras-irasi
mengiris-iris
mengitar-ngitar
mengitik-ngitik
mengodol-odol
mengogok-ogok
mengolak-alik
mengolak-alikkan
mengolang-aling
mengolang-alingkan
mengoleng-oleng
mengolok-olok
mengombang-ambing
mengombang-ambingkan
mengongkang-ongkang
mengongkok-ongkok
mengonyah-anyih
mengopak-apik
mengorak-arik
mengorat-oret
mengorek-ngorek
mengoret-oret
mengorok-orok
mengotak-atik
mengotak-ngatikkan
mengotak-ngotakkan
mengoyak-ngoyak
mengoyak-ngoyakkan
mengoyak-oyak
menguar-nguarkan
menguar-uarkan
mengubah-ubah
mengubek-ubek
menguber-uber
mengubit-ubit
mengubrak-abrik
mengucar-ngacirkan
mengucek-ngucek
mengucek-ucek
menguik-uik
menguis-uis
mengulang-ulang
mengulas-ulas
mengulit-ulit
mengulum-ngulum
mengulur-ulur
menguman-uman
mengumbang-ambingkan
mengumpak-umpak
mengungkat-ungkat
mengungkit-ungkit
mengupa-upa
mengurik-urik
mengusil-usil
mengusil-usilkan
mengutak-atik
mengutak-ngatikkan
mengutik-ngutik
mengutik-utik
menika-nika
menimang-nimang
menimbang-nimbang
menimbun-nimbun
menimpang-nimpangkan
meningkat-ningkat
meniru-niru
menit-menit
menitar-nitarkan
meniup-niup
menjadi-jadi
menjadi-jadikan
menjedot-jedotkan
menjelek-jelekkan
menjengek-jengek
menjengit-jengit
menjerit-jerit
menjilat-jilat
menjungkat-jungkit
menko-menko
menlu-menlu
menonjol-nonjolkan
mentah-mentah
mentang-mentang
menteri-menteri
mentul-mentul
menuding-nuding
menumpah-numpahkan
menunda-nunda
menunduk-nunduk
menusuk-nusuk
menyala-nyala
menyama-nyama
menyama-nyamai
menyambar-nyambar
menyangkut-nyangkutkan
menyanjung-nyanjung
menyanjung-nyanjungkan
menyapu-nyapu
menyarat-nyarat
menyayat-nyayat
menyedang-nyedang
menyedang-nyedangkan
menyelang-nyelangkan
menyelang-nyeling
menyelang-nyelingkan
menyenak-nyenak
menyendi-nyendi
menyentak-nyentak
menyentuh-nyentuh
menyepak-nyepakkan
menyerak-nyerakkan
menyeret-nyeret
menyeru-nyerukan
menyetel-nyetel
menyia-nyiakan
menyibak-nyibak
menyobek-nyobek
menyorong-nyorongkan
menyungguh-nyungguhi
menyuruk-nyuruk
meraba-raba
merah-hitam
merah-merah
merambang-rambang
merangkak-rangkak
merasa-rasai
merata-ratakan
meraung-raung
meraung-raungkan
merayau-rayau
merayu-rayu
mercak-mercik
The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experienced slow and unsatisfying results. ### Types of change enhancements ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Visual C++ link updated (#2842) (closes #2841) [ci skip] * New landing page * Add contribution agreement * Correcting lang/ru/examples.py (#2845) * Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement * Correct some grammatical inaccuracies in lang\ru\examples.py * Move contributor agreement to separate file * Set version to 2.0.13.dev4 * Add Persian(Farsi) language support (#2797) * Also include lowercase norm exceptions * Remove in favour of https://github.com/explosion/spaCy/graphs/contributors * Rule-based French Lemmatizer (#2818) <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class. ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? 
--> - Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version. - Add several files containing exhaustive list of words for each part of speech - Add some lemma rules - Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX - Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned - Modify the lemmatize function to check in lookup table as a last resort - Init files are updated so the model can support all the functionalities mentioned above - Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [X] I have submitted the spaCy Contributor Agreement. - [X] I ran the tests, and all new and existing tests passed. - [X] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Set version to 2.0.13 * Fix formatting and consistency * Update docs for new version [ci skip] * Increment version [ci skip] * Add info on wheels [ci skip] * Adding "This is a sentence" example to Sinhala (#2846) * Add wheels badge * Update badge [ci skip] * Update README.rst [ci skip] * Update murmurhash pin * Increment version to 2.0.14.dev0 * Update GPU docs for v2.0.14 * Add wheel to setup_requires * Import prefer_gpu and require_gpu functions from Thinc * Add tests for prefer_gpu() and require_gpu() * Update requirements and setup.py * Workaround bug in thinc require_gpu * Set version to v2.0.14 * Update push-tag script * Unhack prefer_gpu * Require thinc 6.10.6 * Update prefer_gpu and require_gpu docs [ci skip] * Fix specifiers for GPU * Set version to 2.0.14.dev1 * Set version to 2.0.14 * Update Thinc version pin * Increment version * Fix msgpack-numpy version pin * Increment version * Update version to 2.0.16 * Update version [ci skip] * Redundant ')' in the Stop words' example (#2856) <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so much be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit 62358dd867d15bc6a475942dff34effba69dd70a. 
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor indo (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.) ### Types of change bug fix ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Ddding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests
2018-11-29 15:30:29 +00:00
mercedes-benz
merek-merek
mereka-mereka
mereka-reka
merelap-relap
merem-merem
meremah-remah
meremas-remas
meremeh-temehkan
merempah-rempah
merempah-rempahi
merengek-rengek
merengeng-rengeng
merenik-renik
merenta-renta
merenyai-renyai
meresek-resek
merintang-rintang
merintik-rintik
merobek-robek
meronta-ronta
meruap-ruap
merubu-rubu
merungus-rungus
merungut-rungut
meta-analysis
metode-metode
mewanti-wanti
mewarna-warnikan
meyakin-yakini
mid-range
mid-size
miju-miju
mikro-kecil
mimpi-mimpi
minggu-minggu
minta-minta
minuman-minuman
mixed-use
mobil-mobil
mobile-first
mobile-friendly
moga-moga
mola-mola
momen-momen
mondar-mandir
monyet-monyet
morak-marik
morat-marit
move-on
muda-muda
muda-mudi
muda/i
mudah-mudahan
muka-muka
mula-mula
multiple-output
muluk-muluk
mulut-mulutan
mumi-mumi
mundur-mundur
muntah-muntah
murid-muridnya
musda-musda
museum-museum
muslim-muslimah
musuh-musuh
musuh-musuhnya
nabi-nabi
nada-nadanya
naga-naga
naga-naganya
naik-naik
naik-turun
nakal-nakalan
nama-nama
nanti-nantian
nanya-nanya
nasi-nasi
nasib-nasiban
near-field
negara-negara
negera-negara
negeri-negeri
negeri-red
neka-neka
nekat-nekat
neko-neko
nenek-nenek
neo-liberalisme
next-gen
next-generation
ngeang-ngeang
ngeri-ngeri
nggak-nggak
ngobrol-ngobrol
ngumpul-ngumpul
nilai-nilai
nine-dash
nipa-nipa
nong-nong
norma-norma
novel-novel
nyai-nyai
nyolong-nyolong
nyut-nyutan
ob-gyn
obat-obat
obat-obatan
objek-objek
obok-obok
obrak-abrik
octa-core
odong-odong
oedipus-kompleks
off-road
ogah-agih
ogah-ogah
ogah-ogahan
ogak-agik
ogak-ogak
ogoh-ogoh
olak-alik
olak-olak
olang-aling
olang-alingan
ole-ole
oleh-oleh
olok-olok
olok-olokan
olong-olong
om-om
ombang-ambing
omni-channel
on-board
on-demand
on-fire
on-line
on-off
on-premises
on-roll
on-screen
on-the-go
onde-onde
ondel-ondel
ondos-ondos
one-click
one-to-one
one-touch
one-two
oneng-oneng
ongkang-ongkang
ongol-ongol
online-to-offline
ontran-ontran
onyah-anyih
onyak-anyik
opak-apik
opsi-opsi
opt-in
orak-arik
orang-aring
orang-orang
orang-orangan
orat-oret
organisasi-organisasi
ormas-ormas
orok-orok
orong-orong
oseng-oseng
otak-atik
otak-otak
otak-otakan
over-heating
over-the-air
over-the-top
pa-pa
pabrik-pabrik
padi-padian
pagi-pagi
pagi-sore
pajak-pajak
paket-paket
palas-palas
palato-alveolar
paling-paling
palu-arit
palu-memalu
panas-dingin
panas-panas
pandai-pandai
pandang-memandang
panel-panel
pangeran-pangeran
panggung-panggung
pangkalan-pangkalan
panja-panja
panji-panji
pansus-pansus
pantai-pantai
pao-pao
para-para
parang-parang
parpol-parpol
partai-partai
paru-paru
pas-pasan
pasal-pasal
pasang-memasang
pasang-surut
pasar-pasar
pasu-pasu
paus-paus
paut-memaut
pay-per-click
paya-paya
pdi-p
pecah-pecah
pecat-pecatan
peer-to-peer
pejabat-pejabat
pekak-pekak
pekik-pekuk
pelabuhan-pelabuhan
pelacur-pelacur
pelajar-pelajar
pelan-pelan
pelangi-pelangi
pem-bully
pemain-pemain
pemata-mataan
pemda-pemda
pemeluk-pemeluknya
pemerintah-pemerintah
pemerintah-red
pemerintah-swasta
pemetang-metangan
pemilu-pemilu
pemimpin-pemimpin
peminta-minta
pemuda-pemuda
pemuda-pemudi
penanggung-jawab
pengali-ali
pengaturan-pengaturan
penggembar-gemboran
pengorak-arik
pengotak-ngotakan
pengundang-undang
pengusaha-pengusaha
pentung-pentungan
penyakit-penyakit
perak-perak
perang-perangan
peras-perus
peraturan-peraturan
perda-perda
perempat-final
perempuan-perempuan
pergi-pergi
pergi-pulang
perintang-rintang
perkereta-apian
perlahan-lahan
perlip-perlipan
permen-permen
pernak-pernik
pernik-pernik
pertama-tama
pertandingan-pertandingan
pertimbangan-pertimbangan
perudang-undangan
perundang-undangan
perundangan-undangan
perusahaan-perusahaan
perusahaan-perusahan
perwakilan-perwakilan
pesan-pesan
pesawat-pesawat
peta-jalan
petang-petang
petantang-petenteng
petatang-peteteng
pete-pete
piala-piala
piat-piut
pick-up
picture-in-picture
pihak-pihak
pijak-pijak
pijar-pijar
pijat-pijat
pikir-pikir
pil-pil
pilah-pilih
pilih-pilih
pilihan-pilihan
pilin-memilin
pilkada-pilkada
pina-pina
pindah-pindah
ping-pong
pinjam-meminjam
pintar-pintarlah
pisang-pisang
pistol-pistolan
piting-memiting
planet-planet
play-off
plin-plan
plintat-plintut
plonga-plongo
plug-in
plus-minus
plus-plus
poco-poco
pohon-pohonan
poin-poin
point-of-sale
point-of-sales
pokemon-pokemon
pokja-pokja
pokok-pokok
pokrol-pokrolan
polang-paling
polda-polda
poleng-poleng
polong-polongan
polres-polres
polsek-polsek
polwan-polwan
poma-poma
pondok-pondok
ponpes-ponpes
pontang-panting
pop-up
porak-parik
porak-peranda
porak-poranda
pos-pos
posko-posko
potong-memotong
praktek-praktek
praktik-praktik
produk-produk
program-program
promosi-degradasi
provinsi-provinsi
proyek-proyek
puing-puing
puisi-puisi
puji-pujian
pukang-pukang
pukul-memukul
pulang-pergi
pulau-pulai
pulau-pulau
pull-up
pulut-pulut
pundi-pundi
pungak-pinguk
punggung-memunggung
pura-pura
puruk-parak
pusar-pusar
pusat-pusat
push-to-talk
push-up
push-ups
pusing-pusing
puskesmas-puskesmas
putar-putar
putera-puteri
putih-hitam
putih-putih
putra-putra
putra-putri
2018-11-29 15:30:29 +00:00
putra/i
putri-putri
putus-putus
putusan-putusan
puvi-puvi
quad-core
raba-rabaan
raba-rubu
rada-rada
radio-frequency
ragu-ragu
rahasia-rahasiaan
raja-raja
rama-rama
ramai-ramai
ramalan-ramalan
rambeh-rambeh
rambu-rambu
rame-rame
ramu-ramuan
randa-rondo
rangkul-merangkul
rango-rango
rap-rap
rasa-rasanya
rata-rata
raun-raun
read-only
real-life
real-time
rebah-rebah
rebah-rebahan
rebas-rebas
red-eye
redam-redam
redep-redup
rehab-rekon
reja-reja
reka-reka
reka-rekaan
rekan-rekan
rekan-rekannya
rekor-rekor
relief-relief
remah-remah
remang-remang
rembah-rembah
rembah-rembih
remeh-cemeh
remeh-temeh
rempah-rempah
rencana-rencana
renyai-renyai
rep-repan
repot-repot
repuh-repuh
restoran-restoran
retak-retak
riang-riang
ribu-ribu
ribut-ribut
rica-rica
ride-sharing
rigi-rigi
rinai-rinai
rintik-rintik
ritual-ritual
robak-rabik
robat-rabit
robot-robot
role-play
role-playing
roll-on
rombang-rambing
romol-romol
rompang-romping
rondah-rondih
ropak-rapik
royal-royalan
royo-royo
ruak-ruak
ruba-ruba
rudal-rudal
ruji-ruji
ruku-ruku
rumah-rumah
rumah-rumahan
rumbai-rumbai
rumput-rumputan
runding-merunding
rundu-rundu
runggu-rangga
runner-up
runtang-runtung
rupa-rupa
rupa-rupanya
rusun-rusun
rute-rute
saat-saat
saban-saban
sabu-sabu
sabung-menyabung
sah-sah
sahabat-sahabat
saham-saham
sahut-menyahut
saing-menyaing
saji-sajian
sakit-sakitan
saksi-saksi
saku-saku
salah-salah
sama-sama
samar-samar
sambar-menyambar
sambung-bersambung
sambung-menyambung
sambut-menyambut
samo-samo
sampah-sampah
sampai-sampai
samping-menyamping
sana-sini
sandar-menyandar
sandi-sandi
sangat-sangat
sangkut-menyangkut
sapa-menyapa
sapai-sapai
sapi-sapi
sapu-sapu
saran-saran
sarana-prasarana
sari-sari
sarit-sarit
satu-dua
satu-satu
satu-satunya
satuan-satuan
saudara-saudara
sauk-menyauk
sauk-sauk
sayang-sayang
sayap-sayap
sayup-menyayup
sayup-sayup
sayur-mayur
sayur-sayuran
sci-fi
seagak-agak
seakal-akal
seakan-akan
sealak-alak
seari-arian
sebaik-baiknya
sebelah-menyebelah
sebentar-sebentar
seberang-menyeberang
seberuntung-beruntungnya
sebesar-besarnya
seboleh-bolehnya
sedalam-dalamnya
sedam-sedam
sedang-menyedang
sedang-sedang
sedap-sedapan
sedapat-dapatnya
sedikit-dikitnya
sedikit-sedikit
sedikit-sedikitnya
sedini-dininya
seelok-eloknya
segala-galanya
segan-menyegan
segan-menyegani
segan-segan
sehabis-habisnya
sehari-hari
sehari-harian
sehari-harinya
sejadi-jadinya
sekali-kali
sekali-sekali
sekenyang-kenyangnya
sekira-kira
sekolah-sekolah
sekonyong-konyong
sekosong-kosongnya
sektor-sektor
sekuasa-kuasanya
sekuat-kuatnya
sekurang-kurangnya
sel-sel
sela-menyela
sela-sela
selak-seluk
selama-lamanya
selambat-lambatnya
selang-seli
selang-seling
selar-belar
selat-latnya
selatan-tenggara
selekas-lekasnya
selentang-selenting
selepas-lepas
self-driving
self-esteem
self-healing
self-help
selir-menyelir
seloyong-seloyong
seluk-beluk
seluk-semeluk
sema-sema
semah-semah
semak-semak
semaksimal-maksimalnya
semalam-malaman
semang-semang
semanis-manisnya
semasa-masa
semata-mata
semau-maunya
sembunyi-sembunyi
sembunyi-sembunyian
sembur-sembur
semena-mena
semenda-menyemenda
semengga-mengga
semenggah-menggah
sementang-mentang
semerdeka-merdekanya
semi-final
semi-permanen
sempat-sempatnya
semu-semu
semua-muanya
semujur-mujurnya
semut-semutan
sen-senan
sendiri-sendiri
sengal-sengal
sengar-sengir
sengau-sengauan
senggak-sengguk
senggang-tenggang
senggol-menyenggol
senior-junior
senjata-senjata
senyum-senyum
seolah-olah
sepala-pala
sepandai-pandai
sepetang-petangan
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so much be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit 62358dd867d15bc6a475942dff34effba69dd70a. 
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor indo (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.) ### Types of change bug fix ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Ddding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests
2018-11-29 15:30:29 +00:00
sepoi-sepoi
sepraktis-praktisnya
sepuas-puasnya
serak-serak
serak-serik
serang-menyerang
serang-serangan
serangan-serangan
seraya-menyeraya
serba-serbi
serbah-serbih
serembah-serembih
serigala-serigala
sering-sering
serobot-serobotan
serong-menyerong
serta-menyertai
serta-merta
serta-serta
seru-seruan
service-oriented
sesak-menyesak
sesal-menyesali
sesayup-sayup
sesi-sesi
sesuang-suang
sesudah-sudah
sesudah-sudahnya
sesuka-suka
sesuka-sukanya
set-piece
setempat-setempat
setengah-setengah
setidak-tidaknya
setinggi-tingginya
seupaya-upaya
seupaya-upayanya
sewa-menyewa
sewaktu-waktu
sewenang-wenang
sewot-sewotan
shabu-shabu
short-term
short-throw
sia-sia
siang-siang
siap-siap
siapa-siapa
sibar-sibar
sibur-sibur
sida-sida
side-by-side
sign-in
siku-siku
sikut-sikutan
silah-silah
silang-menyilang
silir-semilir
simbol-simbol
simpan-pinjam
sinar-menyinar
sinar-seminar
sinar-suminar
sindir-menyindir
singa-singa
singgah-menyinggah
single-core
sipil-militer
sir-siran
sirat-sirat
sisa-sisa
sisi-sisi
siswa-siswa
siswa-siswi
siswa/i
siswi-siswi
situ-situ
situs-situs
six-core
six-speed
slintat-slintut
slo-mo
slow-motion
snap-on
sobek-sobekan
sodok-sodokan
sok-sokan
solek-menyolek
solid-state
sorak-sorai
sorak-sorak
sore-sore
sosio-ekonomi
soya-soya
spill-resistant
split-screen
sponsor-sponsor
sponsor-sponsoran
srikandi-srikandi
staf-staf
stand-by
stand-up
start-up
stasiun-stasiun
state-owned
striker-striker
studi-studi
suam-suam
suami-isteri
suami-istri
suami-suami
suang-suang
suara-suara
sudin-sudin
sudu-sudu
sudung-sudung
sugi-sugi
suka-suka
suku-suku
sulang-menyulang
sulat-sulit
sulur-suluran
sum-sum
sumber-sumber
sumpah-sumpah
sumpit-sumpit
sundut-bersundut
sungai-sungai
sungguh-sungguh
sungut-sungut
sunting-menyunting
super-damai
super-rahasia
super-sub
supply-demand
supply-side
suram-suram
surat-menyurat
surat-surat
suruh-suruhan
suruk-surukan
susul-menyusul
suwir-suwir
syarat-syarat
system-on-chip
t-shirt
t-shirts
tabar-tabar
tabir-mabir
tabrak-tubruk
tabuh-tabuhan
tabun-menabun
tahu-menahu
tahu-tahu
tahun-tahun
takah-takahnya
takang-takik
take-off
takut-takut
takut-takutan
tali-bertali
tali-tali
talun-temalun
taman-taman
tampak-tampak
tanak-tanakan
tanam-menanam
tanam-tanaman
tanda-tanda
tangan-menangan
tangan-tangan
tangga-tangga
tanggal-tanggal
tanggul-tanggul
tanggung-menanggung
tanggung-tanggung
tank-tank
* Set version to 2.0.13 * Fix formatting and consistency * Update docs for new version [ci skip] * Increment version [ci skip] * Add info on wheels [ci skip] * Adding "This is a sentence" example to Sinhala (#2846) * Add wheels badge * Update badge [ci skip] * Update README.rst [ci skip] * Update murmurhash pin * Increment version to 2.0.14.dev0 * Update GPU docs for v2.0.14 * Add wheel to setup_requires * Import prefer_gpu and require_gpu functions from Thinc * Add tests for prefer_gpu() and require_gpu() * Update requirements and setup.py * Workaround bug in thinc require_gpu * Set version to v2.0.14 * Update push-tag script * Unhack prefer_gpu * Require thinc 6.10.6 * Update prefer_gpu and require_gpu docs [ci skip] * Fix specifiers for GPU * Set version to 2.0.14.dev1 * Set version to 2.0.14 * Update Thinc version pin * Increment version * Fix msgpack-numpy version pin * Increment version * Update version to 2.0.16 * Update version [ci skip] * Redundant ')' in the Stop words' example (#2856) <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Documentation improvement regarding joblib and SO (#2867) Some documentation improvements ## Description 1. Fixed the dead URL to joblib 2. Fixed Stack Overflow brand name (with space) ### Types of change Documentation ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * raise error when setting overlapping entities as doc.ents (#2880) * Fix out-of-bounds access in NER training The helper method state.B(1) gets the index of the first token of the buffer, or -1 if no such token exists. Normally this is safe because we pass this to functions like state.safe_get(), which returns an empty token. Here we used it directly as an array index, which is not okay! This error may have been the cause of out-of-bounds access errors during training. Similar errors may still be around, so much be hunted down. Hunting this one down took a long time...I printed out values across training runs and diffed, looking for points of divergence between runs, when no randomness should be allowed. * Change PyThaiNLP Url (#2876) * Fix missing comma * Add example showing a fix-up rule for space entities * Set version to 2.0.17.dev0 * Update regex version * Revert "Update regex version" This reverts commit 62358dd867d15bc6a475942dff34effba69dd70a. 
* Try setting older regex version, to align with conda * Set version to 2.0.17 * Add spacy-js to universe [ci-skip] * Add spacy-raspberry to universe (closes #2889) * Add script to validate universe json [ci skip] * Removed space in docs + added contributor indo (#2909) * - removed unneeded space in documentation * - added contributor info * Allow input text of length up to max_length, inclusive (#2922) * Include universe spec for spacy-wordnet component (#2919) * feat: include universe spec for spacy-wordnet component * chore: include spaCy contributor agreement * Minor formatting changes [ci skip] * Fix image [ci skip] Twitter URL doesn't work on live site * Check if the word is in one of the regular lists specific to each POS (#2886) * 💫 Create random IDs for SVGs to prevent ID clashes (#2927) Resolves #2924. ## Description Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.) ### Types of change bug fix ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. 
* Fix typo [ci skip] * fixes symbolic link on py3 and windows (#2949) * fixes symbolic link on py3 and windows during setup of spacy using command python -m spacy link en_core_web_sm en closes #2948 * Update spacy/compat.py Co-Authored-By: cicorias <cicorias@users.noreply.github.com> * Fix formatting * Update universe [ci skip] * Catalan Language Support (#2940) * Catalan language Support * Ddding Catalan to documentation * Sort languages alphabetically [ci skip] * Update tests for pytest 4.x (#2965) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)) - [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here) ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information. * Fix regex pin to harmonize with conda (#2964) * Update README.rst * Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977) Fixes #2976 * Fix typo * Fix typo * Remove duplicate file * Require thinc 7.0.0.dev2 Fixes bug in gpu_ops that would use cupy instead of numpy on CPU * Add missing import * Fix error IDs * Fix tests
2018-11-29 15:30:29 +00:00
tante-tante
tanya-jawab
tapa-tapa
tapak-tapak
tari-menari
tari-tarian
tarik-menarik
tarik-ulur
tata-tertib
tatah-tatah
tau-tau
tawa-tawa
tawak-tawak
tawang-tawang
tawar-menawar
tawar-tawar
tayum-temayum
tebak-tebakan
tebu-tebu
tedong-tedong
tegak-tegak
tegerbang-gerbang
teh-tehan
tek-tek
teka-teki
teknik-teknik
teman-teman
teman-temanku
temas-temas
tembak-menembak
temeh-temeh
tempa-menempa
tempat-tempat
tempo-tempo
temut-temut
tenang-tenang
tengah-tengah
tenggang-menenggang
tengok-menengok
teori-teori
teraba-raba
teralang-alang
terambang-ambang
terambung-ambung
terang-terang
terang-terangan
teranggar-anggar
terangguk-angguk
teranggul-anggul
terangin-angin
terangkup-angkup
teranja-anja
terapung-apung
terayan-rayan
terayap-rayap
terbada-bada
terbahak-bahak
terbang-terbang
terbata-bata
terbatuk-batuk
terbayang-bayang
terbeda-bedakan
terbengkil-bengkil
terbengong-bengong
terbirit-birit
terbuai-buai
terbuang-buang
terbungkuk-bungkuk
terburu-buru
tercangak-cangak
tercengang-cengang
tercilap-cilap
tercongget-congget
tercoreng-moreng
tercungap-cungap
terdangka-dangka
terdengih-dengih
terduga-duga
terekeh-ekeh
terembut-embut
terembut-rembut
terempas-empas
terengah-engah
teresak-esak
tergagap-gagap
tergagau-gagau
tergaguk-gaguk
tergapai-gapai
tergegap-gegap
tergegas-gegas
tergelak-gelak
tergelang-gelang
tergeleng-geleng
tergelung-gelung
tergerai-gerai
tergerenyeng-gerenyeng
tergesa-gesa
tergila-gila
tergolek-golek
tergontai-gontai
tergudik-gudik
tergugu-gugu
terguling-guling
tergulut-gulut
terhambat-hambat
terharak-harak
terharap-harap
terhengit-hengit
terheran-heran
terhinggut-hinggut
terigau-igau
terimpi-impi
terincut-incut
teringa-inga
teringat-ingat
terinjak-injak
terisak-isak
terjembak-jembak
terjerit-jerit
terkadang-kadang
terkagum-kagum
terkaing-kaing
terkakah-kakah
terkakak-kakak
terkampul-kampul
terkanjar-kanjar
terkantuk-kantuk
terkapah-kapah
terkapai-kapai
terkapung-kapung
terkatah-katah
terkatung-katung
terkecap-kecap
terkedek-kedek
terkedip-kedip
terkejar-kejar
terkekau-kekau
terkekeh-kekeh
terkekek-kekek
terkelinjat-kelinjat
terkelip-kelip
terkempul-kempul
terkemut-kemut
terkencar-kencar
terkencing-kencing
terkentut-kentut
terkepak-kepak
terkesot-kesot
terkesut-kesut
terkial-kial
terkijai-kijai
terkikih-kikih
terkikik-kikik
terkincak-kincak
terkindap-kindap
terkinja-kinja
terkirai-kirai
terkitar-kitar
terkocoh-kocoh
terkojol-kojol
terkokol-kokol
terkosel-kosel
terkotak-kotak
terkoteng-koteng
terkuai-kuai
terkumpal-kumpal
terlara-lara
terlayang-layang
terlebih-lebih
terlincah-lincah
terliuk-liuk
terlolong-lolong
terlongong-longong
terlunta-lunta
termangu-mangu
termanja-manja
termata-mata
termengah-mengah
termenung-menung
termimpi-mimpi
termonyong-monyong
ternanti-nanti
terngiang-ngiang
teroleng-oleng
terombang-ambing
terpalit-palit
terpandang-pandang
terpecah-pecah
terpekik-pekik
terpencar-pencar
terpereh-pereh
terpijak-pijak
terpikau-pikau
terpilah-pilah
terpinga-pinga
terpingkal-pingkal
terpingkau-pingkau
terpontang-panting
terpusing-pusing
terputus-putus
tersanga-sanga
tersaruk-saruk
tersedan-sedan
tersedih-sedih
tersedu-sedu
terseduh-seduh
tersendat-sendat
tersendeng-sendeng
tersengal-sengal
tersengguk-sengguk
tersengut-sengut
terseok-seok
tersera-sera
terserak-serak
tersetai-setai
tersia-sia
tersipu-sipu
tersoja-soja
tersungkuk-sungkuk
tersuruk-suruk
tertagak-tagak
tertahan-tahan
tertatih-tatih
tertegun-tegun
tertekan-tekan
terteleng-teleng
tertendang-tendang
tertimpang-timpang
tertitar-titar
terumbang-ambing
terumbang-umbang
terungkap-ungkap
terus-menerus
terus-terusan
tete-a-tete
text-to-speech
think-tank
think-thank
third-party
third-person
three-axis
three-point
tiap-tiap
tiba-tiba
tidak-tidak
tidur-tidur
tidur-tiduran
tie-dye
tie-in
tiga-tiganya
tikam-menikam
tiki-taka
tikus-tikus
tilik-menilik
tim-tim
timah-timah
timang-timangan
timbang-menimbang
time-lapse
timpa-menimpa
timu-timu
timun-timunan
timur-barat
timur-laut
timur-tenggara
tindih-bertindih
tindih-menindih
tinjau-meninjau
tinju-meninju
tip-off
tipu-tipu
tiru-tiruan
titik-titik
titik-titiknya
tiup-tiup
to-do
tokak-takik
toko-toko
tokoh-tokoh
tokok-menokok
tolak-menolak
tolong-menolong
tong-tong
top-level
top-up
totol-totol
touch-screen
trade-in
training-camp
trans-nasional
treble-winner
tri-band
trik-trik
triple-core
truk-truk
tua-tua
tuan-tuan
tuang-tuang
tuban-tuban
tubuh-tubuh
tujuan-tujuan
tuk-tuk
tukang-menukang
tukar-menukar
tulang-belulang
tulang-tulangan
tuli-tuli
tulis-menulis
tumbuh-tumbuhan
tumpang-tindih
tune-up
tunggang-tunggik
tunggang-tungging
tunggang-tunggit
tunggul-tunggul
tunjuk-menunjuk
tupai-tupai
tupai-tupaian
turi-turian
turn-based
turnamen-turnamen
turun-temurun
turut-menurut
turut-turutan
tuyuk-tuyuk
twin-cam
twin-turbocharged
two-state
two-step
two-tone
u-shape
uang-uangan
uar-uar
ubek-ubekan
ubel-ubel
ubrak-abrik
ubun-ubun
ubur-ubur
uci-uci
udang-undang
udap-udapan
ugal-ugalan
uget-uget
uir-uir
ujar-ujar
uji-coba
ujung-ujung
ujung-ujungnya
uka-uka
ukir-mengukir
ukir-ukiran
ula-ula
ulak-ulak
ulam-ulam
ulang-alik
ulang-aling
ulang-ulang
ulap-ulap
ular-ular
ular-ularan
ulek-ulek
ulu-ulu
ulung-ulung
umang-umang
umbang-ambing
umbi-umbian
umbul-umbul
umbut-umbut
uncang-uncit
undak-undakan
undang-undang
undang-undangnya
unduk-unduk
undung-undung
undur-undur
unek-unek
ungah-angih
unggang-anggit
unggat-unggit
unggul-mengungguli
ungkit-ungkit
unit-unit
universitas-universitas
unsur-unsur
untang-anting
unting-unting
untung-untung
untung-untungan
upah-mengupah
upih-upih
upside-down
ura-ura
uran-uran
urat-urat
uring-uringan
urup-urup
urup-urupan
urus-urus
usaha-usaha
user-user
user-useran
utak-atik
utang-piutang
utang-utang
utar-utar
utara-jauh
utara-selatan
uter-uter
utusan-utusan
v-belt
v-neck
value-added
very-very
video-video
visi-misi
visi-misinya
voa-islam
voice-over
volt-ampere
wajah-wajah
wajar-wajar
wake-up
wakil-wakil
walk-in
walk-out
wangi-wangian
wanita-wanita
wanti-wanti
wara-wara
wara-wiri
warna-warna
warna-warni
was-was
water-cooled
web-based
wide-angle
wilayah-wilayah
win-win
wira-wiri
wora-wari
work-life
world-class
yang-yang
yayasan-yayasan
year-on-year
yel-yel
yo-yo
zam-zam
zig-zag
""".split()
)