spaCy

adrianeboyd faaa832518 Generalize handling of tokenizer special cases (#4259)	2019-11-13 21:24:35 +01:00

* Generalize handling of tokenizer special cases
  Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete.
  Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts:
    * Allows arbitrary numbers of prefixes/affixes around special cases
    * Allows special cases separated by infixes
  Existing tests/settings that couldn't be preserved as before:
    * The emoticon '")' is no longer a supported special case
    * The emoticon ':)' in "example:)" is a false positive again
  When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher.
* Remove accidentally added test case
* Really remove accidentally added test
* Reload special cases when necessary
  Reload special cases when affixes or token_match are modified. Skip reloading during initialization.
* Update error code number
* Fix offset and whitespace in Matcher special cases
* Fix offset bugs when merging and splitting tokens
* Set final whitespace on final token in inserted special case
* Improve cache flushing in tokenizer
* Separate cache and specials memory (temporarily)
* Flush cache when adding special cases
* Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()` are necessary due to this bug: https://github.com/explosion/preshed/issues/21
* Remove reinitialized PreshMaps on cache flush
* Update UD bin scripts
* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu)
* Use special Matcher only for cases with affixes
* Reinsert specials cache checks during normal tokenization for special cases as much as possible
* Additionally include specials cache checks while splitting on infixes
* Since the special Matcher needs consistent affix-only tokenization for the special cases themselves, introduce the argument `with_special_cases` in order to do tokenization with or without specials cache checks
* After normal tokenization, postprocess with special cases Matcher for special cases containing affixes
* Replace PhraseMatcher with Aho-Corasick
  Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText.
  The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies.
  Fixes #4308.
* Restore support for pickling
* Fix internal keyword add/remove for numpy arrays
* Add test for #4248, clean up test
* Improve efficiency of special cases handling
* Use PhraseMatcher instead of Matcher
* Improve efficiency of merging/splitting special cases in document
* Process merge/splits in one pass without repeated token shifting
* Merge in place if no splits
* Update error message number
* Remove UD script modifications
  Only used for timing/testing, should be a separate PR
* Remove final traces of UD script modifications
* Update UD bin scripts
* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu)
* Add missing loop for match ID set in search loop
* Remove cruft in matching loop for partial matches
  There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher.
* Replace dict trie with MapStruct trie
* Fix how match ID hash is stored/added
* Update fix for match ID vocab
* Switch from map_get_unless_missing to map_get
* Switch from numpy array to Token.get_struct_attr
  Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array.
  Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.)
* Restructure imports to export find_matches
* Implement full remove()
  Remove unnecessary trie paths and free unused maps.
  Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added.
* Switch to PhraseMatcher.find_matches
* Switch to local cdef functions for span filtering
* Switch special case reload threshold to variable
  Refer to variable instead of hard-coded threshold
* Move more of special case retokenize to cdef nogil
  Move as much of the special case retokenization to nogil as possible.
* Rewrap sort as stdsort for OS X
* Rewrap stdsort with specific types
* Switch to qsort
* Fix merge
* Improve cmp functions
* Fix realloc
* Fix realloc again
* Initialize span struct while retokenizing
* Temporarily skip retokenizing
* Revert "Move more of special case retokenize to cdef nogil"
  This reverts commit `0b7e52c797`.
* Revert "Switch to qsort"
  This reverts commit `a98d71a942`.
* Fix specials check while caching
* Modify URL test with emoticons
  The multiple suffix tests result in the emoticon `:>`, which is now retokenized into one token as a special case after the suffixes are split off.
* Refactor _apply_special_cases()
* Use cdef ints for span info used in multiple spots
* Modify _filter_special_spans() to prefer earlier
  Parallel to #4414, modify _filter_special_spans() so that the earlier span is preferred for overlapping spans of the same length.
* Replace MatchStruct with Entity
  Replace MatchStruct with Entity since the existing Entity struct is nearly identical.
* Replace Entity with more general SpanC
* Replace MatchStruct with SpanC
* Add error in debug-data if no dev docs are available (see #4575)
* Update azure-pipelines.yml
* Revert "Update azure-pipelines.yml"
  This reverts commit `ed1060cf59`.
* Use latest wasabi
* Reorganise install_requires
* add dframcy to universe.json (#4580)
* Update universe.json [ci skip]
* Fix multiprocessing for as_tuples=True (#4582)
* Fix conllu script (#4579)
* force extensions to avoid clash between example scripts
* fix arg order and default file encoding
* add example config for conllu script
* newline
* move extension definitions to main function
* few more encodings fixes
* Add load_from_docbin example [ci skip]
  TODO: upload the file somewhere
* Update README.md
* Add warnings about 3.8 (resolves #4593) [ci skip]
* Fixed typo: Added space between "recognize" and "various" (#4600)
* Fix DocBin.merge() example (#4599)
* Replace function registries with catalogue (#4584)
* Replace functions registries with catalogue
* Update __init__.py
* Fix test
* Revert unrelated flag [ci skip]
* Bugfix/dep matcher issue 4590 (#4601)
* add contributor agreement for prilopes
* add test for issue #4590
* fix on_match params for DependencyMacther (#4590)
* Minor updates to language example sentences (#4608)
* Add punctuation to Spanish example sentences
* Combine multilanguage examples for lang xx
* Add punctuation to nb examples
* Always realloc to a larger size
  Avoid potential (unlikely) edge case and cymem error seen in #4604.
* Add error in debug-data if no dev docs are available (see #4575)
* Update debug-data for GoldCorpus / Example
* Ignore None label in misaligned NER data

5hirish.md	Added Adam project to spaCy Universe (#2275)	2018-04-30 22:25:01 +02:00
ALSchwalm.md	Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977)	2018-11-28 19:49:33 +01:00
Azagh3l.md	Create Azagh3l.md (#3836)	2019-06-11 10:58:32 +02:00
Bharat123rox.md	Made changes suggested by @ines	2019-03-20 07:43:19 +05:30
BigstickCarpet.md	Better formatting for `spacy train` CLI (#2357)	2018-05-25 13:08:45 +02:00
BramVanroy.md	Documentation improvement regarding joblib and SO (#2867)	2018-10-24 15:19:17 +02:00
BreakBB.md	Fix symlink creation to show error message on failure (#3589) (resolves #3307))	2019-04-16 11:58:31 +02:00
Bri-Will.md	Adds contributor agreement for Bri-Will	2017-12-11 14:38:37 -08:00
Brixjohn.md	Added alpha support for Tagalog language (#3062)	2018-12-18 13:08:38 +01:00
Cinnamy.md	Correcting lang/ru/examples.py (#2845)	2018-10-13 15:19:43 +02:00
DeNeutoy.md	Allow vectors to be optional in init-model, more robust string counting (#3155)	2019-01-14 23:48:30 +01:00
DimaBryuhanov.md	DimaBryuhanov.md (#2590)	2018-07-24 18:43:27 +02:00
Dobita21.md	Create Dobita21.md (#3614)	2019-04-18 12:51:54 +02:00
DoomCoder.md	Improved polish tokenizer and stop words. (#2974)	2019-02-08 14:27:21 +11:00
DuyguA.md	added contributor agreement for DuyguA	2017-11-13 15:45:13 +01:00
EARL_GREYT.md	fix typo in first token (#4327)	2019-09-27 14:49:36 +02:00
Eleni170.md	Add support for Greek language (#2535)	2018-07-10 13:48:38 +02:00
EmilStenstrom.md	Add abbreviations from UD_Swedish-Talbanken (#2613)	2018-08-07 13:53:17 +02:00
F0rge1cE.md	Fix offset bug in loading pre-trained word2vec. (#3689)	2019-05-06 23:00:38 +02:00
FallakAsad.md	Bugfix/issue 3968 (#3982)	2019-07-18 00:20:32 +02:00
GiorgioPorgio.md	Port over contributor agreement from spacy-lookups-data [ci skip]	2019-10-25 13:06:10 +02:00
Gizzio.md	Improved polish tokenizer and stop words. (#2974)	2019-02-08 14:27:21 +11:00
Hazoom.md	Improve speed of _merge method (#4300)	2019-09-18 21:34:34 +02:00
HiromuHota.md	Tags are joined with a comma and padded with asterisks (#3491)	2019-03-28 16:17:31 +01:00
ICLRandD.md	Add entry for Blackstone in universe.json (#4101)	2019-08-09 17:16:51 +02:00
IsaacHaze.md	Adds contributor agreement IsaacHaze	2017-12-10 23:15:06 +01:00
JKhakpour.md	Add Persian(Farsi) language support (#2797)	2018-10-13 15:31:49 +02:00
Kimahriman.md	Fixed auto linking after download and added simple test to check	2018-01-29 14:25:21 -05:00
LRAbbade.md	Adding my contributor agreement (#2315)	2018-05-09 21:25:05 +02:00
Loghijiaha.md	Tamil language support (#3154)	2019-01-14 15:32:30 +01:00
MartinoMensio.md	added contributor agreement	2017-11-17 16:30:09 +01:00
MateuszOlko.md	Improved polish tokenizer and stop words. (#2974 )	2019-02-08 14:27:21 +11:00
MathiasDesch.md	Add spaCy Contributor Agreement	2017-11-09 11:56:47 +01:00
NSchrading.md	Re-add existing contributor agreements	2016-11-09 16:42:02 +01:00
NirantK.md	Create NirantK.md (#3807) [ci skip]	2019-06-01 17:36:06 +02:00
Pavle992.md	Stopwords for Serbian language. (#4078)	2019-08-05 10:22:27 +02:00
PeterGilles.md	Initial commit: New language Luxembourgish (lb) (#4424)	2019-10-14 12:27:50 +02:00
Poluglottos.md	Fix typo	2019-03-16 13:45:46 +01:00
PolyglotOpenstreetmap.md	Create PolyglotOpenstreetmap.md (#3198)	2019-01-26 14:02:54 +01:00
RvanNieuwpoort.md	Signed Contributer Agreement by Rob van Nieuwpoort	2016-12-15 10:34:19 +01:00
SamuelLKane.md	fix(util): fix decaying function output (#3495)	2019-03-28 13:24:47 +01:00
Schibsted.png	Add contributor agreement [ci skip]	2019-08-30 17:02:43 +02:00
aaronkub.md	fixing regex matcher examples (#3708) (#3719)	2019-05-10 14:23:52 +02:00
aashishg.md	Added numbers to ../lang/hi/lex_attrs.py (#2629)	2018-08-08 16:06:11 +02:00
abhi18av.md	Create abhi18av.md	2017-11-13 17:23:05 +05:30
adrianeboyd.md	Update TIGER/German dependency relations in documentation (#3204)	2019-01-30 14:23:12 +01:00
adrienball.md	Fix egg fragments in direct download (#3369)	2019-03-07 21:07:19 +01:00
ajrader.md	Correction of default lemmatizer lookup in English (Issue # 4104) (#4110)	2019-08-15 11:39:10 +02:00
akki2825.md	add kannada support (#3264)	2019-02-12 18:28:39 +01:00
akornilo.md	Update gold corpus code to properly ingest a directory of jsonl… (#4067)	2019-08-02 09:58:51 +02:00
alexvy86.md	Fix code sample for Doc.set_extension (#2282)	2018-05-02 10:16:05 +02:00
aliiae.md	Add Tatar Language Support (#2444)	2018-06-19 10:17:53 +02:00
alldefector.md	Port over contributor agreements	2018-03-24 17:17:37 +01:00
alvaroabascar.md	Fix issue 2396 (#3089)	2018-12-29 18:05:52 +01:00
alvations.md	Create alvations.md (#3119)	2019-01-05 13:11:06 +01:00
ameyuuno.md	added contributor agreement ameyuuno.md (#3925)	2019-07-09 10:09:52 +02:00
amitness.md	Fix broken link to Dive Into Python 3 website (#3656)	2019-04-29 19:44:00 +02:00
amperinet.md	add small fix for French lemmatizer (#3206)	2019-01-31 23:44:10 +01:00
aniruddha-adhikary.md	update bengali token rules for hyphen and digits (#2731)	2018-09-05 21:49:00 +02:00
ansgar-t.md	escape html in displacy.render (#2378) (closes #2361)	2018-05-28 18:36:41 +02:00
aongko.md	Update Indonesian model (#2752)	2018-09-14 12:30:32 +02:00
aristorinjuang.md	adding more words and rephrasing (#2351)	2018-05-24 11:40:57 +02:00
armsp.md	Update _training.jade (#2340)	2018-05-21 11:09:33 +02:00
aryaprabhudesai.md	Create aryaprabhudesai.md (#2681)	2018-08-20 18:56:14 +02:00
askhogan.md	Update example and sign contributor agreement (#3916)	2019-07-08 10:27:20 +02:00
avadhpatel.md	Signed contributor agreement	2018-01-17 06:33:37 -06:00
avramandrei.md	Added RONEC to spaCy Universe (#4151)	2019-08-20 14:46:07 +02:00
azarezade.md	add contributors.md	2018-01-23 13:47:30 +03:30
b1uec0in.md	Fix error when Korean text contains regexp special characters. (#4022)	2019-07-25 17:53:33 +02:00
bdewilde.md	Add contributor agreement	2017-11-20 11:28:31 -06:00
beatesi.md	Updated wordforms for Norwegian lemmatizer (#3007)	2018-12-06 15:46:18 +01:00
bellabie.md	Fix filename	2019-03-16 13:46:58 +01:00
bintay.md	most_similar() return the k most similar vectors (#4364)	2019-10-03 14:09:44 +02:00
bjascob.md	Update Universe Website for pyInflect (#3641)	2019-04-26 13:17:36 +02:00
boena.md	Updates to Swedish Language (#3164)	2019-01-16 13:45:50 +01:00
bryant1410.md	Fix website docs for Vectors.from_glove (#3565)	2019-04-10 15:23:27 +02:00
btrungchi.md	Fix loading tokenizer with custom prefix search (#2495)	2018-07-04 12:56:07 +02:00
calumcalder.md	Port over contributor agreements	2018-03-24 17:17:37 +01:00
cbilgili.md	Adds Canbey Bilgili's Contributor Agreement	2017-12-01 17:27:41 +03:00
cclauss.md	Create cclauss.md	2017-11-20 14:57:30 +01:00
cedar101.md	Korean support (#3901)	2019-07-09 22:23:16 +02:00
celikomer.md	Signed agreement (#3577)	2019-04-11 11:31:27 +02:00
charlax.md	Add charlax's contributor agreement (#2805)	2018-09-27 12:24:42 +02:00
chezou.md	Upadate the document for Unidic link with latest version URL (#3022)	2018-12-07 17:24:48 +01:00
chrisdubois.md	Re-add existing contributor agreements	2016-11-09 16:42:02 +01:00
cicorias.md	fixes symbolic link on py3 and windows (#2949)	2018-11-24 15:34:23 +01:00
clarus.md	Typo (#3865)	2019-06-20 10:31:19 +02:00
clippered.md	issue #3012: add test (#3021)	2018-12-18 15:02:49 +01:00
coryhurst.md	Silent keyword in info function in init (#2459)	2018-06-18 12:24:21 +02:00
d99kris.md	Rename d99kris to d99kris.md	2017-12-17 13:44:55 +01:00
danielhers.md	Signed contributor agreement	2017-11-08 16:28:56 +02:00
danielkingai2.md	Don't use numpy directly for similarity (#3362)	2019-03-06 22:58:38 +00:00
danielruf.md	chore: cache dependencies (#2418)	2018-06-11 00:22:41 +02:00
darindf.md	Fix error (#2802)	2018-09-26 21:31:03 +02:00
demfier.md	Port over contributor agreements	2017-10-24 20:13:34 +02:00
demongolem.md	Update tokenizer.md for construction example (#3790)	2019-06-16 14:32:56 +02:00
doug-descombaz.md	Port over contributor agreements	2018-03-24 17:17:37 +01:00
dvsrepo.md	Adds contributor agreement dvsrepo	2017-04-07 11:58:28 +02:00
elbaulp.md	Changed learning rate by its param name. (#3855)	2019-06-20 10:29:20 +02:00
emulbreh.md	Add contributor agreement for emulbreh	2018-02-13 13:40:33 +01:00
enerrio.md	add contributor agreement for @enerrio	2018-02-15 12:43:04 -08:00
er-raoniz.md	Fix example sentences in Hindi for grammatical errors (#4343)	2019-09-30 23:32:49 +02:00
estr4ng7d.md	Marathi Language Support (#3767)	2019-05-24 14:29:42 +02:00
filipecaixeta.md	Add words to portuguese language _num_words (#2759)	2018-09-14 12:30:16 +02:00
fizban99.md	Create fizban99.md (#3601)	2019-04-17 11:22:19 +02:00
foufaster.md	Create foufaster.md (#3179)	2019-01-21 15:45:54 +01:00
frascuchon.md	Include universe spec for spacy-wordnet component (#2919)	2018-11-13 23:54:46 +01:00
free-variation.md	Fixed spaCy+Keras example (#2763)	2018-09-15 13:06:39 +02:00
fsonntag.md	Add contributer aggreement	2017-11-19 16:30:35 +01:00
fucking-signup.md	Add contributor agreement	2018-01-08 03:08:57 +01:00
gavrieltal.md	Initialize trues to 0.0 in training example (#3004)	2018-12-03 01:33:22 +01:00
giannisdaras.md	Greek language optimizations (#2558)	2018-07-18 18:51:38 +02:00
graus.md	adds textpipe to universe (#3500) [ci skip]	2019-03-28 15:13:19 +01:00
greenriverrus.md	Added contributor agreement	2017-11-26 22:14:08 +03:00
grivaz.md	Introduces a bulk merge function, in order to solve issue #653 (#2696)	2018-09-10 16:41:42 +02:00
gustavengstrom.md	Adding noun_chunks to the Swedish language model (sv) (#4422)	2019-10-21 12:57:06 +02:00
henry860916.md	update response after calling add_pipe (#3661)	2019-05-01 12:02:18 +02:00
himkt.md	fix wrong indexing (#2416)	2018-06-19 10:20:57 +02:00
honnibal.md	Port over contributor agreements	2017-10-24 20:13:34 +02:00
howl-anderson.md	Port over contributor agreements	2018-03-24 17:17:37 +01:00
hugovk.md	CLA	2017-11-29 10:25:20 +02:00
iann0036.md	Port over contributor agreements	2018-03-24 17:17:37 +01:00
idealley.md	Added agrement (#2374)	2018-05-26 18:19:08 +02:00
ines.md	Port over contributor agreements	2017-10-24 20:13:34 +02:00
intrafindBreno.md	Create intrafindBreno.md (#3814)	2019-06-03 18:33:09 +02:00
isaric.md	Issue #1107 - adds examples.py for Croatian language (#4143)	2019-08-18 23:04:41 +02:00
ivigamberdiev.md	Update links and http -> https (#3532)	2019-04-02 17:36:22 +02:00
ivyleavedtoadflax.md	Add missing comma to NN example in docs (#2255)	2018-04-28 14:56:00 +02:00
jacopofar.md	Visual C++ link updated (#2842) (closes #2841) [ci skip]	2018-10-12 14:59:45 +02:00
janimo.md	Update Romanian stopword list (#2316)	2018-05-10 12:16:56 +02:00
jarib.md	Add three missing tags from the `nb` tag map (#3085)	2018-12-27 14:48:40 +01:00
jaydeepborkar.md	Update stop_words.py and add name in contributors (#4325)	2019-09-27 11:57:27 +02:00
jeannefukumaru.md	fix typos in tag_map flagged by `python -m debug-data` (#3542)	2019-04-05 12:06:09 +02:00
jenojp.md	Raise error if annotation dict in simple training style has unexpected keys #4074 (#4079)	2019-08-06 11:01:25 +02:00
jerbob92.md	Port over contributor agreements	2017-10-24 20:13:34 +02:00
jimregan.md	CLA	2017-06-26 21:32:48 +01:00
johnhaley81.md	Port over contributor agreements	2017-10-24 20:13:34 +02:00
juliamakogon.md	Ukrainian language added. Small fixes in Russian (#3241)	2019-02-07 21:05:11 +01:00
justindujardin.md	Port over contributor agreements	2018-03-24 17:17:37 +01:00
kabirkhan.md	Add optional `id` property to EntityRuler patterns (#3591)	2019-06-16 13:29:04 +02:00
katarkor.md	changed tag_map, morph_rules, lemmatizer for Norwegian (#2565)	2018-07-19 19:38:24 +02:00
katrinleinweber.md	Formalise citation info (#2167)	2018-03-30 10:34:14 +02:00
kbulygin.md	Fix the first `nlp` call for `ja` (closes #2901) (#3065)	2018-12-18 15:01:06 +01:00
keshan.md	Adding basic support for Sinhala language. (#2788)	2018-09-25 12:18:25 +02:00
khellan.md	Norwegian tweaks (#3894)	2019-07-08 10:28:47 +02:00
kimfalk.md	agreeing to the contributor agreement.	2017-12-19 15:31:52 +01:00
knoxdw.md	Test and fix for Issue #2219 (#2272)	2018-05-03 18:40:46 +02:00
kognate.md	Added support for serializing overwrite and ent_id_sep (#3918)	2019-07-08 17:28:28 +02:00
kororo.md	Add ExcelCy into Universe list (#2572)	2018-07-19 19:28:33 +02:00
kowaalczyk.md	Improved polish tokenizer and stop words. (#2974)	2019-02-08 14:27:21 +11:00
kwhumphreys.md	add agreement	2018-01-03 13:00:14 -08:00
lauraBaakman.md	Fix contributor agreement	2019-02-07 20:56:13 +01:00
ldorigo.md	Submit contributor agreement (#3705)	2019-05-10 14:19:18 +02:00
ligser.md	Fill contributer agreement	2017-11-11 11:39:31 +03:00
luvogels.md	Update luvogels.md	2017-04-27 10:42:07 +02:00
magnusburton.md	Initial commit for Swedish	2016-12-20 11:05:06 +01:00
markulrich.md	Use correct local parameter in example MyComponent (and added markulrich.md contributor file)	2017-11-22 15:59:08 -08:00
mauryaland.md	Update stop_words.py for French language (#2310)	2018-05-09 12:04:38 +02:00
mbkupfer.md	added contributor agreement for mbkupfer (#2738)	2018-09-10 11:32:03 +02:00
mdaudali.md	Correct typo for AllenAI url on homepage (#4050)	2019-07-31 00:16:33 +02:00
mdcclv.md	Port over contributor agreements	2017-10-24 20:13:34 +02:00
mdda.md	Create mdda.md	2017-12-18 18:09:27 +08:00
melanuria.pdf	Add contributor agreement (see #1672)	2017-12-20 22:00:12 +01:00
mihaigliga21.md	adding Romanian tag_map (#4257)	2019-09-09 11:53:09 +02:00
mikelibg.md	Removed space in docs + added contributor indo (#2909)	2018-11-08 14:18:25 +01:00
mirfan899.md	Add Urdu Language Support (#2430)	2018-06-22 11:14:03 +02:00
miroli.md	Remove incorrect lemma lookup gäng->gänga (#2252)	2018-04-28 14:54:41 +02:00
mn3mos.md	#2211 - Support for ssl certs config on download command (#2212)	2018-05-03 18:37:02 +02:00
mollerhoj.md	Add Danish lemmatizer (#2184)	2018-04-07 19:07:28 +02:00
moreymat.md	Support CUDA 10 (#3126)	2019-01-09 03:10:45 +01:00
mpszumowski.md	Fix bug in CLI iob and ner converter (#2392) (fixes #2385)	2018-05-30 12:28:44 +02:00
mpuig.md	Catalan Language Support (#2940)	2018-11-26 15:25:47 +01:00
msklvsk.md	fix UD data file extensions (#2425)	2018-06-08 14:26:11 +02:00
munozbravo.md	Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803))	2019-06-02 12:22:57 +02:00
neelkamath.md	Add "spaCy Server" to spaCy Universe (#4553)	2019-10-30 13:20:46 +01:00
nipunsadvilkar.md	Incorrect Token attribute ent_iob_ description (#3800)	2019-05-31 16:50:45 +02:00
njsmith.md	When calling getoption() in conftest.py, pass a default option (#2709)	2018-09-03 09:57:52 +02:00
nlptown.md	Improved Dutch language resources and Dutch lemmatization (#3409)	2019-04-03 14:13:26 +02:00
nourshalabi.md	Additions to Arabic stop words. (#2422)	2018-06-08 02:33:23 +02:00
ohenrik.md	Added contributors agreement	2018-01-25 11:05:29 +01:00
oroszgy.md	Accepted contributor agreement.	2016-12-26 22:37:02 +01:00
ottosulin.md	Port over contributor agreements	2018-03-24 17:17:37 +01:00
oxinabox.md	squashme	2018-02-09 23:19:11 +08:00
ozcankasal.md	trilyon forgotten (#3083)	2018-12-27 14:44:23 +01:00
pberba.md	Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464)	2019-10-20 01:28:18 +02:00
pbnsilva.md	Adds contributor agreement	2018-01-11 17:40:12 +01:00
phiedulxp.md	update lang/zh (#4103)	2019-08-12 10:37:48 +02:00
phojnacki.md	agreement of contributor, may I introduce a tiny pl languge contribution (#2799)	2018-09-27 12:25:22 +02:00
pickfire.md	Add myself to contributors (#3575)	2019-04-11 11:31:04 +02:00
pktippa.md	Added pktippa contributor agreement	2018-02-07 15:37:28 +05:30
pmbaumgartner.md	contributor agreement	2019-07-14 20:46:06 -04:00
polm.md	Port over contributor agreements	2017-10-24 20:13:34 +02:00
prilopes.md	Generalize handling of tokenizer special cases (#4259)	2019-11-13 21:24:35 +01:00
pzelasko.md	Less norm computations in token similarity (#2730)	2018-09-05 21:50:23 +02:00
ramananbalakrishnan.md	Support single value for attribute list in doc.to_array	2017-10-19 17:00:41 +05:30
retnuh.md	Update call to `mkdir()` to create the parents (#3139)	2019-01-11 03:02:18 +01:00
richardpaulhudson.md	Request to include Holmes in spaCy Universe (#3685)	2019-05-08 02:42:03 +02:00
rokasramas.md	Lithuanian language support (#3895)	2019-07-08 10:25:22 +02:00
roshni-b.md	updates for Bengali language (#3286)	2019-02-18 10:02:28 +01:00
ryanzhe.md	biluo_tags_from_offsets throw exception for overlapping entities (#4021)	2019-08-15 18:13:32 +02:00
sainathadapa.md	Basic support for Telugu language (#2751)	2018-09-10 11:53:18 +02:00
sammous.md	Updating description and code snippet spacy-lefff (#2623)	2018-08-02 17:25:27 +02:00
savkov.md	Renamed the file	2018-01-11 17:49:29 +00:00
seanBE.md	add return_matches and as_tuples back to Matcher.pipe (#4303)	2019-09-18 22:00:33 +02:00
shuvanon.md	Port over contributor agreements	2017-10-24 20:13:34 +02:00
skrcode.md	Restore contributor agreement	2018-03-31 14:06:37 +02:00
socool.md	Update Thai tokenizer_exception list (#3529)	2019-04-03 09:13:36 +02:00
sorenlind.md	Add contributor agreement.	2017-11-24 15:29:54 +01:00
suchow.md	Re-add existing contributor agreements	2016-11-09 16:42:02 +01:00
svlandeg.md	Fix small typo bug in French regexp + relevant unit test (#2980)	2018-11-29 20:16:13 +01:00
tamuhey.md	Fix iss4278 (#4279)	2019-09-12 10:44:49 +02:00
therealronnie.md	Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230)	2018-05-01 13:40:22 +02:00
thomasopsomer.md	add contributor agreement	2018-01-28 20:12:05 +01:00
tjkemp.md	Enhancement/lang fi examples (#2547)	2018-07-15 09:50:27 +02:00
tmetzl.md	Merge branch 'master' into develop [ci skip]	2019-03-11 12:23:24 +01:00
tokestermw.md	added contributor agreement	2017-11-17 17:27:20 -08:00
trungtv.md	Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155)	2018-03-29 12:19:51 +02:00
tyburam.md	Lex _attrs for polish language (#2750)	2018-09-10 11:53:57 +02:00
tzano.md	Add Arabic language (#2314)	2018-05-15 00:27:19 +02:00
ujwal-narayan.md	Enhancing Kannada language Resources (#3755)	2019-05-20 12:56:10 +02:00
ursachec.md	Add contributor agreement for @ursachec	2018-02-13 20:49:42 +01:00
uwol.md	added contributor agreement	2017-11-05 12:33:43 +01:00
veer-bains.md	Fixed syntax error in lang/ko when using python 2 (#4082) (closes #4068)	2019-08-05 10:19:32 +02:00
vikaskyadav.md	Create vikaskyadav.md (#2621)	2018-08-02 14:03:44 +02:00
vishnumenon.md	Fix the code for FACILITIY entities (#2324)	2018-05-12 15:19:17 +02:00
vsolovyov.md	Re-add existing contributor agreements	2016-11-09 16:42:02 +01:00
w4nderlust.md	Added Ludwig among the projects (#3548) [ci skip]	2019-04-07 13:01:26 +02:00
wallinm1.md	[finnish] Add contributor file	2017-02-04 13:54:10 +02:00
wannaphongcom.md	Update Thai tag map (#3480)	2019-03-25 16:53:26 +01:00
willismonroe.md	Port over contributor agreements	2018-03-24 17:17:37 +01:00
willprice.md	Improve random prefix generation in displaCy arcs (#3096)	2018-12-27 14:46:02 +01:00
wojtuch.md	User correct variable name in the examples (#2664)	2018-08-13 22:21:24 +02:00
wxv.md	Fix is_ascii documentation and create contributor file (#2988)	2018-11-30 15:57:58 +01:00
x-ji.md	Fix venv command examples (#2560) [ci skip]	2018-07-18 10:31:24 +02:00
xssChauhan.md	Change default output format from `jsonl` to `json` for cli convert (#3583) (closes #3523)	2019-04-12 11:31:23 +02:00
yanaiela.md	Custom entity render (#4117)	2019-08-16 18:39:25 +02:00
yaph.md	Create yaph.md so I can contribute (#3658)	2019-04-29 19:43:06 +02:00
yashpatadia.md	Add test file for issue (#3625) and spacy contributor agreement	2019-07-11 14:53:14 +05:30
yuukos.md	Port over contributor agreements	2017-10-24 20:13:34 +02:00
zhuorulin.md	Bugfix/fix wikidata train entity linker (#4509)	2019-10-24 12:52:59 +02:00
zqhZY.md	add contributors.md	2017-12-28 18:04:52 +08:00
zqianem.md	Fix typo in documentation (#4322)	2019-09-25 19:42:18 +02:00