Commit Graph

406 Commits

Author SHA1 Message Date
Peter Gilles 428887b8f2 Initial commit: New language Luxembourgish (lb) (#4424)
* new language: Luxembourgish (lb)

* update

* update

* Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md

* Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md

* Update norm_exceptions.py

* Delete README.md

* moved test_lemma.py

* deactivated 'lemma_lookup = LOOKUP'

* update

* Update conftest.py

* update

* tests updated

* import unicode_literals

* Update spacy/tests/lang/lb/test_text.py

Co-Authored-By: Ines Montani <ines@ines.io>

* Create PeterGilles.md
2019-10-14 12:27:50 +02:00
Ben Taylor 1db79a33cb most_similar() return the k most similar vectors (#4364)
* most_similar return n-most similar vectors

* updated most_similar comment

* add bintay contributor agreement

* sign bintay contributor agreement

* fix most_similar documentation typo

* fixed error in prune_vectors

* updated prune_vectors test
2019-10-03 14:09:44 +02:00
Rahul Soni ed620daa5c Fix example sentences in Hindi for grammatical errors (#4343)
* Fix grammar for hindi

* Fix grammar for hindi

* Submit contributor agreement
2019-09-30 23:32:49 +02:00
Ines Montani 159b72ed4c Delete main.yml 2019-09-29 15:58:59 +02:00
Ines Montani 539a7b53cd
Update main.yml 2019-09-29 15:55:26 +02:00
Ines Montani b7913c8eca
Update main.yml 2019-09-29 15:40:07 +02:00
Ines Montani eb2b60069e
Update main.yml 2019-09-29 15:33:53 +02:00
Ines Montani 70295f9e59
Update main.yml 2019-09-29 15:32:11 +02:00
Ines Montani b503270b09
Update main.yml 2019-09-29 15:30:31 +02:00
Ines Montani 52ea244830 Fix workflows 2019-09-29 15:30:13 +02:00
Ines Montani e9acfaec52 Revert "Revert "Rename workflows to _workflows""
This reverts commit 051fac51ee.
2019-09-29 15:29:02 +02:00
Ines Montani 051fac51ee Revert "Rename workflows to _workflows"
This reverts commit ba0027c936.
2019-09-29 15:28:59 +02:00
Ines Montani 7164c687e9 Revert "Merge branch 'master' of https://github.com/explosion/spaCy"
This reverts commit 41aab59dbf, reversing
changes made to ba0027c936.
2019-09-29 15:28:31 +02:00
Ines Montani 41aab59dbf Merge branch 'master' of https://github.com/explosion/spaCy 2019-09-29 15:26:32 +02:00
Ines Montani ba0027c936 Rename workflows to _workflows 2019-09-29 15:26:23 +02:00
Ines Montani 80f67f6065
Update build.yml 2019-09-29 15:24:28 +02:00
Ines Montani e787e6d47f
Update build.yml 2019-09-29 15:15:34 +02:00
Ines Montani b2f41e2a9b
Update build.yml 2019-09-29 15:06:19 +02:00
Ines Montani 8b02fff097
Update build.yml 2019-09-29 14:55:43 +02:00
Ines Montani ace0d5c580
Update build.yml 2019-09-29 14:52:01 +02:00
Ines Montani d32fb03401
Update build.yml 2019-09-29 14:48:21 +02:00
Ines Montani a5c0130b50
Update and rename pythonpackage.yml to build.yml 2019-09-29 14:43:48 +02:00
EarlGreyT 1e9e2d8aa1 fix typo in first token (#4327)
* fix typo in first token

The head of 'in' is review which has an offset of 4 and not 44

* added contributor agreement
2019-09-27 14:49:36 +02:00
Jaydeep Borkar 6a06a3fa6a Update stop_words.py and add name in contributors (#4325)
* Update stop_words.py and add name in contributors

* add jaydeepborkar.md in contributors directory

* Reset template [ci skip]


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-27 11:57:27 +02:00
Em Zhan aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Sean Löfgren 31c683d87d add return_matches and as_tuples back to Matcher.pipe (#4303)
* add contributor agreement [ci skip]

* add return_matches and as_tuples back to Matcher.pipe
2019-09-18 22:00:33 +02:00
Moshe Hazoom 72463b062f Improve speed of _merge method (#4300)
* make merge more efficient

* fix offsets

* merge works with relative indices

* remove printing

* Add the SCA

* fix SCA date

* more cythonize _retokenize.pyx

* more cythonize _retokenize.pyx

* fix only declaration in _retokenize.pyx

* switch back to absolute head

* switch back to absolute head

* fix comment

* merge from origin repo
2019-09-18 21:34:34 +02:00
tamuhey 71909cdf22 Fix iss4278 (#4279)
* fix: len(tuple) == 2

* (#4278) add fail test

* add contributor's aggreement
2019-09-12 10:44:49 +02:00
Mihai Gliga 25aecd504f adding Romanian tag_map (#4257)
* adding Romanian tag_map

* added SCA file

* forgotten import
2019-09-09 11:53:09 +02:00
Ines Montani bcd1b12f43 Add contributor agreement [ci skip] 2019-08-30 17:02:43 +02:00
Andrei-Marius Avram 199589228e Added RONEC to spaCy Universe (#4151)
* Added RONEC to spaCy Universe

* Added contributor file

* Corrected date from .github/contributors/avramandrei.md

* Convert tabs to spaces

* Remove duplicate keys

Can only have one GitHub link unfortunately

* Also add models category

* Adjust ID

This is used to generate the URL, so a simpler string is better
2019-08-20 14:46:07 +02:00
Ivan Šarić 434f6fa6c1 Issue #1107 - adds examples.py for Croatian language (#4143)
* adds contributor agreement for isaric

* adds examples.py for croatian language
2019-08-18 23:04:41 +02:00
yanaiela ec0beccaf1 Custom entity render (#4117)
* customizable template for entities display, allowing to pass additional parameters along each entity

* contributor agreement

* simpler naming for the additional parameters given to the span entities renderer

Co-Authored-By: Ines Montani <ines@ines.io>

* change of default parameter, as suggested

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-16 18:39:25 +02:00
Ziming He eea7d4f4a8 biluo_tags_from_offsets throw exception for overlapping entities (#4021)
* Check whether two entities overlap

- biluo_gold_biluo_overlap now throw exception when entities passed in have overlaps
- added unit test

* SCA agreement
2019-08-15 18:13:32 +02:00
AJ Rader 2f3648700c Correction of default lemmatizer lookup in English (Issue # 4104) (#4110)
* pytest file for issue4104 established

* edited default lookup english lemmatizer for spun; fixes issue 4102

* eliminated parameterization and sorted dictionary dependnency in issue 4104 test

* added contributor agreement
2019-08-15 11:39:10 +02:00
Ines Montani 5196dbd89d Delete wip.yml [ci skip] 2019-08-13 13:31:21 +02:00
Ines Montani 35c865024b Fix file name [ci skip] 2019-08-12 18:39:54 +02:00
Ines Montani 3a39154804 Create wip.yaml [ci skip] 2019-08-12 17:26:31 +02:00
黎谢鹏 250a54414b update lang/zh (#4103)
* update lang/zh

* update lang/zh
2019-08-12 10:37:48 +02:00
ICLR&D 87e40b17a0 Add entry for Blackstone in universe.json (#4101)
* Add entry for Blackstone in universe.json

Add an entry for the Blackstone project. Checked JSON is valid.

* Create ICLRandD.md

* Fix indentation (tabs to spaces)

It looks like during validation, the JSON file automatically changed spaces to tabs. This caused the diff to show *everything* as changed, which is obviously not true. This hopefully fixes that.

* Try to fix formatting for diff

* Fix diff


Co-authored-by: Ines Montani <ines@ines.io>
2019-08-09 17:16:51 +02:00
Jeno 15be09ceb0 Raise error if annotation dict in simple training style has unexpected keys #4074 (#4079)
* adding enhancement #4074.

* modified behavior to strictly require top level dictionary keys - issue #4074

* pass expected keys to error message and add links as expected top level key
2019-08-06 11:01:25 +02:00
Pavle Vidanović e1a935d71c Stopwords for Serbian language. (#4078)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated
2019-08-05 10:22:27 +02:00
veer-bains 874bd8c8dd Fixed syntax error in lang/ko when using python 2 (#4082) (closes #4068)
* fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py

* fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py

* Update __init__.py

* Create veer-bains.md

* Update __init__.py

fixed syntax errors in variable datatype assignment when calling spacy.blank("ko") with python 2.7
2019-08-05 10:19:32 +02:00
Anastassia 33b14724a5 Update gold corpus code to properly ingest a directory of jsonl… (#4067)
* Update gold corpus code to properly ingest a directory of jsonlines files

In response to: https://github.com/explosion/spaCy/issues/3975

* Update spacy/gold.pyx

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-02 09:58:51 +02:00
Mohammed Daudali 23ec07debd Correct typo for AllenAI url on homepage (#4050)
* Typo fix for AllenAI url

Changed incorrect home page url for AllenAI from appenai.org to allenai.org

* Sign contributor agreement

* Change date format
2019-07-31 00:16:33 +02:00
Bae Yong-Ju 05fbf5d976 Fix error when Korean text contains regexp special characters. (#4022) 2019-07-25 17:53:33 +02:00
Falak Asad ff1e73e35c Bugfix/issue 3968 (#3982)
* Fix for issue-3968

* Added contributor agreement

* Made suggested changes
2019-07-18 00:20:32 +02:00
pmbaumgartner 931e87f927 contributor agreement 2019-07-14 20:46:06 -04:00
yash d5311b3c42 Add test file for issue (#3625) and spacy contributor agreement 2019-07-11 14:53:14 +05:30
cedar101 58f06e6180 Korean support (#3901)
* start lang/ko

* add test codes

* using natto-py

* add test_ko_tokenizer_full_tags()

* spaCy contributor agreement

* external dependency for ko

* collections.namedtuple for python version < 3.5

* case fix

* tuple unpacking

* add jongseong(final consonant)

* apply mecab option

* Remove Pipfile for now


Co-authored-by: Ines Montani <ines@ines.io>
2019-07-09 22:23:16 +02:00
Alex a795fbd3b2 added contributor agreement ameyuuno.md (#3925)
@ines hi! 
I asked to change my username (yuukos -> ameyuuno). So I added a new contributor agreement.
2019-07-09 10:09:52 +02:00
Joshua Smith e8420ab2b7 Added support for serializing overwrite and ent_id_sep (#3918)
* Perserve flags in EntityRuler

The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized.  This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.

* add signed contributor agreement

* flake8 cleanup

mostly blank line issues.

* mark test from the issue as needing a model

The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.

* remove unneeded model loading

The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existings fixtures)

* change tempdir handling to be compatible with python 2.7

* Adds code to handle item saved before this change.

This code chanes how the save files are handled and how the bytes
are stored as well.  This code adds check to dispatch correctly
if it encounters bytes or files saved in the old format (and tests
for those cases).

* use util function for tempdir management

Updated after PR comments: this code now uses the make_tempdir function from util
instead of doing it by hand.
2019-07-08 17:28:28 +02:00
Knut O. Hellan a54f0cfc2b Norwegian tweaks (#3894)
* Norwegian fix

Add support for alternative past tense verb form (vaska).

* Norwegian months

Add all Norwegian months to tokenizer excpetions.

* More Norwegian abbreviations

Add more Norwegian abbreviations to tokenizer_exceptions.

* Contributor agreement khellan

Add signed contributor agreement for khellan (Knut O. Hellan).
2019-07-08 10:28:47 +02:00
Patrick Hogan 8c0586fd9c Update example and sign contributor agreement (#3916)
* Sign contributor agreement for askhogan

* Remove unneeded `seen_tokens` which is never used within the scope
2019-07-08 10:27:20 +02:00
Rokas Ramanauskas 61ce126d4c Lithuanian language support (#3895)
* initial LT lang support

* Added more stopwords. Started setting up some basic test environment (not complete)

* Initial morph rules for LT lang

* Closes #1 Adds tokenizer exceptions for Lithuanian

* Closes #5 Punctuation rules. Closes #6 Lexical Attributes

* test: add native examples to basic tests

* feat: add tag map for lt lang

* fix: remove undefined tag attribute 'Definite'

* feat: add lemmatizer for lt lang

* refactor: add new instances to lt lang morph rules; use tags from tag map

* refactor: add morph rules to lt lang defaults

* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup

* refactor: add capitalized words to lt lang lemmatizer

* refactor: add more num words to lt lang lex attrs

* refactor: update lt lang stop word set

* refactor: add new instances to lt lang tokenizer exceptions

* refactor: remove comments form lt lang init file

* refactor: use function instead of lambda in lt lex lang getter

* refactor: remove conversion to dict in lt init when dict is already provided

* chore: rename lt 'test_basic' to 'test_text'

* feat: add more lt text tests

* feat: add lemmatizer tests

* refactor: remove unused imports, add newline to end of file

* chore: add contributor agreement

* chore: change 'en' to 'lt' in lt example description

* fix: add missing encoding info

* style: add newline to end of file

* refactor: use python2 compatible syntax

* style: reformat code using black
2019-07-08 10:25:22 +02:00
Guillaume Claret d7a519a922 Typo (#3865)
* Typo

* Add contributor agreement
2019-06-20 10:31:19 +02:00
Alejandro Alcalde 4866a7ee9e Changed learning rate by its param name. (#3855)
* Changed learning rate by its param name.

I've been searching for a while how the parameter learning rate was named, with `beta1` and `beta2` its easy as they are marked as code, but learning rate wasn't. I think writing the actual parameter name would be helpful.

* Signing SCA
2019-06-20 10:29:20 +02:00
Greg Werner 9041a72d7f Update tokenizer.md for construction example (#3790)
* Update tokenizer.md for construction example

Self contained example.  You should really say what nlp is so that the example will work as is

* Update CONTRIBUTOR_AGREEMENT.md

* Restore contributor agreement

* Adjust construction examples
2019-06-16 14:32:56 +02:00
Kabir Khan 1e19f34e29 Add optional `id` property to EntityRuler patterns (#3591)
* Adding support for entity_id in EntityRuler pipeline component

* Adding Spacy Contributor aggreement

* Updating EntityRuler to use string.format instead of f strings

* Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity.

* Fixing tests

* Remove custom extension entity_id and use built in ent_id token attribute.

* Changing entity_id to ent_id for consistent naming

* entity_ids => ent_ids

* Removing kb, cleaning up tests, making util functions private, use rsplit instead of split
2019-06-16 13:29:04 +02:00
Azagh3l d0d56635ce Create Azagh3l.md (#3836) 2019-06-11 10:58:32 +02:00
intrafind 436a578369 Create intrafindBreno.md (#3814) 2019-06-03 18:33:09 +02:00
Germán 86eb817b74 Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803))
* (#3803) Spanish like_num returning false for number-like token

* (#3803) Spanish like_num now returning True for number-like token
2019-06-02 12:22:57 +02:00
Nirant a5d92a3035 Create NirantK.md (#3807) [ci skip] 2019-06-01 17:36:06 +02:00
Nipun Sadvilkar 1f13005751 Incorrect Token attribute ent_iob_ description (#3800)
* Incorrect Token attribute ent_iob_ description

* Add spaCy contributor agreement
2019-05-31 16:50:45 +02:00
estr4ng7d 604acb6ace Marathi Language Support (#3767)
* Adding Marathi language details and folder to it

* Adding few changes and running tests

* Adding few changes and running tests

* Update __init__.py

mh -> mr

* Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py

* mh -> mr
2019-05-24 14:29:42 +02:00
Ujwal Narayan 4d550a3055 Enhancing Kannada language Resources (#3755)
* Updated stop_words.py

Added more stopwords

* Create ujwal-narayan.md

Enhancing Kannada language resources
2019-05-20 12:56:10 +02:00
Aaron Kub 719a15f23d fixing regex matcher examples (#3708) (#3719) 2019-05-10 14:23:52 +02:00
Luca Dorigo 2663f4133c Submit contributor agreement (#3705) 2019-05-10 14:19:18 +02:00
richardpaulhudson a1e07f0d14 Request to include Holmes in spaCy Universe (#3685)
* Request to add Holmes to spaCy Universe

Dear spaCy team, I would be grateful if you would consider my Python library Holmes for inclusion in the spaCy Universe. Holmes transforms the syntactic structures delivered by spaCy into semantic structures that, together with various other techniques including ontological matching and word embeddings, serve as the basis for information extraction. Holmes supports several use cases including chatbot, structured search, topic matching and supervised document classification. I had the basic idea for Holmes around 15 years ago and now spaCy has made it possible to build an implementation that is stable and fast enough to actually be of use - thank you! At present Holmes supports English and German (I am based in Munich) but could easily be extended to support any other language with a spaCy model.

* Added
2019-05-08 02:42:03 +02:00
F0rge1cE dd1e6b0bc6 Fix offset bug in loading pre-trained word2vec. (#3689)
* Fix offset bug in loading pre-trained word2vec.

* add contributor agreement
2019-05-06 23:00:38 +02:00
张晓飞 ba1ff00370 update response after calling add_pipe (#3661)
* update response after calling add_pipe

component:print_info is appened in the last, so need show it at the end of  pipeline

* Create henry860916.md
2019-05-01 12:02:18 +02:00
Amit Chaudhary 167d63af31 Fix broken link to Dive Into Python 3 website (#3656)
* Fix broken link to Dive Into Python 3 website

* Sign spaCy Contributor Agreement
2019-04-29 19:44:00 +02:00
Ramiro Gómez e7e5999ddc Create yaph.md so I can contribute (#3658) 2019-04-29 19:43:06 +02:00
Brad Jascob 9afa0d6723 Update Universe Website for pyInflect (#3641) 2019-04-26 13:17:36 +02:00
Dobita21 d86848cf1f Create Dobita21.md (#3614)
<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-04-18 12:51:54 +02:00
fizban99 57d4a8bf3d Create fizban99.md (#3601) 2019-04-17 11:22:19 +02:00
BreakBB 5b8dbe4975 Fix symlink creation to show error message on failure (#3589) (resolves #3307))
* Fix symlink creation to show error message on failure. Update tests to reflect those changes.

* Fix test to succeed on non windows systems.
2019-04-16 11:58:31 +02:00
Shikhar Chauhan bbf6f9f764 Change default output format from `jsonl` to `json` for cli convert (#3583) (closes #3523)
* Changing default ouput format from jsonl to json for cli convert

* Adding Contributor Agreement
2019-04-12 11:31:23 +02:00
Omer Celik 034a1f458b Signed agreement (#3577) 2019-04-11 11:31:27 +02:00
Ivan Tham 71710e2454 Add myself to contributors (#3575) 2019-04-11 11:31:04 +02:00
Santiago Castro 86e4b68aa9 Fix website docs for Vectors.from_glove (#3565)
* Fix website docs for Vectors.from_glove

* Add myself as a contributor
2019-04-10 15:23:27 +02:00
Piero Molino 5198aa4ae6 Added Ludwig among the projects (#3548) [ci skip]
* Added Ludwig among the projects

* Create w4nderlust.md

* Add Uber to logo wall
2019-04-07 13:01:26 +02:00
jeannefukumaru f67d881b30 fix typos in tag_map flagged by `python -m debug-data` (#3542)
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.


Co-authored-by: Ines Montani <ines@ines.io>
2019-04-05 12:06:09 +02:00
Yves Peirsman 951825532c Improved Dutch language resources and Dutch lemmatization (#3409)
* Improved Dutch language resources and Dutch lemmatization

* Fix conftest

* Update punctuation.py

* Auto-format

* Format and fix tests

* Remove unused test file

* Re-add deleted test

* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains

* Cleaner lemmatization files
2019-04-03 14:13:26 +02:00
Kamolsit Mongkolsrisawat dcc67f3f51 Update Thai tokenizer_exception list (#3529)
* add tokenizer_exceptions word (ก-น) from https://goo.gl/JpJ2qq

* update tokenizer_exceptions word list

* add contributor file
2019-04-03 09:13:36 +02:00
ivigamberdiev 5e5641616d Update links and http -> https (#3532)
* update links and http -> https

* SCA
2019-04-02 17:36:22 +02:00
Hiromu Hota 914b9ff3d2 Tags are joined with a comma and padded with asterisks (#3491)
<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

Bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-28 16:17:31 +01:00
David 74e738dd4d adds textpipe to universe (#3500) [ci skip]
* Adds textpipe to universe

* signed contributor agreement

* Adjust formatting, code style and use "standalone" category
2019-03-28 15:13:19 +01:00
Samuel Kane 06a1846379 fix(util): fix decaying function output (#3495)
* fix(util): fix decaying function output

* fix(util): better test and adhere to code standards

* fix(util): correct variable name, pytestify test, update website text
2019-03-28 13:24:47 +01:00
Wannaphong Phatthiyaphaibun 297a051992 Update Thai tag map (#3480)
* Update Thai tag map

Update Thai tag map

* Create wannaphongcom.md
2019-03-25 16:53:26 +01:00
Bharat123Rox f2547f02d6 Made changes suggested by @ines 2019-03-20 07:43:19 +05:30
Bharat123Rox b5f077dcf4 Sign the Contributor Agreement and update details 2019-03-19 23:07:54 +05:30
Ines Montani f6ffbe1fd3 Fix filename 2019-03-16 13:46:58 +01:00
Ines Montani fb53eb570f Fix typo 2019-03-16 13:45:46 +01:00
Ryan Ford 00842d7f1b Merging conversion scripts for conll formats (#3405)
* merging conllu/conll and conllubio scripts

* tabs to spaces

* removing conllubio2json from converters/__init__.py

* Move not-really-CLI tests to misc

* Add converter test using no-ud data

* Fix test I broke

* removing include_biluo parameter

* fixing read_conllx

* remove include_biluo from convert.py
2019-03-15 18:14:46 +01:00
Ines Montani e77220e3ae Merge branch 'master' into develop [ci skip] 2019-03-11 12:23:24 +01:00
Ines Montani daaeeb7a2b Merge branch 'master' into develop 2019-03-07 22:07:31 +01:00
Adrien Ball 88909a9adb Fix egg fragments in direct download (#3369)
## Description
The egg fragment in the URL must be of the form `#egg=package_name==version` instead of `#egg=package_name-version`.
One of the consequences of specifying wrong egg fragments is that `pip` does not recognize the package and its version properly, and thus it re-downloads the package systematically.

I'm not sure how this should be tested properly. 
Here is what I had before the fix when running the same direct download twice:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 1.6MB/s
  Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 919kB/s
  Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Requirement already satisfied (use --upgrade to upgrade): en-core-web-sm from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 in ./venv3/lib/python3.6/site-packages
```

And after the fix:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 1.1MB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in ./venv3/lib/python3.6/site-packages (2.0.0)
```

### Types of change
This is an enhancement as it avoids unnecessary downloads of (potentially big) spacy models, when they have already been downloaded.

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-07 21:07:19 +01:00
Ines Montani a8f1efd2f5 Merge branch 'master' into develop 2019-03-07 00:56:31 +01:00
Daniel King 5f40229397 Don't use numpy directly for similarity (#3362)
* Don't use numpy directly for similarity

* Contributor agreement
2019-03-06 22:58:38 +00:00
Ines Montani 3fdcdec6a0 Merge branch 'master' into develop 2019-02-18 10:03:32 +01:00
Roshni Biswas e09f1347fa updates for Bengali language (#3286)
* Update morph_rules.py

* contributor agreement for roshni-b

* created example sentences
2019-02-18 10:02:28 +01:00
Ines Montani 0184a95340 Merge branch 'master' into develop 2019-02-12 18:29:24 +01:00
Akhilesh a78db10941 add kannada support (#3264)
* add kannada support

* add few more stop words

* add support for Kannada Language
2019-02-12 18:28:39 +01:00
Ines Montani 9e652afa4b Merge branch 'master' into develop 2019-02-08 13:28:09 +01:00
Stanisław Giziński 1448ad100c Improved polish tokenizer and stop words. (#2974)
* Improved stop words list

* Removed some wrong stop words form list

* Improved stop words list

* Removed some wrong stop words form list

* Improved Polish Tokenizer (#38)

* Add tests for polish tokenizer

* Add polish tokenizer exceptions

* Don't split any words containing hyphens

* Fix test case with wrong model answer

* Remove commented out line of code until better solution is found

* Add source srx' license

* Rename exception_list.py to match spaCy conventionality

* Add a brief explanation of where the exception list comes from

* Add newline after reach exception

* Rename COPYING.txt to LICENSE

* Delete old files

* Add header to the license

* Agreements signed

* Stanisław Giziński agreement

* Krzysztof Kowalczyk - signed agreement

* Mateusz Olko agreement

* Add DoomCoder's contributor agreement

* Improve like number checking in polish lang


* like num tests added

* all from SI system added

* Final licence and removed splitting exceptions

* Added polish stop words to LEX_ATTRA

* Add encoding info to pl tokenizer exceptions
2019-02-08 14:27:21 +11:00
Ines Montani e2d93e4852 Merge branch 'master' into develop 2019-02-07 21:10:08 +01:00
Ines Montani 18205c6c48 Update company name 2019-02-07 21:06:55 +01:00
Julia Makogon b41d64825a Ukrainian language added. Small fixes in Russian (#3241)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement
2019-02-07 21:05:11 +01:00
Ines Montani f7e4674423 Fix contributor agreement 2019-02-07 20:56:13 +01:00
Ines Montani 4684195822
Rename contributer_agreement.md to .github/contributors/lauraBaakman.md 2019-02-07 20:55:53 +01:00
Ines Montani 5d0b60999d Merge branch 'master' into develop 2019-02-07 20:54:07 +01:00
Amandine Périnet b34bc9d2e9 add small fix for French lemmatizer (#3206) 2019-01-31 23:44:10 +01:00
adrianeboyd 03d58f9feb Update TIGER/German dependency relations in documentation (#3204)
* Add missing dependency relations for TIGER/German

* Contributor agreement for adrianeboyd
2019-01-30 14:23:12 +01:00
Jo f9ca09caa0 Create PolyglotOpenstreetmap.md (#3198)
* Create PolyglotOpenstreetmap.md

* forgot to tick that box
2019-01-26 14:02:54 +01:00
foufaster 8b61b6a6b5 Create foufaster.md (#3179) 2019-01-21 15:45:54 +01:00
Björn Lennartsson b892b446cc Updates to Swedish Language (#3164)
* Added the same punctuation rules as danish language.

* Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too

* Added test for long texts in swedish

* Added morph rules, infixes and suffixes to __init__.py for swedish

* Added some tests for prefixes, infixes and suffixes

* Added tests for lemma

* Renamed files to follow convention

* [sv] Removed ambigious abbreviations

* Added more tests for tokenizer exceptions

* Added test for problem with punctuation in issue #2578

* Contributor agreement

* Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')
2019-01-16 13:45:50 +01:00
Mark Neumann e599ed9ef8 Allow vectors to be optional in init-model, more robust string counting (#3155)
* more robust init-model

* key not word

* add license agreement
2019-01-14 23:48:30 +01:00
Loghi d97661d18b Tamil language support (#3154)
Tamil language support to spaCy
Description

Hereby, creating new PR to add support for Tamil language in spaCy

    added stop words, examples and numerical attributes
    <--Working on other language data-->

Types of change

Enhancement
Checklist

    [ x] I have submitted the spaCy Contributor Agreement.
    [x ] I ran the tests, and all new and existing tests passed.
    [ x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-01-14 15:32:30 +01:00
Hunter Kelly f28a1c7271 Update call to `mkdir()` to create the parents (#3139)
* Update call to `mkdir()` to create the parents

- Update the call to `output_dir.mkdir()` to also create the parents if needed

* don't automatically create parents but fail fast if cannot create directory

* add signed contributors agreement for retnuh
2019-01-11 03:02:18 +01:00
Amandine Périnet ee24e2534d French lemmatization: adding lemmas for adverbs and irregular lemmas for function words (#3131)
* adding adverbs and irregular cases for empty words

* adding adverbs and irregular cases for empty words

* adding adverbs and irregular cases for empty words

* updating contributor agreement for amperinet
2019-01-10 15:41:15 +01:00
Mathieu Morey f07b577fbd Support CUDA 10 (#3126)
* ENH support CUDA 10

* Update _instructions.jade
2019-01-09 03:10:45 +01:00
Amandine Périnet eef11a7a2c French lemmatization: correcting wrong lemmas in the lookup dictionnary (#3104)
* modifying French lookup that contained wrong lemmas

* correcting wrong line breaks on hyphen

* adding contributor agreement for amperinet@

* correcting a typo
2019-01-07 14:15:19 +01:00
alvations 9972716e01 Create alvations.md (#3119) 2019-01-05 13:11:06 +01:00
Álvaro Abella Bascarán 9bc4cc1352 Fix issue 2396 (#3089)
* Test on #2396: bug in Doc.get_lca_matrix()

* reimplementation of Doc.get_lca_matrix(), (closes #2396)

* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix

* use memory view instead of np.ndarray in _get_lca_matrix (faster)

* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview

* cleaner conditional, add comment
2018-12-29 18:05:52 +01:00
Álvaro Abella Bascarán 6fe276f85d Fix issue 2396 (#3089)
* Test on #2396: bug in Doc.get_lca_matrix()

* reimplementation of Doc.get_lca_matrix(), (closes #2396)

* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix

* use memory view instead of np.ndarray in _get_lca_matrix (faster)

* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview

* cleaner conditional, add comment
2018-12-29 18:02:26 +01:00
Jari Bakken e172f2478e Add three missing tags from the `nb` tag map (#3085)
* Contributors agreement for jarib

* Add tags from the UD/NORNE dataset that is missing in the nb tag map. Relates to #3082.
2018-12-27 14:48:40 +01:00
Will Price 4a6af0852a Improve random prefix generation in displaCy arcs (#3096)
* Improve random prefix generation in displaCy arcs

* Add @willprice contributor agreement
2018-12-27 14:46:02 +01:00
Özcan Kasal b573ebca77 trilyon forgotten (#3083)
* trilyon forgotten

* contributor added
2018-12-27 14:44:23 +01:00
Ken 5f0c5fbfa4 issue #3012: add test (#3021)
* issue #3012: add test

* add contributor aggreement

* Make test work without models and fix typos

ten.pos_ instead of ten.orth_ and comparison against "10" instead of integer 10
2018-12-18 15:02:49 +01:00
Kirill Bulygin 2fb004832f Fix the first `nlp` call for `ja` (closes #2901) (#3065)
* Fix the first `nlp` call for `ja` (closes #2901)

* Add unicode declaration, formatting and use relative import
2018-12-18 15:01:06 +01:00
Kirill Bulygin 10189d9092 Fix the first `nlp` call for `ja` (closes #2901) (#3065)
* Fix the first `nlp` call for `ja` (closes #2901)

* Add unicode declaration, formatting and use relative import
2018-12-18 14:53:50 +01:00
Brixjohn 52f3c95004 Added alpha support for Tagalog language (#3062)
I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format to the EN and ES languages.

I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to its Tagalog counterpart, added some tokenizer exceptions, and kept the tag map the same as the English language.

While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases.

* Added alpha support for Tagalog language

* Edited contributor template

* Included SCA; Reverted templates

* Fixed SCA template

* Fixed changes in SCA template
2018-12-18 13:08:38 +01:00
Amandine Périnet 361554f629 Lemmatization of Adjectives - French : adding rules and vocabulary (#3045)
* modifying FR lemmatisation for Adjectives

* adding contributor agreement for amperinet

* correcting some errors in vocabulary files
2018-12-16 18:11:07 +01:00
Aki Ariga 7fcd6419ff Upadate the document for Unidic link with latest version URL (#3022)
* Upadate Unidic link for latest version in document

This patch improves #3017 . The link for Unidic was old version one, so will the lates version.

* Add contributor agreement

* Use more specific link for unidic-cwj
2018-12-07 17:24:48 +01:00
Amandine Périnet 2457318b7a Lemmatization of Verbs - French : adding rules and vocabulary (#3006)
* updating rules and vocabulary for French lemmatization of verbs

* updating the file with French auxiliary verb

* updating rules and vocabulary for French lemmatization of verbs

* adding contributor agreement for amperinet

* adding rules for words with inclusive parentheses wrongly tokenized
2018-12-06 15:49:28 +01:00
Beate Sildnes f0d7e206ec Updated wordforms for Norwegian lemmatizer (#3007)
* Updated wordforms for Norwegian lemmatizer

Upload of updated lists of wordforms for the Norwegian lemmatizer (nouns, verbs, adverbs, adjectives and lookup).

* Add spaCy contributor agreement for user beatesi

*  Updated wordforms for Norwegian lemmatizer
2018-12-06 15:46:18 +01:00
Gavriel Loria ae5601beae Initialize trues to 0.0 in training example (#3004)
* added contributor agreement

* if there are no true positives, precision should be 0.0
2018-12-03 01:33:22 +01:00
wxv 06820ef6e7 Fix is_ascii documentation and create contributor file (#2988)
Proposed in #2933
2018-11-30 15:57:58 +01:00
Sofie 585de273cd Fix small typo bug in French regexp + relevant unit test (#2980)
* additional unit test for new entr word not in other lists

* bugfix - unit test works

* use _latin_lower instead of alpha_lower for french

* revert back to ALPHA_LOWER (following the code for languages)

* contributor agreement
2018-11-29 20:16:13 +01:00
Adam Schwalm 00566949de Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977)
Fixes #2976
2018-11-28 19:49:33 +01:00
Marc Puig 98fe1ab259 Catalan Language Support (#2940)
* Catalan language Support

* Ddding Catalan to documentation
2018-11-26 15:25:47 +01:00
Shawn Cicoria 7601ae0cff fixes symbolic link on py3 and windows (#2949)
* fixes symbolic link on py3 and windows
during setup of spacy using command
python -m spacy link en_core_web_sm en
closes #2948

* Update spacy/compat.py

Co-Authored-By: cicorias <cicorias@users.noreply.github.com>
2018-11-24 15:34:23 +01:00
Francisco Aranda be99f1cac5 Include universe spec for spacy-wordnet component (#2919)
* feat: include universe spec for spacy-wordnet component

* chore: include spaCy contributor agreement
2018-11-13 23:54:46 +01:00
mikelibg 75e7d503b7 Removed space in docs + added contributor indo (#2909)
* - removed unneeded space in documentation

* - added contributor info
2018-11-08 14:18:25 +01:00
Bram Vanroy 071789467e Documentation improvement regarding joblib and SO (#2867)
Some documentation improvements

## Description
1. Fixed the dead URL to joblib
2. Fixed Stack Overflow brand name (with space)

### Types of change
Documentation

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-24 15:19:17 +02:00
JKhakpour 74a30d883c Add Persian(Farsi) language support (#2797) 2018-10-13 15:31:49 +02:00
Marina Lysyuk b76fe08308 Correcting lang/ru/examples.py (#2845)
* Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement

* Correct some grammatical inaccuracies in lang\ru\examples.py

* Move contributor agreement to separate file
2018-10-13 15:19:43 +02:00
Jacopo Farina 42c42376a3 Visual C++ link updated (#2842) (closes #2841) [ci skip]
* New landing page

* Add contribution agreement
2018-10-12 14:59:45 +02:00
Przemysław Hojnacki 966b583d5e agreement of contributor, may I introduce a tiny pl languge contribution (#2799)
* Contributors agreement

* Contributors agreement

* Contributors agreement
2018-09-27 12:25:22 +02:00