
8686 Commits

Author SHA1 Message Date
Matthew Honnibal 546dd99cdf Merge master into develop -- mostly Arabic and website 2018-05-15 18:14:28 +02:00
Matthew Honnibal 581d318971 Fix conftest 2018-05-15 00:54:45 +02:00
Tahar Zanouda 00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses 0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
vishnumenon ae3719ece5 Fix the code for FACILITY entities (#2324)
* Fix the code for FACILITY entities

As far as I can tell, the default models all use "FAC" rather than "FACILITY"

* Added my Contributor Agreement

* Rename vishnumenon to vishnumenon.md
2018-05-12 15:19:17 +02:00
Jani Monoses 42b34832e4 Update Romanian stopword list (#2316)
* Contributor agreement for janimo

* Update Romanian stopword list

Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).

Add another unrelated spelling fix.

See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
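The cedilla-to-comma substitution described in the Romanian commits can be sketched with the standard library alone. This is a minimal illustration, not the code from the commit: the function name `normalize_romanian` and the mapping table are assumptions made for the example.

```python
# Hypothetical helper, not taken from the spaCy repository.
# Maps the legacy cedilla letters to the comma-below letters
# used in standard Romanian orthography.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş (s with cedilla) -> ș (s with comma below)
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0163": "\u021b",  # ţ (t with cedilla) -> ț (t with comma below)
    "\u0162": "\u021a",  # Ţ -> Ț
})


def normalize_romanian(text: str) -> str:
    """Return text with cedilla variants replaced by comma-below letters."""
    return text.translate(CEDILLA_TO_COMMA)
```

Applying a function like this to every entry of a word list would yield the comma-below spellings; whether to keep the cedilla spellings alongside them is a separate choice.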
Lucas Abbade 18af53014f Adding my contributor agreement (#2315)
* Create LRAbbade.md

* Update LRAbbade.md
2018-05-09 21:25:05 +02:00
Lucas Abbade be7fdc59d1 Update lex_attrs.py (#2307)
* Update lex_attrs.py

Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).

* Update lex_attrs.py

As requested, I've included the correct spelling for both Brazilian Portuguese and European Portuguese.

I would advise, however, that the two be separated in the future. Brazilian Portuguese is very different from European Portuguese: although most of the writing is unified, the way people talk in the two countries is radically different. Keeping both as one language may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00
mauryaland 5368ba028a Update stop_words.py for French language (#2310)
* Add contraction forms of some common stopwords

All the stopwords added contain the apostrophe "'" or "’".

* Adds contributor agreement mauryaland

* Update mauryaland.md
2018-05-09 12:04:38 +02:00
ines 7a3599c21a Fix formatting and consistency 2018-05-07 23:02:11 +02:00
ines 37facf9b4d Add config for no-response [ci skip] 2018-05-07 22:04:54 +02:00
ines ac25bc4016 Add docs section on sentence segmentation [ci skip] 2018-05-07 21:25:20 +02:00
ines 14148cd147 Fix formatting and wording 2018-05-07 21:24:35 +02:00
ines f803da609f Add scattertext [ci skip] 2018-05-07 19:10:23 +02:00
ines a685fff875 Merge branch 'master' of https://github.com/explosion/spaCy 2018-05-07 18:58:57 +02:00
ines e2241c797c Add lock-threads configuration [ci skip] 2018-05-07 18:54:22 +02:00
B! 414f5270b3 B Cavello's signed Contributor Agreement v2 (#2302)
This time hopefully created in the right spot. (Sorry about that!)
2018-05-07 17:48:54 +02:00
Matthew Honnibal f56bd4736b Improve dynamic oracle when values are missing in parse 2018-05-07 15:53:18 +02:00
Matthew Honnibal eddc0e0c74 Set gold.sent_starts in ud_train 2018-05-07 15:52:47 +02:00
Matthew Honnibal bf19f22340 Allow gold.sent_starts to be set from Python 2018-05-07 15:51:34 +02:00
Matt Upson 9a1d3b63fb Add missing default to .set_extension (#2297)
Failing to set a default, method, or getter results in a ValueError:

ValueError: [E083] Error setting extension: only one of `default`, `method`, or `getter` (plus optional `setter`) is allowed. Got: 0
2018-05-04 18:47:01 +02:00
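The rule quoted in the E083 message can be sketched as a small standalone check. This is an assumption-laden illustration of the constraint only: `validate_extension` is a hypothetical name, the error text is paraphrased, and spaCy's actual validation may be implemented differently.

```python
# Hypothetical sketch of the "only one of default, method, or getter" rule.
# This is not spaCy's implementation.
def validate_extension(default=None, method=None, getter=None, setter=None):
    """Raise ValueError unless exactly one of default/method/getter is given."""
    # Count how many of the three mutually exclusive options were supplied.
    defined = sum(value is not None for value in (default, method, getter))
    if defined != 1:
        raise ValueError(
            "only one of default, method, or getter "
            "(plus optional setter) is allowed. Got: %d" % defined
        )
    # A setter is optional and does not count towards the limit.
    return True
```

Under this sketch, a call with no arguments raises with "Got: 0", the situation the commit above fixes by supplying a default.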
ines 929a01139a Order issue templates 2018-05-04 03:04:41 +02:00
Ines Montani 7f39c8896b Update issue templates (#2295)
* Update issue templates

* Update templates
2018-05-04 03:02:26 +02:00
Douglas Knox 9b49a40f4e Test and fix for Issue #2219 (#2272)
Test and fix for Issue #2219: Token.similarity() failed for a single-letter token
2018-05-03 18:40:46 +02:00
Paul O'Leary McCann bd72fbf09c Port Japanese mecab tokenizer from v1 (#2036)
* Port Japanese mecab tokenizer from v1

This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.

As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.

Things to check:

1. Is this the right way to use a token extension?

2. What's the right way to implement a JapaneseTagger? The approach in
 #1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?

-POLM

* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
G.Pruvost cc8e804648 #2211 - Support for ssl certs config on download command (#2212)
* Add support for SSL/Certs customization on download CLI

* Add a note on SSL options for the 'download' CLI in the README

* Add contributor agreement
2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj b9290397fb rename SP to _SP (#2289) 2018-05-03 18:33:49 +02:00
ines c9547b7b8b Update Juniper (see #2293) 2018-05-03 15:36:02 +02:00
Matthew Honnibal a8e70a4187 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-03 14:02:10 +02:00
Matthew Honnibal c0e596283b Set version to 2.1.0a0 2018-05-03 14:00:11 +02:00
Alex Villarreal 647f2544c5 Fix code sample for span.set_extension (#2286) 2018-05-03 00:39:22 +02:00
Matthew Honnibal 8cd06cc763 Try to fix root-outside-sentence bug 2018-05-02 14:39:48 +00:00
Matthew Honnibal acebd01033 Set children from heads in finalize doc 2018-05-02 14:19:22 +00:00
Alex Villarreal 13d562e1a4 Fix code sample for Doc.set_extension (#2282)
* Fix code sample for `set_extension`

The previous sample code for `set_extension` failed the assertion at the end, because `city_getter` checked whether the whole document text matched one of the city names. Now it checks whether any of the city names is contained in the document text.

* Contributor agreement
2018-05-02 10:16:05 +02:00
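The difference between the two getter behaviours described in the commit above can be shown on plain strings, without spaCy. This is a sketch under stated assumptions: the city list, the sample sentence, and both function names are invented for illustration, and a real getter would receive a `Doc` and read its text.

```python
# Hypothetical city list and getters operating on plain strings.
CITIES = ("New York", "Paris", "Berlin")


def whole_text_getter(text: str) -> bool:
    """Old behaviour: true only if the entire text is a city name."""
    return text in CITIES


def containment_getter(text: str) -> bool:
    """Fixed behaviour: true if any city name occurs inside the text."""
    return any(city in text for city in CITIES)


sample = "I like New York"
old_result = whole_text_getter(sample)   # whole sentence is not a city name
new_result = containment_getter(sample)  # "New York" occurs in the sentence
```

With the tuple membership test, a sentence that merely mentions a city never matches, which is why the assertion in the original sample failed.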
Matthew Honnibal 569440a6db Don't normalize gradient by batch size 2018-05-02 08:42:10 +02:00
Matthew Honnibal 281e29cbcd Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-05-02 01:36:23 +00:00
Matthew Honnibal 2338e8c7fc Update develop from master 2018-05-02 01:36:12 +00:00
Matthew Honnibal 9d147e12c4 Merge remote-tracking branch 'origin/master' into develop 2018-05-01 18:18:51 +02:00
Matthew Honnibal 8562faeb39 Fix conll2017 fab command 2018-05-01 18:04:58 +02:00
Matthew Honnibal 116ae46802 Improve experiment management 2018-05-01 17:51:22 +02:00
Matthew Honnibal 6d0fe67b72 Constrain subtok label to adjacent tokens 2018-05-01 17:34:27 +02:00
Matthew Honnibal 8f21953fc5 Constrain subtok to adjacent words 2018-05-01 17:29:00 +02:00
Matthew Honnibal b43bfd3524 Fix arc-eager oracle tests 2018-05-01 16:16:14 +02:00
Matthew Honnibal 31ed64e9b0 Fix textcat test 2018-05-01 15:18:39 +02:00
Matthew Honnibal 548bdff943 Update default Adam settings 2018-05-01 15:18:20 +02:00
Matthew Honnibal adbb1f7533 Add better arc-eager oracle tests 2018-05-01 15:14:55 +02:00
Matthew Honnibal 697bcaa34f Add some methods to ArcEager that make testing easier 2018-05-01 15:13:14 +02:00
Matthew Honnibal a5f6d69f8a Require new dev build of Thinc 2018-05-01 15:05:00 +02:00
Mr Roboto 6f5ccda19c Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230)
* Fixes issue #2228

* Adds a new contributor
2018-05-01 13:40:22 +02:00
Matthew Honnibal d44bb45c72 Fix scoring if tokenization changes 2018-05-01 01:33:20 +02:00