8921 Commits

Author SHA1 Message Date
mauryaland 214c2ec263 check if argument flat is true or not (#3156) 2019-01-14 23:47:05 +01:00
Loghi d97661d18b Tamil language support (#3154)
Tamil language support to spaCy
Description

Hereby, creating new PR to add support for Tamil language in spaCy

    added stop words, examples and numerical attributes
    <--Working on other language data-->

Types of change

Enhancement
Checklist

    [x] I have submitted the spaCy Contributor Agreement.
    [x] I ran the tests, and all new and existing tests passed.
    [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-01-14 15:32:30 +01:00
Hunter Kelly f28a1c7271 Update call to `mkdir()` to create the parents (#3139)
* Update call to `mkdir()` to create the parents

- Update the call to `output_dir.mkdir()` to also create the parents if needed

* don't automatically create parents but fail fast if cannot create directory

* add signed contributors agreement for retnuh
2019-01-11 03:02:18 +01:00
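
To illustrate the fail-fast behavior described in the entry above (#3139), here is a minimal sketch using `pathlib`. It is not the code from that commit: the helper name `ensure_output_dir` is hypothetical, and the only assumption is that the output directory is created with a plain `mkdir()` call, so a missing parent raises an error instead of being created silently.

```python
from pathlib import Path
import tempfile


def ensure_output_dir(output_dir):
    """Create output_dir if it is missing, without creating parent directories.

    Hypothetical helper for illustration only; not spaCy's actual code.
    """
    output_dir = Path(output_dir)
    if not output_dir.exists():
        # parents defaults to False, so a missing parent directory raises
        # FileNotFoundError right away instead of being created silently.
        output_dir.mkdir()
    return output_dir


with tempfile.TemporaryDirectory() as tmp:
    # The parent exists, so the directory is created.
    created = ensure_output_dir(Path(tmp) / "model")
    # The parent "missing" does not exist, so the call fails fast.
    try:
        ensure_output_dir(Path(tmp) / "missing" / "model")
        failed_fast = False
    except FileNotFoundError:
        failed_fast = True
```

Passing `parents=True` to `mkdir()` would instead create the whole chain of directories, which is the behavior the first bullet of the commit proposed and the second bullet backed away from.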
Amandine Périnet ee24e2534d French lemmatization: adding lemmas for adverbs and irregular lemmas for function words (#3131)
* adding adverbs and irregular cases for empty words

* adding adverbs and irregular cases for empty words

* adding adverbs and irregular cases for empty words

* updating contributor agreement for amperinet
2019-01-10 15:41:15 +01:00
Kirill Bulygin 7b064542f7 Making `lang/th/test_tokenizer.py` pass by creating `ThaiTokenizer` (#3078) 2019-01-10 15:40:37 +01:00
Álvaro Abella Bascarán 1cd8f9823f Correct docs of `Token.subtree` and `Span.subtree` (issue #3122) (#3124)
* solve inconsistency between docs and Span.subtree (issue #3122)

* solve inconsistency between docs and Token.subtree (issue #3122)
2019-01-09 03:11:15 +01:00
Amandine Périnet eef11a7a2c French lemmatization: correcting wrong lemmas in the lookup dictionary (#3104)
* modifying French lookup that contained wrong lemmas

* correcting wrong line breaks on hyphen

* adding contributor agreement for amperinet@

* correcting a typo
2019-01-07 14:15:19 +01:00
Ines Montani ac8487a96a Adjust pytest pin
Somehow, 4.1.x seems to cause test failure due to get_marker – possibly needs to be investigated for spacy-models/tests and likely not relevant on develop anymore
2019-01-07 12:08:19 +01:00
Álvaro Abella Bascarán e03e1eee92 Bugfix/get lca matrix (#3110)
This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!).

## Description
The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed.

### Types of change
Bug fix

## Checklist

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-01-06 19:07:50 +01:00
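
The `j > 0` case mentioned in the entry above can be illustrated with a small pure-Python sketch of a lowest-common-ancestor matrix over a sub-range of tokens. This is not spaCy's Cython implementation: the functions `ancestors` and `lca_matrix` and the `heads` list representation (each token stores the absolute index of its head, and the root points to itself) are assumptions made for the example. The point it shows is that indices must be shifted by the span's start offset, and that an ancestor lying outside the span is reported as `-1`.

```python
def ancestors(heads, i):
    """Return token i followed by its chain of heads, up to the root."""
    chain = [i]
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain


def lca_matrix(heads, start, end):
    """LCA matrix for tokens start..end-1, indexed relative to start.

    Entries are -1 when the lowest common ancestor lies outside the range.
    """
    n = end - start
    matrix = [[-1] * n for _ in range(n)]
    for a in range(start, end):
        chain_a = ancestors(heads, a)
        for b in range(start, end):
            chain_b = set(ancestors(heads, b))
            # The first token on a's path to the root that is also an
            # ancestor of b is their lowest common ancestor.
            lca = next((t for t in chain_a if t in chain_b), -1)
            if start <= lca < end:
                # Shift from document indices to span-relative indices.
                matrix[a - start][b - start] = lca - start
    return matrix


# "This is a test": This -> is, is -> is (root), a -> test, test -> is
heads = [1, 1, 3, 1]
whole_doc = lca_matrix(heads, 0, 4)
a_test = lca_matrix(heads, 2, 4)      # span starting at j = 2
# Two siblings whose shared head (token 1) is outside the span.
siblings = lca_matrix([1, 1, 1, 1], 2, 4)
```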
alvations 9972716e01 Create alvations.md (#3119) 2019-01-05 13:11:06 +01:00
alvations f43338a4c5 Joblib site has moved. (#3118) 2019-01-05 13:10:54 +01:00
Álvaro Abella Bascarán 6fe276f85d Fix issue 2396 (#3089)
* Test on #2396: bug in Doc.get_lca_matrix()

* reimplementation of Doc.get_lca_matrix(), (closes #2396)

* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix

* use memory view instead of np.ndarray in _get_lca_matrix (faster)

* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview

* cleaner conditional, add comment
2018-12-29 18:02:26 +01:00
Sofie b7916fffcf Fixing few typos in the documentation (#3103)
* few typos / small grammatical errors corrected in documentation

* one more typo

* one last typo
2018-12-28 15:52:26 +01:00
Will Price 4a6af0852a Improve random prefix generation in displaCy arcs (#3096)
* Improve random prefix generation in displaCy arcs

* Add @willprice contributor agreement
2018-12-27 14:46:02 +01:00
Özcan Kasal b573ebca77 trilyon forgotten (#3083)
* trilyon forgotten

* contributor added
2018-12-27 14:44:23 +01:00
Ines Montani 2dc6c52ccc Update displayed Binder version (see #3077) [ci skip] 2018-12-20 17:36:19 +01:00
Muhammad Irfan 2e84ec1513 Fixed ISO code for Urdu. (#3073) 2018-12-20 12:28:53 +01:00
Kirill Bulygin 10189d9092 Fix the first `nlp` call for `ja` (closes #2901) (#3065)
* Fix the first `nlp` call for `ja` (closes #2901)

* Add unicode declaration, formatting and use relative import
2018-12-18 14:53:50 +01:00
Brixjohn 52f3c95004 Added alpha support for Tagalog language (#3062)
I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language, Filipino. I have based the format heavily on the EN and ES languages.

I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to their Tagalog counterparts, added some tokenizer exceptions, and kept the tag map the same as for English.

While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases.

* Added alpha support for Tagalog language

* Edited contributor template

* Included SCA; Reverted templates

* Fixed SCA template

* Fixed changes in SCA template
2018-12-18 13:08:38 +01:00
Ines Montani c9a89bba50 Don't call begin_training if updating new model (see #3059) [ci skip] 2018-12-17 13:45:28 +01:00
Ines Montani 6f1438b5d9 Auto-format example 2018-12-17 13:44:38 +01:00
Amandine Périnet 361554f629 Lemmatization of Adjectives - French : adding rules and vocabulary (#3045)
* modifying FR lemmatisation for Adjectives

* adding contributor agreement for amperinet

* correcting some errors in vocabulary files
2018-12-16 18:11:07 +01:00
Shooter23 6ae8e49bff Fix docstring for is_right_punct(). (#3044) 2018-12-14 10:11:11 +01:00
Matthew Honnibal e5685d98a2 Fix averaging in textcat example (closes #2745) (#3032) [ci skip] 2018-12-08 13:27:05 +01:00
Ines Montani 8c0f0f50bc Use nlp.make_doc instead of nlp for patterns [ci skip] 2018-12-08 11:56:01 +01:00
Paul O'Leary McCann 7dd21b66d5 Extras require mecab (#3024)
* Add note that Unidic is required for Japanese

This addresses #3001. -POLM

* Add extras_require for mecab with old version

Related to issue #3018.

* mecab → ja

Co-Authored-By: polm <polm@dampfkraft.com>
2018-12-08 06:34:49 +01:00
Aki Ariga 7fcd6419ff Update the document for Unidic link with latest version URL (#3022)
* Update Unidic link for latest version in document

This patch improves #3017. The link for Unidic pointed to an old version, so it now points to the latest version.

* Add contributor agreement

* Use more specific link for unidic-cwj
2018-12-07 17:24:48 +01:00
Amandine Périnet 0b44ea23bd Lemmatization of Nouns - French : adding rules and vocabulary (#2992)
* modifying FR lemmatization for nouns

* modifying FR lemmatization for nouns

* adding contributor agreement for amperinet

* adding rules for words with inclusive parentheses wrongly tokenized

* adding contributor agreement for amperinet

* adding a missing comma
2018-12-06 22:42:18 +01:00
Ines Montani 27905a7b14 Remove reference to cuda10 in docs (closes #2894) [ci skip] 2018-12-06 16:05:37 +01:00
Gavriel Loria 9c8c4287bf Accept iob2 and allow generic whitespace (#2999)
* accept non-pipe whitespace as delimiter; allow iob2 filename

* added small documentation note for IOB2 allowance

* added contributor agreement
2018-12-06 15:50:25 +01:00
Amandine Périnet 2457318b7a Lemmatization of Verbs - French : adding rules and vocabulary (#3006)
* updating rules and vocabulary for French lemmatization of verbs

* updating the file with French auxiliary verb

* updating rules and vocabulary for French lemmatization of verbs

* adding contributor agreement for amperinet

* adding rules for words with inclusive parentheses wrongly tokenized
2018-12-06 15:49:28 +01:00
Beate Sildnes f0d7e206ec Updated wordforms for Norwegian lemmatizer (#3007)
* Updated wordforms for Norwegian lemmatizer

Upload of updated lists of wordforms for the Norwegian lemmatizer (nouns, verbs, adverbs, adjectives and lookup).

* Add spaCy contributor agreement for user beatesi

*  Updated wordforms for Norwegian lemmatizer
2018-12-06 15:46:18 +01:00
Paul O'Leary McCann b36f6eabfb Add note that Unidic is required for Japanese (#3017)
This addresses #3001. -POLM
2018-12-06 15:14:10 +01:00
Gavriel Loria ae5601beae Initialize trues to 0.0 in training example (#3004)
* added contributor agreement

* if there are no true positives, precision should be 0.0
2018-12-03 01:33:22 +01:00
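
The rule stated in the entry above (no true positives means a precision of 0.0) can be sketched as follows. This is an illustration, not the training example touched by the commit: the function names `count_outcomes` and `precision_recall_f` are hypothetical, and the sketch assumes binary gold labels and predictions.

```python
def count_outcomes(gold, predicted):
    """Count true positives, false positives and false negatives."""
    tp = fp = fn = 0.0  # counters start at exactly 0.0
    for g, p in zip(gold, predicted):
        if p and g:
            tp += 1
        elif p and not g:
            fp += 1
        elif g and not p:
            fn += 1
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score), using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, f_score


# No prediction matches a gold positive, so every score is 0.0.
no_hits = precision_recall_f(*count_outcomes([True, True, False],
                                             [False, False, True]))
# Two of four positive predictions are correct and nothing is missed.
some_hits = precision_recall_f(2.0, 2.0, 0.0)
```

Guarding each division explicitly avoids the need for a small non-zero starting value in the counters, which would otherwise leak into the reported scores.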
Justin DuJardin 33fca8672f fix issue compiling the latest spacy on MacOS 10.3.6 (#2998) 2018-12-02 05:51:11 +01:00
Matthew Honnibal bbaca991ba Set version to v2.0.18 2018-12-01 03:35:09 +01:00
Matthew Honnibal 05b2336ffa Try again to fix OSX build 2018-12-01 03:12:21 +01:00
Matthew Honnibal e1a4b0d7f7 Set version to v2.0.18.dev1 2018-12-01 03:12:12 +01:00
Matthew Honnibal 413530b269 Set version to 2.0.18 2018-12-01 03:00:27 +01:00
Matthew Honnibal 24d52876e1 Set version to v2.0.18.dev0 2018-12-01 02:38:04 +01:00
Matthew Honnibal 4895b2e830 Merge branch 'master' of https://github.com/explosion/spaCy 2018-12-01 02:37:21 +01:00
Matthew Honnibal 3f16af123e Try to fix OSX build error 2018-12-01 02:36:56 +01:00
Matthew Honnibal 61abb1ef70 Remove msgpack dependency, to try to fix #2995 2018-12-01 02:36:41 +01:00
Ines Montani add6469225 Add "new in v2.0.12" note to Span.ents (closes #2986) 2018-11-30 20:50:55 +01:00
Ines Montani c9bdeafbc7 Don't run weird failing test for now 2018-11-30 16:13:40 +01:00
wxv 06820ef6e7 Fix is_ascii documentation and create contributor file (#2988)
Proposed in #2933
2018-11-30 15:57:58 +01:00
Sofie 585de273cd Fix small typo bug in French regexp + relevant unit test (#2980)
* additional unit test for new entr word not in other lists

* bugfix - unit test works

* use _latin_lower instead of alpha_lower for french

* revert back to ALPHA_LOWER (following the code for languages)

* contributor agreement
2018-11-29 20:16:13 +01:00
Ben Batorsky 658f7e0dc8 OntoNotes url fix (#2981)
The website for OntoNotes 5 is: https://catalog.ldc.upenn.edu/LDC2013T19, currently the named entity section has it as https://catalog.ldc.upenn.edu/ldc2013T19.
2018-11-29 19:34:30 +01:00
Adam Schwalm 00566949de Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977)
Fixes #2976
2018-11-28 19:49:33 +01:00
Ines Montani 58757c5684 Update README.rst 2018-11-26 20:56:17 +01:00