Commit Graph

318 Commits

Author SHA1 Message Date
ines 02aefe7cc0 Merge branch 'master' into develop 2018-07-18 18:52:59 +02:00
Ioannis Daras 6ed18412d0 Greek language optimizations (#2558)
* Greek language optimizations

* Add encoding on files containing greek words

* Add encoding on files containing greek words
2018-07-18 18:51:38 +02:00
ines 80e7485630 Merge branch 'master' into develop 2018-07-18 17:28:47 +02:00
Paul O'Leary McCann 61ef0739b8 Add Japanese stop words. (#2549)
List created by taking the 2000 top words from a Wikipedia dump and
removing everything that wasn't hiragana.

Tried going through kanji words and deciding what to keep but there were
too many obvious non-stopwords (東京 was in the top 500) and many other
words where it wasn't clear if they should be included or not.
2018-07-17 10:12:48 +02:00
Tero K f35980f865 Enhancement/lang fi examples (#2547)
* Added a file with examples in finnish

* added contributor agreement
2018-07-15 09:50:27 +02:00
Paul O'Leary McCann 1987f3f784 Add Japanese lemmas (#2543)
This info was already available from Mecab, forgot to add it before.
2018-07-13 10:55:14 +02:00
ines 3a321e79ac Merge branch 'master' into develop 2018-07-10 13:49:08 +02:00
Eleni170 6042723535 Add support for Greek language (#2535)
* Add contributor agreement

* Support for Greek language

* Fix missing el_tokenizer
2018-07-10 13:48:38 +02:00
Stefan Schweter 3dfc7f86be lemmatizer: correct lemma for Rang (#2537)
<!--- Provide a general summary of your changes in the title. -->

## Description

This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang").

### Types of change

The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-10 13:11:19 +02:00
ines fd6207426a Merge branch 'master' into develop 2018-07-09 18:05:10 +02:00
Duygu Altinok 00b9a58558 German lemmatizer additions (#2529)
* lemma of was-> was

* added new pairs issue @2486

* added article tests
2018-07-09 11:10:15 +02:00
ines 63666af328 Merge branch 'master' into develop 2018-07-04 14:52:25 +02:00
Muhammad Irfan f33c703066 Add Urdu Language Support (#2430)
* added Urdu language support.

* added Urdu language tests.

* modified conftest.py for Urdu language support.

* added spacy contributor agreement.
2018-06-22 11:14:03 +02:00
himkt 14d9007efd fix wrong indexing (#2416)
* fix wrong indexing

* add agreement
2018-06-19 10:20:57 +02:00
Aliia E 428bae66b5 Add Tatar Language Support (#2444)
* add Tatar lang support

* add Tatar letters

* add Tatar tests

* sign contributor agreement

* sign contributor agreement [x]

* remove comments from Language class

* remove all template comments
2018-06-19 10:17:53 +02:00
ines 778e5f4da3 Merge branch 'master' into develop 2018-06-11 00:38:04 +02:00
Nour Shalabi a169b79092 Additions to Arabic stop words. (#2422)
* Additions to Arabic stop words.

* Create nourshalabi.md
2018-06-08 02:33:23 +02:00
ines 330c039106 Merge branch 'master' into develop 2018-05-26 18:30:52 +02:00
Aristo Rinjuang 432ede04af adding more words and rephrasing (#2351)
* adding more words and rephrasing

* adding a contributor

* tokenizer bugs solved
2018-05-24 11:40:57 +02:00
Jani Monoses ec62cadf4c Updates to Romanian support (#2354)
* Add back Romanian in conftest

* Romanian lex_attr

* More tokenizer exceptions for Romanian

* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Matthew Honnibal 546dd99cdf Merge master into develop -- mostly Arabic and website 2018-05-15 18:14:28 +02:00
Tahar Zanouda 00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses 0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Jani Monoses 42b34832e4 Update Romanian stopword list (#2316)
* Contributor agreement for janimo

* Update Romanian stopword list

Include the correct spellings of all the words already in the repo
that are using cedillas (ş and ţ) instead of commas (ș and ț).

Add another unrelated spelling fix.

See https://github.com/stopwords-iso/stopwords-ro/pull/1 and
https://github.com/stopwords-iso/stopwords-ro/pull/2
2018-05-10 12:16:56 +02:00
Lucas Abbade be7fdc59d1 Update lex_attrs.py (#2307)
* Update lex_attrs.py

Fixed spelling mistakes of some numbers (according to Brazilian Portuguese).

* Update lex_attrs.py

As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese.

I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.
2018-05-09 20:49:31 +02:00
mauryaland 5368ba028a Update stop_words.py for French language (#2310)
* Add contraction forms of some common stopwords

All the stopwords added contain the apostrophe" ' "or " ’ ".

* Adds contributor agreement mauryaland

* Update mauryaland.md
2018-05-09 12:04:38 +02:00
Paul O'Leary McCann bd72fbf09c Port Japanese mecab tokenizer from v1 (#2036)
* Port Japanese mecab tokenizer from v1

This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.

As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.

Things to check:

1. Is this the right way to use a token extension?

2. What's the right way to implement a JapaneseTagger? The approach in
 #1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?

-POLM

* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
Matthew Honnibal 2c4a6d66fa Merge master into develop. Big merge, many conflicts -- need to review 2018-04-29 14:49:26 +02:00
Robin Linderborg 1f9904ef12 fixes #2238 (#2241)
* Remove erroneous lemma lookup år > åra in Swedish

* Add contributors agreement

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:55:22 +02:00
Robin Linderborg d01f503b54 Remove incorrect lemma lookup gäng->gänga (#2252)
* Remove incorrect lemma lookup gäng->gänga
In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread".

* Add contrib agreement to correct directory

* Revert change to CONTRIBUTOR_AGREEMENT
2018-04-28 14:54:41 +02:00
ines 686225eadd Fix Spanish noun_chunks (resolves #2210)
Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets
2018-04-18 18:44:01 -04:00
Jens Dahl Møllerhøj e5055e3cf6 Add Danish lemmatizer (#2184)
* add danish lemmatizer

* fill contributor agreement
2018-04-07 19:07:28 +02:00
Matthew Honnibal 21047bde52 Fix syntax error in italian lemmatizer 2018-04-03 23:13:22 +02:00
Viet Trung Tran ea2af94cd9 Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155)
* support for Vietnamese

* Contributor Agreement for adding Vietnamese support on spaCy
2018-03-29 12:19:51 +02:00
ines 11c4735ccf Fix issue in Italian lemmatizer data (resolves #2050) 2018-03-27 23:55:22 +02:00
Matthew Honnibal 1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Ines Montani 68226109f4
Merge pull request #2142 from jimregan/polish-more-tokens
more exceptions
2018-03-24 19:06:44 +01:00
Matthew Honnibal 0d3bf0d4eb Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-24 17:31:49 +01:00
dejanmarich ccd1c04c63 Update stop_words.py
Added more words
2018-03-24 17:31:24 +01:00
ines f1446b0257 Port over Turkish changes 2018-03-24 17:31:07 +01:00
DuyguA cd604878a4 quick typo fix 2018-03-24 17:26:35 +01:00
Jim O'Regan efe037e8be more exceptions 2018-03-24 00:05:27 +00:00
alldefector f4e5904fc2 Fix Spanish noun_chunks failure caused by typo 2018-03-14 17:03:17 +01:00
Ines Montani 14e7e0f12a
Merge pull request #2000 from jimregan/polish-tag-map
Polish tag map
2018-02-18 19:05:58 +01:00
Matthew Honnibal eb3040ce46
Merge pull request #1891 from fucking-signup/master
Fix issue #1889
2018-02-18 13:47:47 +01:00
4altinok 94fb0b75e3 code for is_currency 2018-02-11 18:51:32 +01:00
Ines Montani 0954e15dda
Merge pull request #1913 from ohenrik/nb_syntax_iterator
Norwegian Language (nb) - Added french syntax iterator with explanation
2018-02-06 04:59:07 +01:00
Ole Henrik Skogstrøm 251a7805fe Copied French syntax iterator to simplify future changes 2018-02-05 14:45:05 +01:00
ines f1d3deffac Add Russian example sentences (see #1107) 2018-02-01 20:09:40 +01:00
Ole Henrik Skogstrøm e40465487c Added french syntax iterator with explenation 2018-01-30 15:44:29 +01:00