11649 Commits

Author SHA1 Message Date
Adriane Boyd 971826a96d
Include git commit in package and model meta (#5694)
* Include git commit in package and model meta

* Rewrite to read file in setup

* Fix file handle
2020-07-02 17:10:27 +02:00
Adriane Boyd 2bd78c39e3
Fix multiple context managers in examples (#5690) 2020-07-02 10:36:07 +02:00
Ines Montani 6bc643d2e2 Update netlify.toml [ci skip] 2020-07-01 21:34:17 +02:00
Ines Montani f2a932a60c Update netlify.toml [ci skip] 2020-07-01 13:34:35 +02:00
Álvaro Abella Bascarán ff0dbe5c64
Fix in docs: pipe(docs) instead of pipe(texts) (#5680)
Very minor fix in docs, specifically in this part:

```
matcher = PhraseMatcher(nlp.vocab)
for doc in matcher.pipe(texts, batch_size=50):
    pass
```

`texts` suggests the input is an iterable of strings. I replaced it with `docs`.
2020-06-30 20:00:50 +02:00
Matthias Hertel 8b0f749606
Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:23 +02:00
Matthew Honnibal 2d715451a2
Revert "Convert custom user_data to token extension format for Japanese tokenizer (#5652)" (#5665)
This reverts commit 1dd38191ec.
2020-06-29 14:34:15 +02:00
Adriane Boyd 1dd38191ec
Convert custom user_data to token extension format for Japanese tokenizer (#5652)
* Convert custom user_data to token extension format

Convert the user_data values so that they can be loaded as custom token
extensions for `inflection`, `reading_form`, `sub_tokens`, and `lemma`.

* Reset Underscore state in ja tokenizer tests
2020-06-29 14:20:26 +02:00
Adriane Boyd 167df42cb6
Move lemmatizer is_base_form to language settings (#5663)
Move `Lemmatizer.is_base_form` to the language settings so that each
language can provide a language-specific method as
`LanguageDefaults.is_base_form`.

The existing English-specific `Lemmatizer.is_base_form` is moved to
`EnglishDefaults`.
2020-06-29 14:16:57 +02:00
Adriane Boyd c4d0209472
Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:12:29 +02:00
PluieElectrique 90c7eb0e2f
Reduce memory usage of Lookup's BloomFilter (#5606)
* Reduce memory usage of Lookup's BloomFilter

* Remove extra Table update
2020-06-26 14:09:10 +02:00
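For context on 90c7eb0e2f: a Bloom filter's memory footprint is set by the size of its bit array, which is commonly derived from the expected number of items and a target false-positive rate. The following is a minimal, generic sketch of that sizing. It is not the filter used by spaCy's `Lookups`; the class name `TinyBloom` and its parameters are hypothetical.

```python
import hashlib
import math


class TinyBloom:
    """Illustrative Bloom filter (hypothetical; not spaCy's implementation)."""

    def __init__(self, n_items, error_rate=1e-4):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        n_items = max(1, n_items)
        self.n_bits = max(8, math.ceil(-n_items * math.log(error_rate) / math.log(2) ** 2))
        self.n_hashes = max(1, round(self.n_bits / n_items * math.log(2)))
        self.bits = bytearray((self.n_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(key.encode("utf8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# A filter sized for few items allocates far fewer bytes than one sized for many.
small = TinyBloom(n_items=1_000)
large = TinyBloom(n_items=1_000_000)
for word in ("colour", "organise"):
    small.add(word)
```

Sizing the filter to the actual number of keys, rather than to a fixed large capacity, is one general way to keep its memory use down.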
Adriane Boyd b7107ac89f
Disregard special tag _SP in check for new tag map (#5641)
* Skip special tag `_SP` in check for new tag map

In `Tagger.begin_training()` check for new tags aside from `_SP` in the
new tag map initialized from the provided gold tuples when determining
whether to reinitialize the morphology with the new tag map.

* Simplify _SP check
2020-06-26 09:23:21 +02:00
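The check described in b7107ac89f can be sketched as a set difference that ignores `_SP`. This is an illustration only, not the code in `Tagger.begin_training()`; the function name `has_new_tags` and the sample tag maps are hypothetical.

```python
def has_new_tags(orig_tag_map, new_tag_map, ignored=("_SP",)):
    """Return True if new_tag_map contains tags that are absent from
    orig_tag_map, not counting the ignored special tags."""
    new_tags = set(new_tag_map) - set(orig_tag_map) - set(ignored)
    return bool(new_tags)


# Hypothetical tag maps for illustration.
orig = {"NN": {"pos": "NOUN"}, "VB": {"pos": "VERB"}}
only_sp = {"NN": {"pos": "NOUN"}, "_SP": {"pos": "SPACE"}}
with_new = {"NN": {"pos": "NOUN"}, "JJ": {"pos": "ADJ"}, "_SP": {"pos": "SPACE"}}
```

Under this sketch, a tag map that differs from the original only by `_SP` would not trigger reinitialization, while one that adds `JJ` would.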
Adriane Boyd fd4287c178
Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:12 +02:00
Adriane Boyd 6fe6e761de
Skip vocab in component config overrides (#5624) 2020-06-23 23:21:11 +02:00
Adriane Boyd 7ce451c211
Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:48:59 +02:00
Adriane Boyd d94e961f14
Fix polarity of Token.is_oov and Lexeme.is_oov (#5634)
Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the
lexeme does **not** have a vector.
2020-06-23 13:29:51 +02:00
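The corrected polarity from d94e961f14 can be illustrated with a toy model: a lexeme is out-of-vocabulary exactly when no vector is available for it. The class `ToyLexeme` and the sample vectors are hypothetical and do not use spaCy's API.

```python
class ToyLexeme:
    """Toy stand-in for a lexeme (hypothetical; not spaCy's Lexeme)."""

    def __init__(self, text, vectors):
        self.text = text
        self._vectors = vectors

    @property
    def has_vector(self):
        return self.text in self._vectors

    @property
    def is_oov(self):
        # Out-of-vocabulary means the lexeme does NOT have a vector.
        return not self.has_vector


vectors = {"apple": [0.1, 0.2]}
known = ToyLexeme("apple", vectors)
unknown = ToyLexeme("qwzx", vectors)
```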
Richard Liaw 0ef78bad93
contribute (#5632) 2020-06-23 08:53:58 +02:00
Adriane Boyd bc1cb30b21
Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:37:24 +02:00
Hiroshi Matsuda 150a39ccca
Japanese model: add user_dict entries and small refactor (#5573)
* user_dict fields: adding inflections, reading_forms, sub_tokens
deleting: unidic_tags
improve code readability around the token alignment procedure

* add test cases, replace fugashi with sudachipy in conftest

* move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer

* tag is space -> both surface and tag are spaces

* consider len(text)==0
2020-06-22 14:32:25 +02:00
Rameshh c34420794a
Add Nepali Language (#5622)
* added support for nepali lang

* added examples and test files

* added spacy contributor agreement
2020-06-22 10:25:46 +02:00
Karen Hambardzumyan 66a4834e56
Some changes for Armenian (#5616)
* Fixing numericals

* We need an Armenian question mark to make the sentence a question
2020-06-22 08:50:34 +02:00
Karen Hambardzumyan ff6a084e9c
Create mahnerak.md (#5615) 2020-06-20 11:14:26 +02:00
Marat M. Yavrumyan 8120b641cc
Update lex_attrs.py (#5608) 2020-06-19 20:00:34 +02:00
Marat M. Yavrumyan ccd7edf04b
Create myavrum.md (#5612) 2020-06-19 18:34:27 +02:00
Adriane Boyd 931d80de72
Warning for sudachipy 0.4.5 (#5611) 2020-06-19 12:43:41 +02:00
Ines Montani 6d712f3e06
Merge pull request #5599 from adrianeboyd/docs/v2.3.0-minor 2020-06-16 13:49:25 -07:00
Adriane Boyd 02369f91d3 Fix spacy convert argument 2020-06-16 20:41:17 +02:00
Adriane Boyd f0fd77648f Change example title to Dr.
Change example title to Dr. so the current model does exclude the title
in the initial example.
2020-06-16 20:36:21 +02:00
Adriane Boyd a6abdfbc3c Fix numpy.zeros() dtype for Doc.from_array 2020-06-16 20:35:45 +02:00
Adriane Boyd 9aff317ca7 Update POS in tagging example 2020-06-16 20:26:57 +02:00
Adriane Boyd 457babfa0c Update alignment example for new gold.align 2020-06-16 20:22:03 +02:00
Ines Montani 41003a5117 Update Binder version [ci skip] 2020-06-16 17:41:23 +02:00
Ines Montani fd89f44c0c Update Binder URL [ci skip] 2020-06-16 17:34:26 +02:00
Ines Montani 44af53bdd9 Add pkuseg warnings and auto-format [ci skip] 2020-06-16 17:13:35 +02:00
Ines Montani a9e5b840ee Fix typos and auto-format [ci skip] 2020-06-16 16:38:45 +02:00
Ines Montani 1d3e8b7578
Merge pull request #5595 from explosion/v2.3.x 2020-06-16 07:37:10 -07:00
Ines Montani e9d3e177f0 Merge branch 'master' into v2.3.x 2020-06-16 16:31:38 +02:00
Ines Montani bb54f54369 Fix model accuracy table [ci skip] 2020-06-16 16:10:12 +02:00
Adriane Boyd d5110ffbf2
Documentation updates for v2.3.0 (#5593)
* Update website models for v2.3.0

* Add docs for Chinese word segmentation

* Tighten up Chinese docs section

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Auto-format and update version

* Update matcher.md

* Update languages and sorting

* Typo in landing page

* Infobox about token_match behavior

* Add meta and basic docs for Japanese

* POS -> TAG in models table

* Add info about lookups for normalization

* Updates to API docs for v2.3

* Update adding norm exceptions for adding languages

* Add --omit-extra-lookups to CLI API docs

* Add initial draft of "What's New in v2.3"

* Add new in v2.3 tags to Chinese and Japanese sections

* Add tokenizer to migration section

* Add new in v2.3 flags to init-model

* Typo

* More what's new in v2.3

Co-authored-by: Ines Montani <ines@ines.io>
2020-06-16 15:37:35 +02:00
Matthew Honnibal 7ff447c5a0 Set version to v2.3.0 2020-06-15 18:22:25 +02:00
Adriane Boyd 0d8405aafa Updates to docstrings (#5589) 2020-06-15 14:58:36 +02:00
Adriane Boyd e867e9fa8f Fix and add warnings related to spacy-lookups-data (#5588)
* Fix warning message for lemmatization tables

* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
2020-06-15 14:58:29 +02:00
Arvind Srinivasan f698007907 Added Tamil Example Sentences (#5583)
* Added Examples for Tamil Sentences

#### Description
This PR adds example sentences for the Tamil language, which were missing as per issue #1107

#### Type of Change
This is an enhancement.

* Accepting spaCy Contributor Agreement

* Signed on my behalf as an individual
2020-06-15 14:58:21 +02:00
Adriane Boyd c94f7d0e75
Updates to docstrings (#5589) 2020-06-15 14:56:51 +02:00
Adriane Boyd c482f20778
Fix and add warnings related to spacy-lookups-data (#5588)
* Fix warning message for lemmatization tables

* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
2020-06-15 14:56:04 +02:00
Arvind Srinivasan aa5b40fa64
Added Tamil Example Sentences (#5583)
* Added Examples for Tamil Sentences

#### Description
This PR adds example sentences for the Tamil language, which were missing as per issue #1107

#### Type of Change
This is an enhancement.

* Accepting spaCy Contributor Agreement

* Signed on my behalf as an individual
2020-06-13 15:56:26 +02:00
theudas 3f5e2f9d99 Added Parameter to NEL to take n sentences into account (#5548)
* added setting for neighbour sentence in NEL

* added spaCy contributor agreement

* added multi sentence also for training

* made the try-except block smaller
2020-06-12 15:15:03 +02:00
adrianeboyd 4724fa4cf4 Expand Japanese requirements warning (#5572)
Include explicit install instructions in Japanese requirements warning.
2020-06-12 15:14:55 +02:00
adrianeboyd 44967a3f9c Update pytest conf for sudachipy with Japanese (#5574) 2020-06-12 15:14:47 +02:00
theudas fa46e0bef2
Added Parameter to NEL to take n sentences into account (#5548)
* added setting for neighbour sentence in NEL

* added spaCy contributor agreement

* added multi sentence also for training

* made the try-except block smaller
2020-06-12 02:03:23 +02:00
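One way to read the neighbour-sentence setting from #5548 is as a window of sentences around the one containing the entity mention. The sketch below assumes a symmetric window, which the commit message does not state; the function name `context_window` and the parameter `n_sents` are hypothetical.

```python
def context_window(sentences, mention_index, n_sents=0):
    """Return the sentence at mention_index plus up to n_sents
    neighbouring sentences on each side (assumed symmetric window)."""
    start = max(0, mention_index - n_sents)
    end = min(len(sentences), mention_index + n_sents + 1)
    return sentences[start:end]


sents = ["A.", "B.", "C.", "D."]
single = context_window(sents, 1, n_sents=0)
wider = context_window(sents, 1, n_sents=1)
clipped = context_window(sents, 0, n_sents=2)
```

With `n_sents=0` only the mention's own sentence is used; larger values widen the context, clipped at the document boundaries.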