Commit Graph

987 Commits

Author SHA1 Message Date
svlandeg e8fd0c1f1e EL architectures documentation 2020-08-06 17:41:26 +02:00
svlandeg f396f091dc update EL API 2020-08-06 16:40:48 +02:00
svlandeg 81d0b1c390 update EL pipe arguments 2020-08-06 16:22:50 +02:00
svlandeg 0b4d1e1bc4 'debug data' instead of 'debug-data' 2020-08-06 15:47:31 +02:00
svlandeg 881e3f8fd0 add docbin explanation and example 2020-08-06 15:29:44 +02:00
Ines Montani 5d417d3b19 WIP: Update docs [ci skip] 2020-08-06 13:10:15 +02:00
Ines Montani 06e80d95cd
Sync develop with nightly docs state (#5883)
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-08-06 00:28:14 +02:00
Ines Montani 5cc0d89fad
Simplify config overrides in CLI and deserialization (#5880) 2020-08-05 23:35:09 +02:00
Ines Montani 50311a4d37 Update docs [ci skip] 2020-08-05 20:29:53 +02:00
Ines Montani 2a4d56e730 Update docs 2020-08-05 15:01:00 +02:00
Ines Montani cdec46493f Update docs 2020-08-05 15:00:54 +02:00
Ines Montani 4c055f0aa7
Add init CLI and init config (#5854)
* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
2020-08-02 15:18:30 +02:00
Ines Montani b40f44419b Simplify pipe analysis
- remove unused code
- don't print by default
- integrate attrs info into analysis output
2020-08-01 13:40:06 +02:00
Ines Montani 98c6a85c8b Update docs [ci skip] 2020-07-31 18:55:38 +02:00
Ines Montani e9e8fa2466 Update docs and types 2020-07-31 17:02:54 +02:00
Ines Montani 6365837ca9
Merge pull request #5833 from explosion/feature/scorer-adjustments 2020-07-31 14:00:39 +02:00
Ines Montani 5a221f79c2 Revert "Remove keyword-only from Scorer API docs" [ci skip]
This reverts commit 7a6ac47dc1.
2020-07-31 14:00:21 +02:00
Ines Montani 160f1a5f94 Update docs [ci skip] 2020-07-31 13:26:39 +02:00
Adriane Boyd 9b509aa87f Move Language.evaluate scorer config to new arg
Move `Language.evaluate` scorer config from `component_cfg` to separate
argument `scorer_cfg`.
2020-07-31 11:05:16 +02:00
Adriane Boyd 9d79916792 Merge branch 'develop' into feature/scorer-adjustments 2020-07-31 10:48:14 +02:00
Ines Montani 3449c45fd9 Update docs [ci skip] 2020-07-29 19:48:26 +02:00
Ines Montani 9c80cb673d Update docs [ci skip] 2020-07-29 19:41:34 +02:00
Ines Montani 9f69afdd1e Update docs [ci skip] 2020-07-29 19:09:44 +02:00
Ines Montani 7a21775cd0
Merge pull request #5834 from explosion/feature/vectors 2020-07-29 18:49:26 +02:00
Ines Montani 6a5c853edb Fix docs [ci skip] 2020-07-29 18:45:12 +02:00
Ines Montani 158d8c1e48 Update docs [ci skip] 2020-07-29 18:44:10 +02:00
Matthew Honnibal f7adc9d3b7 Start rewriting vectors docs 2020-07-29 17:10:06 +02:00
Ines Montani b0f57a0cac Update docs and consistency 2020-07-29 15:14:07 +02:00
Ines Montani e0ffe36e79 Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
Adriane Boyd 7a6ac47dc1 Remove keyword-only from Scorer API docs 2020-07-29 10:40:30 +02:00
Ines Montani ac24adec73 Small adjustments to Scorer and docs 2020-07-28 21:39:42 +02:00
Ines Montani 256b24b720 Update arch docs WIP [ci skip] 2020-07-28 20:33:52 +02:00
Ines Montani ae4d8a6ffd Update docstrings, docs and pipe consistency 2020-07-28 13:37:31 +02:00
Ines Montani 0094cb0d04 Remove scores list from config and document 2020-07-28 11:22:24 +02:00
Ines Montani 894e20c466 Merge branch 'develop' into feature/component-scores 2020-07-27 18:14:39 +02:00
Ines Montani d8b519c23c API docs, docstrings and argument consistency 2020-07-27 18:11:45 +02:00
Ines Montani 10b84e1e27 Add flag to toggle sdist creation on package [ci skip] 2020-07-27 16:52:23 +02:00
Adriane Boyd fdf09cb231 Update Scorer API docs for score_cats 2020-07-27 15:34:42 +02:00
Ines Montani 7dd53d0964 Fix typo [ci skip] 2020-07-27 00:34:00 +02:00
Ines Montani 7adbaf9a5b Update docs [ci skip] 2020-07-27 00:29:45 +02:00
Matthew Honnibal fb5dbe30b5 Trim training 101 2020-07-26 13:43:22 +02:00
Matthew Honnibal e6a7deb7cc Edits to the training 101 section 2020-07-26 13:42:08 +02:00
Ines Montani c288dba8e7 Update docs [ci skip] 2020-07-25 18:51:12 +02:00
Ines Montani eb9acae34d
Merge pull request #5791 from adrianeboyd/docs/morphology 2020-07-25 15:10:21 +02:00
Adriane Boyd 2bcceb80c4
Refactor the Scorer to improve flexibility (#5731)
* Refactor the Scorer to improve flexibility

Refactor the `Scorer` to improve flexibility for arbitrary pipeline
components.

* Individual pipeline components provide their own `evaluate` methods
that score a list of `Example`s and return a dictionary of scores
* `Scorer` is initialized either:
  * with a provided pipeline containing components to be scored
  * with a default pipeline containing the built-in statistical
    components (senter, tagger, morphologizer, parser, ner)
* `Scorer.score` evaluates a list of `Example`s and returns a dictionary
of scores referring to the scores provided by the components in the
pipeline

Significant differences:

* `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc`
and the new `morph_acc`, `pos_acc`, and `lemma_acc`
* Scoring is no longer cumulative: `Scorer.score` scores a list of
examples rather than a single example and does not retain any state
about previously scored examples
* PRF values in the returned scores are no longer multiplied by 100

* Add kwargs to Morphologizer.evaluate

* Create generalized scoring methods in Scorer

* Generalized static scoring methods are added to `Scorer`
  * Methods require an attribute (either on Token or Doc) that is
used to key the returned scores

Naming differences:

* `uas`, `las`, and `las_per_type` in the scores dict are renamed to
`dep_uas`, `dep_las`, and `dep_las_per_type`

Scoring differences:

* `Doc.sents` is now scored as spans rather than on sentence-initial
token positions so that `Doc.sents` and `Doc.ents` can be scored with
the same method (this lowers scores since a single incorrect sentence
start results in two incorrect spans)

* Simplify / extend hasattr check for eval method

* Add hasattr check to tokenizer scoring
* Simplify to hasattr check for component scoring

* Reset Example alignment if docs are set

Reset the Example alignment if either doc is set in case the
tokenization has changed.

* Add PRF tokenization scoring for tokens as spans

Add PRF scores for tokens as character spans. The scores are:

* token_acc: # correct tokens / # gold tokens
* token_p/r/f: PRF for (token.idx, token.idx + len(token))

* Add docstring to Scorer.score_tokenization

* Rename component.evaluate() to component.score()

* Update Scorer API docs

* Update scoring for positive_label in textcat

* Fix TextCategorizer.score kwargs

* Update Language.evaluate docs

* Update score names in default config
2020-07-25 12:53:02 +02:00
Adriane Boyd 8f44584bef Update MorphAnalysis.get and related examples 2020-07-23 08:51:31 +02:00
Adriane Boyd 941b9e33f7 Add Token.morph_ 2020-07-22 17:59:45 +02:00
Ines Montani be476e495e
Merge pull request #5787 from adrianeboyd/docs/morphologizer
Initial draft of Morphologizer API docs
2020-07-22 17:16:57 +02:00
Adriane Boyd d3385f4be2 Add Morphology and MorphAnalysis to overview 2020-07-21 13:06:22 +02:00
Adriane Boyd fcd3a4abe3 Add morph to Token API docs 2020-07-21 13:05:58 +02:00
Adriane Boyd 14df00ae98 Add Morphology and MorphAnalsysis API docs
Add initial draft of `Morphology` and `MorphAnalysis` API docs.
2020-07-21 10:33:46 +02:00
Ines Montani 644074b954 Merge branch 'develop' into master-tmp 2020-07-20 14:58:04 +02:00
Adriane Boyd 986f7e4d69 Initial draft of Morphologizer API docs 2020-07-20 12:53:02 +02:00
Adriane Boyd 39ebcd9ec9
Refactor Chinese tokenizer configuration (#5736)
* Refactor Chinese tokenizer configuration

Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.

* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting
`segmenter` with the supported values: `char`, `jieba`, `pkuseg`
* make the default segmenter plain character segmentation `char` (no
additional libraries required)

* Fix Chinese serialization test to use char default

* Warn if attempting to customize other segmenter

Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
2020-07-19 13:34:37 +02:00
Adriane Boyd cd5af72c9a
Update pkuseg version (#5774)
* Update pkuseg version in Chinese tokenizer warnings
* Update pkuseg version in `Makefile`
* Remove warning about python3.8 wheels in docs
2020-07-19 11:09:49 +02:00
Ines Montani 872938ec76
Merge pull request #5747 from explosion/feature/refactor-config-args 2020-07-14 00:00:22 +02:00
Ines Montani 5f6f4ff594 Remove object subclassing 2020-07-12 14:03:23 +02:00
Ines Montani c96535e338 Update command docstrings and docs 2020-07-12 13:53:49 +02:00
Ines Montani 3f948b9c74 Update docs 2020-07-12 12:32:28 +02:00
Ines Montani 11bbc82c24 Update cli.md [ci skip] 2020-07-10 23:37:52 +02:00
Ines Montani 9455b060d2 Update cli.md 2020-07-10 22:57:22 +02:00
Ines Montani 7b5717cac3 Merge branch 'develop' into feature/refactor-config-args 2020-07-10 22:50:07 +02:00
Ines Montani e6a6587a9a Update projects.md [ci skip] 2020-07-10 22:41:27 +02:00
Ines Montani f2cd982e7b Update training.md 2020-07-10 22:34:27 +02:00
Ines Montani 52e9b5b472 Fix formatting 2020-07-09 23:25:58 +02:00
Ines Montani 28cdae898a Update projects.md 2020-07-09 22:35:54 +02:00
Ines Montani 7bcf9f7cfb Document new features 2020-07-09 21:10:36 +02:00
Ines Montani ea01831f6a Update projects docs etc. 2020-07-09 19:43:25 +02:00
Ines Montani 175d34d8f9 Update sidebar menu 2020-07-09 11:44:09 +02:00
Ines Montani 9ee5b71412 Update cli.md 2020-07-09 11:44:00 +02:00
Ines Montani 9ae4040183 Update API docs 2020-07-08 13:34:35 +02:00
svlandeg c94279ac1b remove tensors, fix predict, get_loss and set_annotations 2020-07-08 13:11:54 +02:00
svlandeg 90b100c39f remove component.Model, update constructor, losses is return value of update 2020-07-08 12:14:30 +02:00
Ines Montani 2298e129e6 Update example and training docs 2020-07-07 20:30:12 +02:00
svlandeg 2b60e894cb fix component constructors, update, begin_training, reference to GoldParse 2020-07-07 19:17:19 +02:00
svlandeg 14a796e3f9 add Example API with examples of Example usage 2020-07-07 14:46:41 +02:00
Ines Montani bb3ee38cf9 Update WIP 2020-07-06 22:22:37 +02:00
Ines Montani 44da24ddd0 Update doc.md 2020-07-06 18:17:00 +02:00
Ines Montani 44790c1c32 Update docs and add keyword-only tag 2020-07-06 18:14:57 +02:00
Ines Montani a35236e5f0 Update v3 docs WIP [ci skip] 2020-07-06 15:57:44 +02:00
Ines Montani 63247cbe87 Update v3 docs [ci skip] 2020-07-05 16:11:16 +02:00
Matthew Honnibal 3e78e82a83
Experimental character-based pretraining (#5700)
* Use cosine loss in Cloze multitask

* Fix char_embed for gpu

* Call resume_training for base model in train CLI

* Fix bilstm_depth default in pretrain command

* Implement character-based pretraining objective

* Use chars loss in ClozeMultitask

* Add method to decode predicted characters

* Fix number characters

* Rescale gradients for mlm

* Fix char embed+vectors in ml

* Fix pipes

* Fix pretrain args

* Move get_characters_loss

* Fix import

* Fix import

* Mention characters loss option in pretrain

* Remove broken 'self attention' option in pretrain

* Revert "Remove broken 'self attention' option in pretrain"

This reverts commit 56b820f6af.

* Document 'characters' objective of pretrain
2020-07-05 15:48:39 +02:00
Ines Montani dc8c9d912f Update docs [ci skip] 2020-07-04 16:47:24 +02:00
Ines Montani 4498dfe99d Update docs 2020-07-04 16:25:30 +02:00
Ines Montani 1e0d54edd1 Update docs 2020-07-04 14:23:10 +02:00
Ines Montani fe224dc2dd Merge branch 'develop' into nightly.spacy.io 2020-07-03 16:48:27 +02:00
Ines Montani 06f1ecb308 Update v3 docs 2020-07-03 16:48:21 +02:00
Ines Montani cdf9ee1716 Add stub for Example API docs [ci skip] 2020-07-03 15:46:10 +02:00
Ines Montani fa8e097c04 Update convert docs [ci skip] 2020-07-03 15:42:04 +02:00
Jan Jessewitsch e4dcac4a4b
Merging multiple docs into one (#5032)
* Add static method to Doc to allow merging of multiple docs.

* Add error description for the error that occurs if docs with different
vocabs (from different languages) are merged in Doc.from_docs().

* Add test for Doc.from_docs() implementation.

* Fix using numpy's concatenate in Doc.from_docs.

* Replace typing's type annotations in from_docs.

* Simply remove type annotations in from_docs.

* Add documentation for Doc.from_docs to api.

* Simplify from_docs, its test and the api doc for codebase consistency.

* Fix merging of Doc objects that end with whitespaces (Achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes.

* Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages.

* Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test.

* Add MORPH to attrs

* Update warnings calls

* Remove out-dated error from merge

* Rename space_delimiter to ensure_whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-07-03 11:32:42 +02:00
Adriane Boyd a723fa02a1
DocBin: add version number, missing attributes and strings (#5685)
* Add version number to DocBin

Add a version number to DocBin for future use.

* Add POS to all attributes in DocBin

* Add morph string to strings in DocBin

* Update DocBin API

* Add string for ENT_KB_ID in DocBin
2020-07-02 17:41:50 +02:00
Ines Montani b5268955d7 Update matcher usage examples [ci skip] 2020-07-02 15:39:45 +02:00
Ines Montani a4cfe9fc33 Remove inline notes on v2 changes [ci skip] 2020-07-01 22:29:22 +02:00
Ines Montani fe4cfd0632 Start updating website for v3 [ci skip] 2020-07-01 21:26:39 +02:00
Ines Montani 26df4efa94 Add new in v3.0 2020-07-01 13:02:17 +02:00
Ines Montani 18a900abc2 Fix markup 2020-07-01 13:02:07 +02:00
Ines Montani 414dc7ace1 Merge branch 'spacy.io' into spacy.io-develop 2020-07-01 11:47:47 +02:00
Álvaro Abella Bascarán 7111b9de2e Fix in docs: pipe(docs) instead of pipe(texts) (#5680)
Very minor fix in docs, specifically in this part:

```
 matcher = PhraseMatcher(nlp.vocab)
>   for doc in matcher.pipe(texts, batch_size=50):
>       pass
```

`texts` suggests the input is an iterable of strings. I replaced it for `docs`.
2020-06-30 20:01:12 +02:00
Álvaro Abella Bascarán ff0dbe5c64
Fix in docs: pipe(docs) instead of pipe(texts) (#5680)
Very minor fix in docs, specifically in this part:

```
 matcher = PhraseMatcher(nlp.vocab)
>   for doc in matcher.pipe(texts, batch_size=50):
>       pass
```

`texts` suggests the input is an iterable of strings. I replaced it for `docs`.
2020-06-30 20:00:50 +02:00
Matthias Hertel 305221f3e5 Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:55 +02:00
Matthias Hertel 8b0f749606
Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:23 +02:00
Adriane Boyd d777d9cc38 Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:13:01 +02:00
Adriane Boyd c4d0209472
Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:12:29 +02:00
Adriane Boyd a2660bd9c6 Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:57 +02:00
Adriane Boyd fd4287c178
Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:12 +02:00
Adriane Boyd 4f73ced914 Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:50:43 +02:00
Adriane Boyd 7ce451c211
Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:48:59 +02:00
Adriane Boyd fcdecefacf Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:38:06 +02:00
Adriane Boyd bc1cb30b21
Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:37:24 +02:00
Ines Montani 52728d8fa3 Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
Adriane Boyd 66889de166 Warning for sudachipy 0.4.5 (#5611) 2020-06-19 13:45:23 +02:00
Adriane Boyd 931d80de72
Warning for sudachipy 0.4.5 (#5611) 2020-06-19 12:43:41 +02:00
Ines Montani 6d712f3e06
Merge pull request #5599 from adrianeboyd/docs/v2.3.0-minor 2020-06-16 13:49:25 -07:00
Adriane Boyd 02369f91d3 Fix spacy convert argument 2020-06-16 20:41:17 +02:00
Adriane Boyd f0fd77648f Change example title to Dr.
Change example title to Dr. so the current model does exclude the title
in the initial example.
2020-06-16 20:36:21 +02:00
Adriane Boyd a6abdfbc3c Fix numpy.zeros() dtype for Doc.from_array 2020-06-16 20:35:45 +02:00
Adriane Boyd 9aff317ca7 Update POS in tagging example 2020-06-16 20:26:57 +02:00
Adriane Boyd 457babfa0c Update alignment example for new gold.align 2020-06-16 20:22:03 +02:00
Ines Montani 44af53bdd9 Add pkuseg warnings and auto-format [ci skip] 2020-06-16 17:13:35 +02:00
Ines Montani a9e5b840ee Fix typos and auto-format [ci skip] 2020-06-16 16:38:45 +02:00
Adriane Boyd d5110ffbf2
Documentation updates for v2.3.0 (#5593)
* Update website models for v2.3.0

* Add docs for Chinese word segmentation

* Tighten up Chinese docs section

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Auto-format and update version

* Update matcher.md

* Update languages and sorting

* Typo in landing page

* Infobox about token_match behavior

* Add meta and basic docs for Japanese

* POS -> TAG in models table

* Add info about lookups for normalization

* Updates to API docs for v2.3

* Update adding norm exceptions for adding languages

* Add --omit-extra-lookups to CLI API docs

* Add initial draft of "What's New in v2.3"

* Add new in v2.3 tags to Chinese and Japanese sections

* Add tokenizer to migration section

* Add new in v2.3 flags to init-model

* Typo

* More what's new in v2.3

Co-authored-by: Ines Montani <ines@ines.io>
2020-06-16 15:37:35 +02:00
Sofie Van Landeghem c0f4a1e43b
train is from-config by default (#5575)
* verbose and tag_map options

* adding init_tok2vec option and only changing the tok2vec that is specified

* adding omit_extra_lookups and verifying textcat config

* wip

* pretrain bugfix

* add replace and resume options

* train_textcat fix

* raw text functionality

* improve UX when KeyError or when input data can't be parsed

* avoid unnecessary access to goldparse in TextCat pipe

* save performance information in nlp.meta

* add noise_level to config

* move nn_parser's defaults to config file

* multitask in config - doesn't work yet

* scorer offering both F and AUC options, need to be specified in config

* add textcat verification code from old train script

* small fixes to config files

* clean up

* set default config for ner/parser to allow create_pipe to work as before

* two more test fixes

* small fixes

* cleanup

* fix NER pickling + additional unit test

* create_pipe as before
2020-06-12 02:02:07 +02:00
Sofie Van Landeghem 4d1ba6feb4
add tag variant for 2.3 (#5542) 2020-06-04 19:16:33 +02:00
Ines Montani 810fce3bb1 Merge branch 'develop' into master-tmp 2020-06-03 14:36:59 +02:00
svlandeg 5f0a91cf37 fix conv-depth parameter 2020-05-29 09:56:29 +02:00
Ines Montani 262d306eaa unicode -> str consistency 2020-05-24 17:23:00 +02:00
Ines Montani 5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Jannis aa53ce6996
Documentation Typo Fix (#5492)
* Fix typo

Change 'realize' to 'realise'

* Add contributer agreement
2020-05-22 19:50:26 +02:00
Matthew Honnibal f6078d866a
Merge pull request #5121 from adrianeboyd/bugfix/revert-token-match
Revert token_match priority changes from #4374 and extend token match options
2020-05-22 14:42:51 +02:00
Ines Montani 65c7e82de2 Auto-format and remove 2.3 feature [ci skip] 2020-05-22 13:50:30 +02:00
Adriane Boyd e4a1b5dab1 Rename to url_match
Rename to `url_match` and update docs.
2020-05-22 12:41:03 +02:00
Adriane Boyd 730fa493a4 Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-22 12:18:00 +02:00
Ines Montani 24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
Sofie Van Landeghem 0d94737857
Feature toggle_pipes (#5378)
* make disable_pipes deprecated in favour of the new toggle_pipes

* rewrite disable_pipes statements

* update documentation

* remove bin/wiki_entity_linking folder

* one more fix

* remove deprecated link to documentation

* few more doc fixes

* add note about name change to the docs

* restore original disable_pipes

* small fixes

* fix typo

* fix error number to W096

* rename to select_pipes

* also make changes to the documentation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-18 22:27:10 +02:00
Ines Montani f333c2a011
Merge pull request #5386 from svlandeg/fix/nel-docs 2020-05-10 12:00:09 +02:00
adrianeboyd 4a15b559ba
Clarify Token.pos as UPOS (#5419) 2020-05-08 10:36:25 +02:00
adrianeboyd a2345618f1
Fix Token API docs from #5375 (#5418) 2020-05-08 10:25:02 +02:00
Adriane Boyd 565e0eef73 Add tokenizer option for token match with affixes
To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
2020-05-05 10:35:33 +02:00
Adriane Boyd 792c8af8cf Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-05 09:25:57 +02:00
svlandeg ebaed7dcfa Few more updates to the EL documentation 2020-04-30 10:17:06 +02:00
adrianeboyd bdff76dede
Various updates/additions to CLI scripts (#5362)
* `debug-data`: determine coverage of provided vectors

* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization

* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
Sofie Van Landeghem cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Sofie Van Landeghem f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd a6e521cd79
Add is_sent_end token property (#5375)
Reconstruction of the original PR #4697 by @MiniLau.

Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
adrianeboyd 90ce34db42
Add cuda101 and cuda102 options to setup (#5377)
* Add cuda101 and cuda102 options to setup

* Update cudaNNN options in docs
2020-04-29 12:51:12 +02:00
adrianeboyd 792aa7b6ab
Remove references to textcat spans (#5360)
Remove references to unimplemented `TextCategorizer` span labels in
`GoldParse` and `Doc`.
2020-04-27 18:01:12 +02:00
adrianeboyd 90c754024f
Update nlp.vectors to nlp.vocab.vectors (#5357) 2020-04-27 10:53:05 +02:00
Mike 481574cbc8
[minor doc change] embedding vis. link is broken in `website/docs/usage/examples.md` (#5325)
* The embedding vis. link is broken

The first link seems to be reasonable for now unless someone has an updated embedding vis they want to share?

* contributor agreement

* Update Mlawrence95.md

* Update website/docs/usage/examples.md

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-21 20:35:12 +02:00
laszabine fb73d4943a
Amend documentation to Language.evaluate (#5319)
* Specified usage of arguments to Language.evaluate

* Created contributor agreement
2020-04-16 20:00:18 +02:00
Sofie Van Landeghem a3965ec13d
tag-map-path since 2.2.4 instead of 2.2.3 (#5289) 2020-04-14 14:53:47 +02:00