Commit Graph

2613 Commits

Author SHA1 Message Date
Adriane Boyd e39d1bd4ab
Various docs updates for v3.1 (#8406)
* Update for Catalan/Italian lemmatizer changes

* Add warning about relevance of section
2021-06-21 09:33:50 +02:00
Ines Montani 02d2fdb123 Add link anchor [ci skip] 2021-06-20 11:29:19 +10:00
Matthew Honnibal 6f5e308d17
Support negative examples in partial NER annotations (#8106)
* Support a cfg field in transition system

* Make NER 'has gold' check use right alignment for span

* Pass 'negative_samples_key' property into NER transition system

* Add field for negative samples to NER transition system

* Check neg_key in NER has_gold

* Support negative examples in NER oracle

* Test for negative examples in NER

* Fix name of config variable in NER

* Remove vestiges of old-style partial annotation

* Remove obsolete tests

* Add comment noting lack of support for negative samples in parser

* Additions to "neg examples" PR (#8201)

* add custom error and test for deprecated format

* add test for unlearning an entity

* add break also for Begin's cost

* add negative_samples_key property on Parser

* rename

* extend docs & fix some older docs issues

* add subclass constructors, clean up tests, fix docs

* add flaky test with ValueError if gold parse was not found

* remove ValueError if n_gold == 0

* fix docstring

* Hack in environment variables to try out training

* Remove hack

* Remove NER hack, and support 'negative O' samples

* Fix O oracle

* Fix transition parser

* Remove 'not O' from oracle

* Fix NER oracle

* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation

* use set instead of list in consistency check

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-17 17:33:00 +10:00
Sofie Van Landeghem e796aab4b3
Resizable textcat (#7862)
* implement textcat resizing for TextCatCNN

* resizing textcat in-place

* simplify code

* ensure predictions for old textcat labels remain the same after resizing (WIP)

* fix for softmax

* store softmax as attr

* fix ensemble weight copy and cleanup

* restructure slightly

* adjust documentation, update tests and quickstart templates to use latest versions

* extend unit test slightly

* revert unnecessary edits

* fix typo

* ensemble architecture won't be resizable for now

* use resizable layer (WIP)

* revert using resizable layer

* resizable container while avoid shape inference trouble

* cleanup

* ensure model continues training after resizing

* use fill_b parameter

* use fill_defaults

* resize_layer callback

* format

* bump thinc to 8.0.4

* bump spacy-legacy to 3.0.6
2021-06-16 11:45:00 +02:00
Adriane Boyd 5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Sofie Van Landeghem 0fd0d949c4
fix 's typo's across code base (#8384) 2021-06-15 10:57:08 +02:00
Adriane Boyd 507422149f
Various docs updates for v3.0 (#8353)
* Update cats score names in Scorer API docs

* Refer to performance in meta

* Update package naming/versions, lemmatizer details

* Minor formatting fixes

* Provide more explanation for cats_score_desc

* Provide language-specific lemmatizer defaults in API docs

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-06-14 12:19:36 +02:00
Adriane Boyd 63d748f80e
Add Catalan and Danish trf to website models (#8378) 2021-06-14 09:50:13 +02:00
Ines Montani 3259faad42 Update YouTube embed [ci skip] 2021-06-14 10:21:01 +10:00
Ines Montani 7f0f674a1b Fix universe.json and auto-format [ci skip] 2021-06-14 10:18:06 +10:00
Francisco Aranda 0a1a4c665d
update spacy-wordnet code example (#8327)
* update spacy-wordnet code example

- include spaCy 2.x and 3.x init alternatives
- upgrade recognai logo

* fix escape chars
2021-06-10 21:53:11 +02:00
Paul O'Leary McCann 5aba213349 Fix skweak Github URL
Github entry should not contain url, just user/repo
2021-05-31 18:00:43 +09:00
Kristian Boda dc8d8d15d2
Add hmrb to spaCy Universe (#8129)
* docs: add hmrb to spacy universe

* docs: add sentence on spacy versions

* docs: update description and images

* misc: add spaCy Contributor Agreement
2021-05-31 18:40:48 +10:00
Sofie Van Landeghem 3c58c0323f
fix docs (#8200) 2021-05-27 10:48:59 +02:00
Paul O'Leary McCann 0c553ecd4e Fix docs (fix #8189) 2021-05-24 19:47:30 +09:00
Sofie Van Landeghem 202943bc8c
KB & NEL to/from bytes (#8113)
* unit test for pickling KB

* add pickling test for NEL

* KB to_bytes and from_bytes

* NEL to_bytes and from_bytes

* xfail pickle tests for now

* fix docs

* cleanup
2021-05-20 18:11:30 +10:00
Adriane Boyd 6baab565eb
Minor updates to quickstart settings/instructions (#7965)
* Minor updates to quickstart settings/instructions

* set default value of textcat exclusive to `false` until the default
checkbox behavior is updated
* add the `morphologizer` to the list of components
* add a note that v3.0.6+ is required

* Switch to warning above quickstart

* Undo changes to textcat default in quickstart

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-05-17 16:55:22 +02:00
Julien Salinas c496f78245 Add NLP Cloud to Universe. 2021-05-14 11:13:44 +02:00
Frederic R. Hopp c5962b9fba
Update universe.json
fixed typo
2021-05-13 07:40:05 -07:00
Frederic R. Hopp a9ca221e03
Update universe.json
Added more detailed description to eMFDscore project
2021-05-12 09:20:17 -07:00
Frederic R. Hopp 7bba9cdc14
Update universe.json 2021-05-11 19:18:19 -07:00
Ines Montani 3883d49446 Fix default transformer in quickstart generator (resolves #8018) [ci skip] 2021-05-11 11:27:08 +10:00
Adriane Boyd 71c2a3ab47
Fix new version for match_alignments (#8021) 2021-05-07 09:55:20 +02:00
Jeno Pizarro 5cf76ab608
Update negspacy example code for spaCy 3.0 (#8022) 2021-05-07 09:33:21 +02:00
Sofie Van Landeghem 02a6a5fea0
Fix 'debug model' for transformers + generalize (#7973)
* add overrides to docs

* fix debug model with transformer

* assume training data is set in config
2021-05-06 18:43:32 +10:00
Paul O'Leary McCann 66bfabd839
Fix pretraining objectives fragment (#8005)
* Fix pretraining objectives fragment

The fragment here is reused from a heading higher up, so you couldn't
link to this section.

* Fix section link to new fragment
2021-05-06 08:27:36 +02:00
meghanabhange debaab7021
Update details in universe denomme | Multilingual Name Detection (#7982)
* Add denomme

* spaCy contributor agreement

* Update install and thumb

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-05 17:12:13 +02:00
Ines Montani 12d3d0fedd Fix quickstart default checked of conditional fields [ci skip] 2021-05-03 11:48:12 +10:00
Adriane Boyd 2320791f6d
Fix Transformer.initialize example (#7963) 2021-04-30 12:21:31 +02:00
Adriane Boyd 95c0833656
Add training option to set annotations on update (#7767)
* Add training option to set annotations on update

Add a `[training]` option called `set_annotations_on_update` to specify
a list of components for which the predicted annotations should be set
on `example.predicted` immediately after that component has been
updated. The predicted annotations can be accessed by later components
in the pipeline during the processing of the batch in the same `update`
call.

* Rename to annotates / annotating_components

* Add test for `annotating_components` when training from config

* Add documentation
2021-04-26 16:53:53 +02:00
Adriane Boyd bdb485cc80
Add callback to copy vocab/tokenizer from model (#7750)
* Add callback to copy vocab/tokenizer from model

Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.

* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks

* Add documentation

* Modify to specify model as tokenizer and vocab params
2021-04-22 12:36:50 +02:00
Adriane Boyd f68fc29130
Update sent_starts in Example.from_dict (#7847)
* Update sent_starts in Example.from_dict

Update `sent_starts` for `Example.from_dict` so that `Optional[bool]`
values have the same meaning as for `Token.is_sent_start`.

Use `Optional[bool]` as the type for sent start values in the docs.

* Use helper function for conversion to ternary ints
2021-04-22 11:32:45 +02:00
Adriane Boyd d2bdaa7823
Replace negative rows with 0 in StaticVectors (#7674)
* Replace negative rows with 0 in StaticVectors

Replace negative row indices with 0-vectors in `StaticVectors`.

* Increase versions related to StaticVectors

* Increase versions of all architctures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations

Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5

* Update config defaults to new versions

* Update docs
2021-04-22 18:04:15 +10:00
Sofie Van Landeghem 6f565cf39d
fix typo in entity_linker docs 2021-04-22 09:59:24 +02:00
Sofie Van Landeghem 2e746dbf32
update EL training data format in docs (#7839)
* update EL training data format

* fix typo

* all -1 because reasons
2021-04-22 08:50:09 +02:00
meghanabhange 49ff1126bf
Project Idea : denomme | Multilingual Name Detection (#7845)
* Add denomme

* spaCy contributor agreement

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-22 08:48:17 +02:00
Sam Edwardes b8c6c10c6f
Added a logo to spaCyTextBlob (#7818)
* Added a logo to spaCyTextBlob

* Updated to better thumb
2021-04-22 08:41:55 +02:00
Diego Palma bbade153ed
Add TRUNAJOD to spaCy universe. (#7754)
* Add TRUNAJOD to spaCy universe.

* Add trunajod logo and thumb.

Co-authored-by: Diego <dpalma@evernote.com>
2021-04-22 08:40:28 +02:00
Ines Montani a9e5ae9b5c Auto-format [ci skip] 2021-04-22 10:58:05 +10:00
Pierre Lison debfb46088 adding skweak to the SpaCy universe 2021-04-22 00:58:09 +02:00
Shantam Raj 6017fcf693
Default code for Setting Entity annotations on the website errors (#7738)
* the default example for "Setting entity annotations" errors on Binder

* updating contributer info

* using a new variable to store original entities
2021-04-21 09:16:32 +02:00
hudsonr 2722424ec5 Added universe entry for Coreferee 2021-04-19 14:28:06 +02:00
langdonholmes df541c6b5e
Update processing-pipelines.md to mention method for doc metadata (#7480)
* Update processing-pipelines.md

Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)

Link to a new example on the attributes page detailing the following:

> ```
> data = [
>   ("Some text to process", {"meta": "foo"}),
>   ("And more text...", {"meta": "bar"})
> ]
> 
> for doc, context in nlp.pipe(data, as_tuples=True):
>     # Let's assume you have a "meta" extension registered on the Doc
>     doc._.meta = context["meta"]
> ```

from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as

* Updating the attributes section

Update the attributes section with example of how extensions can be used to store metadata.

* Update processing-pipelines.md

* Update processing-pipelines.md

Made as_tuples example executable and relocated to the end of the "Processing Text" section.

* Update processing-pipelines.md

* Update processing-pipelines.md

Removed extra line

* Reformat and rephrase

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-19 11:58:12 +02:00
Adriane Boyd 0e7f94b247
Update Tokenizer.explain with special matches (#7749)
* Update Tokenizer.explain with special matches

Update `Tokenizer.explain` and the pseudo-code in the docs to include
the processing of special cases that contain affixes or whitespace.

* Handle optional settings in explain

* Add test for special matches in explain

Add test for `Tokenizer.explain` for special cases containing affixes.
2021-04-19 19:08:20 +10:00
Sofie Van Landeghem c786e98e56
assemble CLI command (#7783)
* assemble CLI command

* ensure assemble runs even without training section

* cleanup
2021-04-19 18:39:11 +10:00
Bram Vanroy ed561cf428
Terminology: deprecated vs obsolete (#7621)
* Terminology: deprecated vs obsolete

Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works.

In light of this, perhaps all other error codes should be checked as well.

* clarify that the link command is removed and not just deprecated

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-12 14:37:00 +02:00
Adriane Boyd 673e2bc4c0
Add usage docs for streamed train corpora (#7693) 2021-04-09 16:15:38 +02:00
Sofie Van Landeghem 3e5bd5055e
expand quickstart widget with cuda 11.1 and 11.2 (#7615) 2021-04-08 12:25:42 +02:00
Sofie Van Landeghem 204c2f116b
Extend score_spans for overlapping & non-labeled spans (#7209)
* extend span scorer with consider_label and allow_overlap

* unit test for spans y2x overlap

* add score_spans unit test

* docs for new fields in scorer.score_spans

* rename to include_label

* spell out if-else for clarity

* rename to 'labeled'

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-08 12:19:17 +02:00
broaddeep ee159b8543
Support match alignments (#7321)
* Support match alignments

* change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case

* remove added errors, utilize bint type, cleanup whitespace

* fix no new line in end of file

* Minor formatting

* Skip alignments processing if as_spans is set

* Add with_alignments to Matcher API docs

* Update website/docs/api/matcher.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-04-08 18:10:14 +10:00