1012 Commits

Author SHA1 Message Date
Ines Montani 68721af628 Formatting and preliminary intro [ci skip] 2021-06-24 20:32:23 +10:00
Adriane Boyd 92dc6b409e Notes on source with vectors 2021-06-24 10:34:07 +02:00
Adriane Boyd 35425d7e26 Add details for Catalan and Danish 2021-06-24 10:10:33 +02:00
Ines Montani 5daf450f51 Update upgrading notes [ci skip] 2021-06-24 18:06:28 +10:00
Ines Montani 528746129d Merge branch 'master' into docs/new-in-v3-1 2021-06-24 13:11:37 +10:00
Ines Montani 3e058dee62 Update features [ci skip] 2021-06-24 12:36:04 +10:00
Ines Montani a1e4aca267 Fix sentence [ci skip] 2021-06-24 11:40:36 +10:00
Ines Montani ca0d904faa Update details [ci skip] 2021-06-23 13:05:56 +10:00
themrmax d96c422cfc
Fix broken link
change /api/registry to /api/top-level#registry
2021-06-22 15:34:06 -07:00
Ines Montani e9b68d4f4c Update details and add example [ci skip] 2021-06-22 17:51:03 +10:00
Nick Sorros 31504f5982
Switch model and data path in prodigy project.yml recipe (#8467) 2021-06-22 09:41:45 +02:00
Ines Montani bc93c34f54 Add "New in v3.1" guide 2021-06-22 15:23:18 +10:00
Ines Montani 02d2fdb123 Add link anchor [ci skip] 2021-06-20 11:29:19 +10:00
svlandeg bb9d2f1546 extend example to ensure the text is preserved 2021-06-16 23:56:35 +02:00
Sofie Van Landeghem e796aab4b3
Resizable textcat (#7862)
* implement textcat resizing for TextCatCNN

* resizing textcat in-place

* simplify code

* ensure predictions for old textcat labels remain the same after resizing (WIP)

* fix for softmax

* store softmax as attr

* fix ensemble weight copy and cleanup

* restructure slightly

* adjust documentation, update tests and quickstart templates to use latest versions

* extend unit test slightly

* revert unnecessary edits

* fix typo

* ensemble architecture won't be resizable for now

* use resizable layer (WIP)

* revert using resizable layer

* resizable container while avoiding shape inference trouble

* cleanup

* ensure model continues training after resizing

* use fill_b parameter

* use fill_defaults

* resize_layer callback

* format

* bump thinc to 8.0.4

* bump spacy-legacy to 3.0.6
2021-06-16 11:45:00 +02:00
svlandeg 29d83dec0c adjust whitespace tokenizer to avoid sep in split() 2021-06-16 10:58:45 +02:00
Adriane Boyd 5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Sofie Van Landeghem 0fd0d949c4
fix 's typo's across code base (#8384) 2021-06-15 10:57:08 +02:00
Adriane Boyd 6baab565eb
Minor updates to quickstart settings/instructions (#7965)
* Minor updates to quickstart settings/instructions

* set default value of textcat exclusive to `false` until the default
checkbox behavior is updated
* add the `morphologizer` to the list of components
* add a note that v3.0.6+ is required

* Switch to warning above quickstart

* Undo changes to textcat default in quickstart

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-05-17 16:55:22 +02:00
Paul O'Leary McCann 66bfabd839
Fix pretraining objectives fragment (#8005)
* Fix pretraining objectives fragment

The fragment here is reused from a heading higher up, so you couldn't
link to this section.

* Fix section link to new fragment
2021-05-06 08:27:36 +02:00
Adriane Boyd 95c0833656
Add training option to set annotations on update (#7767)
* Add training option to set annotations on update

Add a `[training]` option called `set_annotations_on_update` to specify
a list of components for which the predicted annotations should be set
on `example.predicted` immediately after that component has been
updated. The predicted annotations can be accessed by later components
in the pipeline during the processing of the batch in the same `update`
call.

* Rename to annotates / annotating_components

* Add test for `annotating_components` when training from config

* Add documentation
2021-04-26 16:53:53 +02:00
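
The ordering described in the commit above (update a component, then immediately write its predictions so later components can read them) can be pictured with a toy training loop. This is a minimal sketch and not spaCy's implementation: `ToyComponent`, `toy_update`, and the dict standing in for `example.predicted` are hypothetical; only the option name `annotating_components` is taken from the commit message.

```python
from typing import Callable, Dict, List


class ToyComponent:
    """Hypothetical stand-in for a trainable pipeline component."""

    def __init__(self, name: str, predict: Callable[[Dict[str, str]], str]):
        self.name = name
        self.predict = predict
        # Snapshots of `predicted` as this component saw it during update
        self.seen: List[Dict[str, str]] = []

    def update(self, predicted: Dict[str, str]) -> None:
        # A real component would compute a loss and adjust weights here.
        self.seen.append(dict(predicted))

    def set_annotations(self, predicted: Dict[str, str]) -> None:
        predicted[self.name] = self.predict(predicted)


def toy_update(
    pipeline: List[ToyComponent],
    predicted: Dict[str, str],
    annotating_components: List[str],
) -> Dict[str, str]:
    """Update components in order; those named in `annotating_components`
    write their predictions right after their own update."""
    for component in pipeline:
        component.update(predicted)
        if component.name in annotating_components:
            component.set_annotations(predicted)
    return predicted


tagger = ToyComponent("tagger", lambda doc: "NOUN")
parser = ToyComponent(
    "parser", lambda doc: "nsubj" if doc.get("tagger") == "NOUN" else "dep"
)

result = toy_update(
    [tagger, parser], {"text": "cats"}, annotating_components=["tagger"]
)
```

In this sketch the parser's update sees the tagger's prediction because the tagger is listed as annotating, while the parser's own prediction is never written because it is not listed.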
Adriane Boyd d2bdaa7823
Replace negative rows with 0 in StaticVectors (#7674)
* Replace negative rows with 0 in StaticVectors

Replace negative row indices with 0-vectors in `StaticVectors`.

* Increase versions related to StaticVectors

* Increase versions of all architectures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations

Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5

* Update config defaults to new versions

* Update docs
2021-04-22 18:04:15 +10:00
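
The idea of mapping negative row indices to 0-vectors can be illustrated in plain Python. This is only a sketch under the assumption that a negative index marks an entry with no vector; `lookup_rows` is a hypothetical helper, not part of `StaticVectors`.

```python
from typing import List, Sequence


def lookup_rows(
    table: Sequence[Sequence[float]], rows: Sequence[int]
) -> List[List[float]]:
    """Return one vector per row index; negative indices yield a zero vector."""
    width = len(table[0])
    return [list(table[r]) if r >= 0 else [0.0] * width for r in rows]


vectors = [[1.0, 2.0], [3.0, 4.0]]
output = lookup_rows(vectors, [1, -1, 0])
```

The explicit check matters because ordinary Python indexing would treat `-1` as "last row" and silently return a real vector instead of zeros.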
Shantam Raj 6017fcf693
Default code for Setting Entity annotations on the website errors (#7738)
* the default example for "Setting entity annotations" errors on Binder

* updating contributor info

* using a new variable to store original entities
2021-04-21 09:16:32 +02:00
langdonholmes df541c6b5e
Update processing-pipelines.md to mention method for doc metadata (#7480)
* Update processing-pipelines.md

Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)

Link to a new example on the attributes page detailing the following:

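> ```
> import spacy
> from spacy.tokens import Doc
>
> # Register the "meta" extension on the Doc so doc._.meta can be assigned
> Doc.set_extension("meta", default=None)
> nlp = spacy.blank("en")
>
> data = [
>   ("Some text to process", {"meta": "foo"}),
>   ("And more text...", {"meta": "bar"})
> ]
>
> # With as_tuples=True, nlp.pipe yields (doc, context) pairs
> for doc, context in nlp.pipe(data, as_tuples=True):
>     doc._.meta = context["meta"]
> ```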

from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as

* Updating the attributes section

Update the attributes section with example of how extensions can be used to store metadata.

* Update processing-pipelines.md

* Update processing-pipelines.md

Made as_tuples example executable and relocated to the end of the "Processing Text" section.

* Update processing-pipelines.md

* Update processing-pipelines.md

Removed extra line

* Reformat and rephrase

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-19 11:58:12 +02:00
Adriane Boyd 0e7f94b247
Update Tokenizer.explain with special matches (#7749)
* Update Tokenizer.explain with special matches

Update `Tokenizer.explain` and the pseudo-code in the docs to include
the processing of special cases that contain affixes or whitespace.

* Handle optional settings in explain

* Add test for special matches in explain

Add test for `Tokenizer.explain` for special cases containing affixes.
2021-04-19 19:08:20 +10:00
Bram Vanroy ed561cf428
Terminology: deprecated vs obsolete (#7621)
* Terminology: deprecated vs obsolete

Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works.

In light of this, perhaps all other error codes should be checked as well.

* clarify that the link command is removed and not just deprecated

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-12 14:37:00 +02:00
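
The distinction drawn in the commit above can be shown with a small Python sketch. The function names and messages here are hypothetical and do not reproduce spaCy's actual warnings or errors; the point is only that a deprecated path still works and warns, while an obsolete path fails.

```python
import warnings


def download_by_shortcut(name: str) -> str:
    """Deprecated: still works, but warns the caller."""
    warnings.warn(
        f"Shortcut '{name}' is deprecated; use the full package name instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return f"downloaded:{name}"


def load_by_shortcut(name: str) -> None:
    """Obsolete: the feature has been removed, so calling it is an error."""
    raise OSError(f"Loading by shortcut '{name}' is no longer supported.")
```

A caller of `download_by_shortcut` gets a result plus a `DeprecationWarning`; a caller of `load_by_shortcut` gets an exception and no result.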
Adriane Boyd 673e2bc4c0
Add usage docs for streamed train corpora (#7693) 2021-04-09 16:15:38 +02:00
Ayush Chaurasia 3c2ce41dd8
W&B integration: Optional support for dataset and model checkpoint logging and versioning (#7429)
* Add optional artifacts logging

* Update docs

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Bump WandbLogger Version

* Add documentation of v1 to legacy docs

* bump spacy-legacy to 3.0.2 (to be released)

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-01 19:36:23 +02:00
Santiago Castro af07fc3bc1
Add support for CUDA 11.2 (#7583)
* Add support for CUDA 11.2

* Update the docs

* Format

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-03-30 09:47:33 +02:00
Álvaro Abella Bascarán 5b4dde38a3
fix fn name: tokenizer.infixes_finditer -> tokenizer.infix_finditer (#7606) 2021-03-30 09:45:49 +02:00
Adriane Boyd 0d2b723e8d Update entity setting section 2021-03-20 11:38:55 +01:00
Adriane Boyd 6a9a467766
Update website/docs/usage/processing-pipelines.md
Co-authored-by: Ines Montani <ines@ines.io>
2021-03-19 08:12:49 +01:00
Adriane Boyd 40e5d3a980 Update saving/loading example 2021-03-18 16:56:10 +01:00
Adriane Boyd 0fb1881f36 Reformat processing pipelines 2021-03-18 13:31:42 +01:00
Adriane Boyd acc58719da Update custom similarity hooks example 2021-03-18 13:31:42 +01:00
Adriane Boyd c9e1a9ac17 Add multiprocessing section 2021-03-18 13:31:42 +01:00
Adriane Boyd 9a254d3995 Include all en_core_web_sm components in examples 2021-03-18 13:31:42 +01:00
bsweileh 61472e7cb3
Update _training.md - Fix broken link on backpropagation (#7431)
* Update _training.md

Fix broken link on backpropagation

* Add agreement

add spacy contributor agreement
2021-03-15 09:21:35 +01:00
Adriane Boyd d746ea6278
Add warning about GPU selection in Jupyter notebooks (#7075)
* Initial warning

* Update check

* Redo edit

* Move jupyter warning to helper method

* Add link with details to warnings
2021-03-09 15:35:21 +01:00
Sofie Van Landeghem 932887b950
textcat scoring fix and multi_label docs (#6974)
* add multi-label textcat to menu

* add infobox on textcat API

* add info to v3 migration guide

* small edits

* further fixes in doc strings

* add infobox to textcat architectures

* add textcat_multilabel to overview of built-in components

* spelling

* fix unrelated warn msg

* Add textcat_multilabel to quickstart [ci skip]

* remove separate documentation page for multilabel_textcategorizer

* small edits

* positive label clarification

* avoid duplicating information in self.cfg and fix textcat.score

* fix multilabel textcat too

* revert threshold to storage in cfg

* revert threshold stuff for multi-textcat

Co-authored-by: Ines Montani <ines@ines.io>
2021-03-09 23:04:22 +11:00
Ines Montani dfb23a419e Merge branch 'spacy.io' [ci skip] 2021-03-06 17:38:54 +11:00
graue70 7d085d5b1c
Fix typo in docs 2021-03-05 18:30:09 +01:00
svlandeg d900c55061 consistently use registry as callable 2021-03-02 17:56:28 +01:00
svlandeg 08fd901a1b kb.get_candidates renamed to get_alias_candidates 2021-02-25 20:09:36 +01:00
Ines Montani 24cecbb3f4
Merge pull request #7126 from adrianeboyd/docs/gpu-id-opt [ci skip]
Add tip about --gpu-id to training quickstart
2021-02-24 22:34:17 +11:00
Tocic b1996a51a1
fix typo in models.md (#7157) 2021-02-22 09:00:38 +01:00
Adriane Boyd 7198be0f4b Add tip about --gpu-id to training quickstart 2021-02-19 14:07:51 +01:00
Sofie Van Landeghem 709c9e75af
span.sent only returns first sentence (#7084)
* return first sentence when span contains sentence boundary

* docs fix

* small fixes

* cleanup
2021-02-19 23:02:38 +11:00
palandlom 9b82586699
var batch is useless (#7111)
It seems that nlp.update(examples) should be nlp.update(batch)
2021-02-18 09:44:22 +01:00
Ines Montani fc4fb6eb3a Make v2.x docs more prominent [ci skip] 2021-02-17 23:42:27 +11:00
Ines Montani c08b3f294c Support env vars and CLI overrides for project.yml 2021-02-10 13:45:27 +11:00
svlandeg 9a7f33c916 final 3.0 benchmark numbers 2021-02-09 21:28:33 +01:00
svlandeg bb7482bef8 fix link 2021-02-08 18:39:59 +01:00
Ines Montani 433835d9b0
Merge pull request #6889 from adrianeboyd/docs/source-install-dup [ci skip] 2021-02-05 13:35:16 +11:00
Ines Montani 2cdfcd2d19 Update naming [ci skip] 2021-02-03 12:48:31 +11:00
Adriane Boyd 37a68a06ab Update to recommend editable installs for source installs 2021-02-02 16:51:27 +01:00
Adriane Boyd 3a3e4daf60 Update install instructions
* Remove duplicate section about compiling from source
2021-02-02 14:44:15 +01:00
Pengcheng YIN 6fdc33203a
Fix a typo 2021-02-01 17:26:28 -05:00
Ines Montani a59f3fcf5d Make wheel the default format and update docs [ci skip] 2021-02-01 23:18:43 +11:00
Ines Montani 31b842d6ce Update table [ci skip] 2021-02-01 14:17:52 +11:00
Ines Montani 7752f80f39 Update docs [ci skip] 2021-01-31 16:11:24 +11:00
Ines Montani a8a1231ccd Update README and docs [ci skip] 2021-01-31 12:36:04 +11:00
Ines Montani ae07416fda Merge branch 'website/v3-launch' into develop 2021-01-30 20:31:06 +11:00
Ines Montani 2332c4280b Update and use unified --build option 2021-01-30 13:11:36 +11:00
Ines Montani 2609ba4e89 Support building wheel in spacy package 2021-01-30 11:54:02 +11:00
Ines Montani 95e958a229
Merge pull request #6852 from explosion/feature/replace-listeners 2021-01-30 00:58:08 +11:00
Ines Montani 7694f76dd1 Update warning and mention replace_listeners 2021-01-29 23:46:01 +11:00
Adriane Boyd 8b76cb8095 Rephrase transformers PyTorch instructions 2021-01-29 13:36:56 +01:00
Adriane Boyd e3e87e7275 Update transformers install docs
* Recommend installing PyTorch separately
* Add instructions for `sentencepiece`
2021-01-29 13:27:43 +01:00
Ines Montani 99af9e7125 Update documentation 2021-01-29 18:45:48 +11:00
Ines Montani 35d79c0a5d Adjust formatting [ci skip] 2021-01-27 13:31:25 +11:00
Ines Montani 5d79d1af50
Merge pull request #6796 from svlandeg/docs/benchmarks [ci skip] 2021-01-27 13:01:23 +11:00
Ines Montani 1ed7029d47 Update website for v3 launch 2021-01-27 12:39:47 +11:00
Adriane Boyd 61c9f8bf24
Remove transformers model max length section (#6807) 2021-01-25 19:59:34 +08:00
svlandeg 56064faed9 update caption 2021-01-23 00:57:00 +01:00
svlandeg d7c0f40a96 update comment 2021-01-22 18:55:18 +01:00
svlandeg a071279bc7 add speed comparison to docs 2021-01-22 18:46:35 +01:00
svlandeg b132cb3036 update accuracies for new a1 models 2021-01-21 20:24:05 +01:00
Sofie Van Landeghem e680efc7cc
Set annotations in update (#6767)
* bump to 3.0.0rc4

* do set_annotations in component update calls

* update docs and remove set_annotations flag

* fix EL test
2021-01-20 11:49:25 +11:00
Sofie Van Landeghem 57640aa838
warn when frozen components break listener pattern (#6766)
* warn when frozen components break listener pattern

* few notes in the documentation

* update arg name

* formatting

* cleanup

* specify listeners return type
2021-01-20 11:12:35 +11:00
Ines Montani 4a1029a9b6 Add infobox [ci skip] 2021-01-19 19:18:39 +11:00
Sofie Van Landeghem fed8f48965
raise NotImplementedError when noun_chunks iterator is not implemented (#6711)
* raise NotImplementedError when noun_chunks iterator is not implemented

* bring back, fix and document span.noun_chunks

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2021-01-17 19:56:05 +08:00
Adriane Boyd bf0cdae8d4
Add token_splitter component (#6726)
* Add long_token_splitter component

Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.

The `long_token_splitter` splits tokens that are at least
`long_token_length` characters long into smaller tokens of `split_length`
characters.

Notes:

* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.

* Adjust API, add test

* Fix name in factory
2021-01-17 19:54:41 +08:00
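
The splitting rule from the commit above can be sketched on plain strings. This is an illustration only: `split_long_tokens` is a hypothetical helper that works on a list of token texts rather than a `Doc`, and it assumes both lengths are measured in characters; the parameter names `long_token_length` and `split_length` are borrowed from the commit message.

```python
from typing import List


def split_long_tokens(
    tokens: List[str], long_token_length: int, split_length: int
) -> List[str]:
    """Split every token of at least `long_token_length` characters into
    consecutive pieces of at most `split_length` characters."""
    result: List[str] = []
    for token in tokens:
        if len(token) >= long_token_length:
            result.extend(
                token[i : i + split_length]
                for i in range(0, len(token), split_length)
            )
        else:
            result.append(token)
    return result


pieces = split_long_tokens(
    ["see", "https://example.com/a/b"], long_token_length=10, split_length=8
)
```

Short tokens pass through unchanged, and joining the pieces of a split token gives back the original text, so no characters are lost.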
Matthew Honnibal f277bfdf0f
Add SpanGroup and Graph container types to represent arbitrary annotations (#6696)
* Draft out initial Spans data structure

* Initial span group commit

* Basic span group support on Doc

* Basic test for span group

* Compile span_group.pyx

* Draft addition of SpanGroup to DocBin

* Add deserialization for SpanGroup

* Add tests for serializing SpanGroup

* Fix serialization of SpanGroup

* Add EdgeC and GraphC structs

* Add draft Graph data structure

* Compile graph

* More work on Graph

* Update GraphC

* Upd graph

* Fix walk functions

* Let Graph take nodes and edges on construction

* Fix walking and getting

* Add graph tests

* Fix import

* Add module with the SpanGroups dict thingy

* Update test

* Rename 'span_groups' attribute

* Try to fix c++11 compilation

* Fix test

* Update DocBin

* Try to fix compilation

* Try to fix graph

* Improve SpanGroup docstrings

* Add doc.spans to documentation

* Fix serialization

* Tidy up and add docs

* Update docs [ci skip]

* Add SpanGroup.has_overlap

* WIP updated Graph API

* Start testing new Graph API

* Update Graph tests

* Update Graph

* Add docstring

Co-authored-by: Ines Montani <ines@ines.io>
2021-01-14 17:30:41 +11:00
Adriane Boyd a45d89f09a Add initialize.before_init and after_init callbacks
Add `initialize.before_init` and `initialize.after_init` callbacks to
the config. The `initialize.before_init` callback is a place to
implement one-time tokenizer customizations that are then saved with the
model.
2021-01-12 13:07:44 +01:00
Sofie Van Landeghem a612a5ba3f
fix small typos (#6698) 2021-01-08 09:39:47 +01:00
Sofie Van Landeghem 75d9019343
Fix types of Tok2Vec encoding architectures (#6442)
* fix TorchBiLSTMEncoder documentation

* ensure the types of the encoding Tok2vec layers are correct

* update references from v1 to v2 for the new architectures
2021-01-07 16:39:27 +11:00
Sofie Van Landeghem 82ae95267a
Docs for pretrain architectures (#6605)
* document pretraining architectures

* formatting

* bit more info

* small fixes
2021-01-06 16:12:30 +11:00
Sofie Van Landeghem afc5714d32
multi-label textcat component (#6474)
* multi-label textcat component

* formatting

* fix comment

* cleanup

* fix from #6481

* random edit to push the tests

* add explicit error when textcat is called with multi-label gold data

* fix error nr

* small fix
2021-01-06 13:07:14 +11:00
Ines Montani 85ca8c2bdd Merge branch 'master' into develop 2020-12-11 13:44:41 +11:00
Ines Montani fb43a30a71
Merge pull request #6545 from svlandeg/feature/discussions [ci skip] 2020-12-11 10:20:35 +11:00
svlandeg 5afa567767 replace gitter with discussions in 101 2020-12-10 20:17:36 +01:00
Adriane Boyd 27bb75e2a0 Docs and extras updates for v2.3.5
* Update install instructions for updated packages

* Add `cuda110` and `cuda111` extras, remove upper `cupy` pins (only
compatible with `thinc>=7.4.4`)
2020-12-10 15:34:34 +01:00
Ines Montani 513c4e332a
Include custom code via spacy package command (#6531) 2020-12-10 20:36:46 +08:00
Ines Montani 1980203229 Merge branch 'master' into pr/6444 2020-12-09 11:09:40 +11:00
Ines Montani 05a2812ae0 Merge branch 'develop' into pr/6444 2020-12-09 11:04:03 +11:00
Ines Montani 8921364579
Merge pull request #6521 from explosion/feature/config-stdin
Allow reading config from stdin in spacy train
2020-12-08 22:07:43 +11:00
Ines Montani 94a5a9814f Update argument handling and documentation 2020-12-08 20:41:18 +11:00
Ines Montani ef59ce783b Adjust install instructions [ci skip] 2020-12-08 18:06:50 +11:00
Ines Montani d8e01ca931
Merge pull request #6391 from adrianeboyd/docs/install-guide 2020-12-08 07:42:16 +01:00