Commit Graph

885 Commits

Author SHA1 Message Date
Adrian Wälchli b42efa7d86
support launching Lightning ddp with traditional command (#7480)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-07-14 11:25:36 +00:00
Carlos Mocholí 6ce77a102b
Set minimum PyTorch version to 1.6 (#8288)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-07-13 17:12:49 +00:00
Carlos Mocholí 321689f52e
Add `ModelCheckpoint(save_on_train_epoch_end)` (#8389)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-13 14:47:59 +00:00
Luis Perez 000fbe63d3
Expose `extract_batch_size` method and add corresponding tests. (#8357)
* expose extract_batch and make public

* first pass

* early return

* add changelog

* move to utilities/data.py

* add test_data.py

* tests are passing

* precommit hook

* address pep8 failure

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-07-13 11:35:10 +00:00
Kaushik B 9d5ad7639c
Add logger flag to save_hyperparameters (#7960)
* Add log flag to save_hyperparameters

* FIx setter

* Add test & Update changelog

* Address comments

* Fix conflicts

* Update trainer

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix datamodule hparams fix

* Fix datamodule hparams fix

* Update test with patch

* Update pytorch_lightning/utilities/hparams_mixin.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Move log_hyperparams to mixin

* Update hparams mixin

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-13 11:36:36 +02:00
Carlos Mocholí 733cdbb9ad
`every_n_val_epochs` -> `every_n_epochs` (#8383) 2021-07-13 01:20:20 +02:00
thomas chaton 370fa67004
[Refactor] Improve loops API 1/n (#8334)
* resolve issues

* update

* update

* update

* add more exceptions

* resolve bug

* update

* update

* update changelog

* resolve bug

* resolve comments

* update

* update

* update changelog

* update

* update

* remove space

* update

* re-order protected trainer attr

* move public method up

* add docs to state dict methods

* combine __load with load_state_dict

* rename shadowed variable

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move changelog entry to refactor section

* refactor loop_progress property for test helper function

* update trainer setter docstring

* Update CHANGELOG.md

* Update pytorch_lightning/loops/base.py

* remove trainer check

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-07-12 11:13:50 +00:00
Kaushik B 825c5dbe8c
Add support for (accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto') (#7808)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-07-09 15:28:54 +00:00
Tilman Krokotsch 09ff295177
Hyperparameters for datamodule (#3792)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Tilman Krokotsch <tilman.krokotsch@iav.de>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ethan Harris <ethanwharris@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
2021-07-09 15:10:00 +00:00
Andrew Tritt 3102922647
Add LSF support (#5102)
* add ClusterEnvironment for LSF systems

* update init file

* add available cluster environments

* clean up LSFEnvironment

* add ddp_hpc as a distributed backend

* clean up SLURMEnvironment

* remove extra blank line

* init device for DDPHPCAccelerator

We need to do this so we don't send the model to the same device from multiple ranks

* committing current state

* add additional methods to ClusterEnvironments

* add NVIDIA mixin for setting up CUDA envars

* remove troubleshooting prints

* cleanup SLURMEnvironment

* fix docstring

* cleanup TorchElasticEnvironment and add documentation

* PEP8 puts a cork in it

* add set_ranks_to_trainer

* remove unused import

* move to new location

* update LSF environment

* remove mixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changelog

* reset slurm env

* add tests

* add licence

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test node_rank

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add lsf env to docs

* add auto detection for lsf environment

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix is_using_lsf() and test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-09 16:14:26 +02:00
Dusan Drevicky 1b06edf2f2
Add the `on_before_optimizer_step` hook (#8048)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-07-09 13:30:52 +02:00
thomas chaton 1c825a2a9c
Add the `on_before_backward` hook (#7865)
* Add callback to hook tests and add predict test

* Fix lambda callback test

* Simplify lambda call test

* Use LambdaCallback

* Dynamically append to called for the model

* Remove print

* Consistency

* Consistency

* Prepare args/kwargs testing

* yapf doesn't like dict literals

* Add arguments for fit no val test

* Add arguments for fit no val test

* add before_backward_hook

* add test

* resolve flake8

* resolve tests

* update changelog

* add on_before_backward to LightningModule

* update on comments

* Test arguments

* Datamodule refactor

* Fix eval test

* remove extra file

* resolve bug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move to hooks

* update

* resolve flake8

* update on comments

* Update full fit + val test

* Update test

* Remove FIXME

* Remove FIXME

* Undo change

* Fix

* Parametrize fit hook test

* Comment

* Parametrize fit hook test with different precision plugins

* Fix tests

* Parametrize fit hook test with manual optimization

* Unnecessary parenthesis

* WIP

* Comments

* Fix message

* Test CI error

* Revert "Test CI error"

This reverts commit 39c4a85a83.

* Add ddp training type teardown

* Update CHANGELOG

* Adrian's fix

* Use destructor

* Update CHANGELOG.md

* RPC destructor

* Update pytorch_lightning/plugins/training_type/ddp.py

* Why do you not work :(

* Missing condition

* Fix deepspeed test

* GC collect in conftest

* Do not show warnings for special tests

* Needs to run on 1.8

To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8"

* Run torch 1.8

* Skip test due to 'Python bus error'

* Debug NCCL

* shm size

* Disable warnings for special tests

* Remove NCCL_DEBUG statement

* Try smaller shm size

* Revert "Skip test due to 'Python bus error'"

This reverts commit e0a3e8785d.

* README and adjust versions

* Avoid self.on_gpu call

* empty cache cleanup

* More garbage collection

* Unroll parametrizations

* Do not reuse mock

* Undo changes

* Undo notebooks modification

* resolve test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* delete file

* Undo

* Fix test

* Revert "WIP"

This reverts commit f5828a8c42.

* Rename

* Remove optimizers

* Fix bug with LightningOptimizer

* Add optimizers

* update

* update

* Update CHANGELOG

* On after backward refactor

* Do not call super

* Fixes

* Remove should_accumulate

* pre/post backward refactor

* Call the LM backward hook

* Update tests

* Remove dev debug patch

* Fix test

* Remove optimizer arguments and typing

* Docs fixes

* Fix comment

* Undo changes

* Split manual and auto

* Undo change

* Deepsource

* Remove optimizers

* Undo changes

* Call the hook

* Docs

* Docs

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-09 06:15:57 +00:00
Carlos Mocholí eb6d991218
Refactor plugins backward (#8328) 2021-07-08 16:02:09 +02:00
thomas chaton 7956c6bd4b
[Feat] Add FastForwardSampler 2/n - Fault Tolerant Training (#8307)
* wip

* update

* resolve bug

* wip

* wip

* wip

* resolved tests

* update on comments

* update

* update

* Update pytorch_lightning/utilities/auto_restart.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* update on comments

* Update pytorch_lightning/utilities/auto_restart.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* resolve bug

* update

* move properties to top

* update docs for fast forward sampler

* move public attribute to top

* add missing super call

* update docs for state_dict

* fix merge conflict

* add missing super() call

* move property to top

* update on comments

* update

* resolve bug

* update

* update on comments

* activate coverage for CaptureIterableDataset

* update on comments

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-07 20:21:21 +00:00
Carlos Mocholí 368ac1c622
[CLI] Drop `ArgumentParser` when pickling and save before spawning (#8017)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-07 17:56:13 +00:00
Carlos Mocholí 07d7c37a79
Remove magic monitor support for `ModelCheckpoint` (#8293) 2021-07-07 18:36:19 +01:00
Carlos Mocholí 398eed508f
Fix `self.optimizers()` not returning a single `LightningOptimizer` (#8326) 2021-07-07 18:57:45 +02:00
Sidhant Sundrani 20df24d2a2
Enables reload of dataloaders on every n epochs from every epoch (#5043)
* edit arg to reload_dataloaders_every_n_epoch

* init reload_dataloaders_every_n_epoch

* edit logic to reload dl

* update arg to test datamodule

* update arg test dataloader

* edit reload dl logic in eval loop

* fix var name in reset_train_val_dataloaders

* fix error, use current_epoch attribute

* edit every_n_epoch to every_n_epochs

* edit every_n_epoch to every_n_epochs

* edit every_n_epoch to every_n_epochs

* edit every_n_epoch to every_n_epochs

* edit every_n_epoch to every_n_epochs

* edit every_n_epoch to every_n_epochs

* assert reload_dataloaders_every_n_epochs positive

* assert reload_dataloaders_every_n_epochs positive

* add trainer property should reload dl

* update should reload dl in train loop

* condition on should reload dl in eval loop

* pep8

* fix update should reload dl in train loop

* add test case

* replace assertion with misconfig exception

* remove unused variable

* remove unnecessary checks

* replace to BoringModel

* remove unrequired comment

* deprecate _every_epoch

* add deprecated argument to trainer

* test case for deprecated arg

* remove unrequired assertion in train loop

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* modify misconfig exception for int

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* conv bool to int of depreciated _every_epoch

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* update description of deprecated param

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* update deprecation warning

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* modify argument to int only

* fix deprecated test function name

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* merge tests for reload dls

* add propery should reload dl

* removed and added to trainer property

* use property in train loop

* remove deprecated test

* add deprecated test to new file

* test case for exception

* update test datamodule every_n_epochs

* update trainer docs

* update hooks with every_n_epochs

* edit format if statement

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Update CHANGELOG.md

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* typo in exception

* pytest check only misconfig exception

* remove unnecessary code in test

* remove unnecessary code in deprec test

* added match in test

* typo in comment

* revert to prev, keep only req in context manager

* Apply suggestions from code review

* docs

* rebase

* Apply suggestions from code review

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix import: model_helpers instead of model_utils

* fix, add reload_dataloaders_every_n_epochs argument to data connector

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add required imports

* move deprecated log

* add missing import rank_zero_warn

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update varname in should_reload_dl_epoch

suggestion from code review

* Fix CHANGELOG. Update deprecation versions

* Minor change

* change property name, mark protected

* update property name

* update property name

* Remove deprecated *_loop.py files

* Rename test func

* Update CHANGELOG.md

* use rank_zero_deprecation

* update deprecation message in trainer api docs

* test deprecation with real arg name in message

* fix typo in trainer docs

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-07 13:10:08 +02:00
Kaushik B 2b6edae205
Decouple device parsing logic from Acc connector to Trainer (#8180) 2021-07-07 15:05:26 +05:30
Carlos Mocholí 8fead58273
Add `functools.wraps` support for `is_overridden` (#8296) 2021-07-06 10:40:54 +02:00
Adrian Wälchli f1341a555e
Remove deprecated optimizer argument from `manual_backward` (#8287)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-07-06 08:18:08 +00:00
Carlos Mocholí 7ddcdb26d8
Deprecate `trainer.disable_validation` (#8291) 2021-07-05 16:52:49 +02:00
Carlos Mocholí 441e16f61c
Default `EarlyStopping.check_on_train_epoch_end=True` (#8286) 2021-07-05 15:45:23 +02:00
Adrian Wälchli ced2c94a3e
fix missing call to untoggle_optimizer when accumulating gradients (#8284)
* add fix

* toggle test

* re-structure

* update changelog

* update comment

* remove debugging assertion
2021-07-05 11:59:04 +00:00
Kaushik B 3a8322deda
Add XLAStatsMonitor Callback (#8235) 2021-07-05 17:09:46 +05:30
Carlos Mocholí 3379477242
Connect progress tracking dataclasses to loops (#8244)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-05 13:33:12 +02:00
Adrian Wälchli ea5cfd2005
move batch to device before sending it to hooks (#7378)
* update train step

* test

* x

* limits

* val

* typeo

* x

* x

* step

* min gpus

* run all loops

* x

* limit test

* profiler

* clean up accelerator code

* move files

* rename

* move tests

* changelog

* reorder callbacks and model hooks

* add test description

* replace unneccessary method

* fix chlog

* adjust batch_to_device for DP Plugin

* update tests for dataloader idx

* unused imports

* hook change

* switch None

* clear memory

* change to None

* None

* None

* memory savings

* remove redundant todo

* hack

* cheat

* Revert "cheat"

This reverts commit a8433bd0b4.

* Revert "hack"

This reverts commit 43a6d1edeb.

* update new epoch loop

* remove from old loop code

* update chlog

* update hook test

* changelog

* teardown

* integrate changes in new eval loop

* fix hook calls

* add prediction step

* bad merge

* Revert "bad merge"

This reverts commit 488080863c.

* fix train batch hook test

* rm -rf _notebooks

* update chlog

* release memory

* fix type

* notebooks mess

* debug

* Revert "debug"

This reverts commit eec4ee2f77.

* teardown

* fix teardown bug

* debug

* x

* debug

* Revert "debug"

This reverts commit a6e6101946.

Revert "debug"

This reverts commit 5ddeaec069.

debug


debug


Revert "debug"

This reverts commit 605be746f7daedf265b2c05a1c153ce543394435.

Revert "Revert "debug""

This reverts commit a7612d5410409ed886cfb609457349ecf44cbfa8.

debug


x


x


x


s


tol


x


tol

* Fix changelog

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-05 09:31:39 +01:00
Carlos Mocholí 0e19d16ca6
Move result teardown to loops (#8245)
* Move result teardown to loops

* Update CHANGELOG

* Remove teardown from run

* Move previous teardown to on_run_end

* Add comment

* Merge 8250

* Remove stage set to None where it shouldnt
2021-07-02 14:36:14 +01:00
thomas chaton f3e74abad0
[feat] Add restore to base loop (#8247)
* add loop restart

* update
2021-07-02 13:40:31 +01:00
Adrian Wälchli e7139ab9f7
Support `DDPPlugin` to be used on CPU (#6208)
* Skip test due to 'Python bus error'

* Debug NCCL

* Remove NCCL_DEBUG statement

* Revert "Skip test due to 'Python bus error'"

This reverts commit e0a3e8785d.

* fix

* add test

* changelog

* yapf

* patch os environ

* make a special test

* destroy pg

* debug

* revert

* revert

* problematic test

* skip

* try the fixture

* test

* update sensitive test

* update changelog

* remove comment

* update wrong test

* update test name

* parameterization

* Revert "parameterization"

This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.

* remove conftest

* ignore test

* teardown

* fix merge

* deep speed parameterization

* uncomment test

* update chlog

* update changelog

* split tests

* update test


update test


update test


update test

* update test comments

* unroll test

* unroll test

* unroll test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* increase shm

* sudo

* unroll ipu

* Revert "sudo"

This reverts commit 6cc68c1478.

* Revert "increase shm"

This reverts commit 8c27163483.

* x

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* find guilty test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* POPTORCH_WAIT_FOR_IPU=1

* move test

* redo parameterize for ipu

* de-comment test

* move chlog

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-07-02 12:00:24 +01:00
Adrian Wälchli af52de1198
update changelog after 1.3.8 patch release (#8239)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-07-01 21:49:06 +00:00
Guillaume Tauzin baa7de2d9e
Fix truncated_bptt_steps hiddens detach() and improve docs (#8145)
* Fix truncated_bptt_steps hiddens detach()
* Improve truncated_bptt_docs
* Add missing import
* Improve documentation wordings
* pep8
* detach typo
* Update test
* Implement comments
* parametrize test
* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>

* Remove import

Signed-off-by: Guillaume Tauzin <guillaumetauzin.ut@gmail.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-01 22:16:14 +01:00
ananthsub 8b0aec8565
Deprecate `LightningModule.loaded_optimizer_states_dict` (#8229) 2021-07-01 23:02:29 +02:00
thomas chaton d51b0ae7fc
Add `state_dict` to loops (#8197)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-07-01 15:54:37 +00:00
Palermo 36b893c43e
Add `ModelSummary.max_depth` (#8062)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-01 12:08:16 +02:00
Mauricio Villegas 3c74502919
Add support for optimizers and learning rate schedulers to LightningCLI (#8093)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-01 12:04:11 +02:00
thomas chaton acb6f26006
[Refactor] Remove should_raise_exception (#8202)
Co-authored-by: Ethan Harris <ethanwharris@gmail.com>
2021-06-30 17:02:10 +00:00
Ethan Harris 57dce7244c
Fix double precision casting complex buffers (#8208)
* Fix double precision casting complex buffers

* Update CHANGELOG.md

* Fixes

* Fixes

* Fix

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-06-30 10:57:42 +01:00
Carlos Mocholí 2e537b75e3
Deprecate `DDPPlugin.task_idx` (#8203) 2021-06-30 01:02:55 +02:00
thomas chaton bae08514d1
[refactor] Add should_raise_exception for gpus / tpus utilities (#8194)
* add should_raise

* update changelog

* Update pytorch_lightning/utilities/device_parser.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* add to tpu_cores parser

* add should_raise description

* update on comments

* update changelog

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-06-29 10:00:06 -04:00
Carlos Mocholí 571a810a7c
Improvements and changes to progress tracking dataclasses (#8140)
* Improvements to progress dataclasses

* Update CHANGELOG

* Rename function

* Undo CODEOWNERS update
2021-06-29 13:47:41 +01:00
Justus Schock d6435a5b73
Bugfix/swa iterable dset (#8172)
* add test

* add fix

* Update CHANGELOG.md
2021-06-28 21:18:25 +00:00
Ethan Harris b1d8840fd8
Fix metric attribute lookup (#8181)
* Fix metric attribute lookup

* Update CHANGELOG.md

* Split tests
2021-06-28 20:17:43 +00:00
Adrian Wälchli bf54ac1cad
fix NCCL error with non-consecutive trainer gpus (#8165)
* device ids in barrier


x


x


s


same fix for spawn


fix non-nccl 


x

* add changelog

* get nccl backend

* get backend

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-06-28 22:08:10 +02:00
Kaushik B 2f3c65e57b
XLA Profiler integration (#8014) 2021-06-29 00:58:05 +05:30
thomas chaton c521624a92
[bugfix] Add mechanism to prevent deadlock for DDP on Exception Trigger (#8167)
* add mechanism to prevent deadlock

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve flake8 + update changelog

* update on comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* remove space

* resolve bugs

* overwrite config

* update on comments

* update on comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* update

* update test with comments

* Update pytorch_lightning/plugins/training_type/parallel.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-06-28 19:26:03 +00:00
thomas chaton 1f025789fc
[bugfix] Clean Validation Sanity Checking metrics (#8171)
* resolve logging issue

* update changelog

* remove breakpoint

* resolve bugs

* remove pass
2021-06-28 13:49:56 -04:00
thomas chaton c4492ad6aa
Merge pull request #8174 from PyTorchLightning/bugfix/8159_log_gpu_memory_on_step
[bugfix] Resolve memory not logged when missing metrics
2021-06-28 09:39:17 -04:00
Ethan Harris 2a372e3682
Fix module dict in base finetuning (#8170)
* Fix module dict in base finetuning

* Update CHANGELOG.md
2021-06-28 10:55:32 +00:00
Adrian Wälchli 51ea84222b
resurface lost ddp info message (#8111) 2021-06-27 21:51:15 +02:00