Commit Graph

81 Commits

Author SHA1 Message Date
Sean Naren bac8b1be81
Add support for CPU AMP autocast (#9084) 2021-08-25 12:18:00 +00:00
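
For reference, the mechanism behind this change is PyTorch's CPU autocast context. A minimal plain-torch sketch (assuming a PyTorch build that ships `torch.autocast`, i.e. 1.10+), not the Lightning plugin itself:

```python
import torch

model = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)

# Run the forward pass under bfloat16 autocast on CPU; the native AMP plugin
# wraps the training/eval step in a context of this kind when the Trainer
# runs on the CPU accelerator with mixed precision enabled.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16 for autocast-eligible ops such as nn.Linear
```
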
Sean Naren e9f4bffe0a
Add validate logic for precision (#9080) 2021-08-24 20:00:09 +00:00
ananthsub 1e4d8929fb
Simplify checkpoint connector loading after Checkpoint IO plugin's introduction (#9045)
* Simplify checkpoint connector loading after Checkpoint IO plugin's introduction

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-08-23 13:12:18 -07:00
Carlos Mocholí 93ab24d1ee
Replace DataLoader sampler once for IPUs (#8858) 2021-08-16 11:28:05 +02:00
Sean Naren b2973a035e
Introduce CheckpointIO Plugin (#8743) 2021-08-13 17:35:31 +01:00
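
For context, the plugin introduced here separates how checkpoints are serialized from the training type plugin that decides when and where to write them; custom behaviour comes from subclassing `CheckpointIO` and overriding `save_checkpoint`/`load_checkpoint`. A minimal sketch using the built-in torch-based implementation, assuming it is handed to the Trainer through `plugins` as in this release:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import TorchCheckpointIO

# TorchCheckpointIO is the default torch.save/torch.load implementation;
# a custom CheckpointIO subclass can be swapped in the same way.
trainer = Trainer(plugins=[TorchCheckpointIO()])
```
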
Carlos Mocholí ed13040729
Connect the model to the training type plugin at the start of run (#8536) 2021-08-04 17:43:34 +02:00
samlurye f90849cc95
Deprecate LightningModule.summarize() in favor of pl.utilities.model_summary.summarize() (#8513)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-08-03 22:08:51 +00:00
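
The replacement named in the title lives in `pytorch_lightning.utilities.model_summary`; a minimal sketch:

```python
import torch.nn as nn
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.model_summary import summarize


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)


# replaces the deprecated LitModel().summarize()
print(summarize(LitModel()))
```
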
Kaushik B d01d8334b5
Fix `ddp` accelerator choice for CPU (#8645)
* Fix ddp accelerator choice for CPU
2021-08-02 21:24:07 +00:00
Kaushik B 850416f0a0
Fix distributed types support for CPUs (#8667) 2021-08-02 16:42:28 +05:30
Sean Naren 07b7dc9c17
[Fix] Add delay property for checkpointing, refactor loading checkpoint (DeepSpeed Checkpointing Fix 1/n) (#8627)
* Add property to delay checkpointing, move loading checkpoint file into the run function to allow deepspeed engine to be loaded

* Add a small test

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Address review

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-30 11:31:08 +01:00
Kaushik B f457f97d97
Raise exception for ddp_cpu not supported for TPUs (#8530) 2021-07-26 13:52:34 +00:00
Carlos Mocholí a64cc37394
Replace `yapf` with `black` (#7783)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 13:37:35 +02:00
Kaushik B 5452590872
fix: Enable manual optimization for TPUs (#8458) 2021-07-22 15:33:35 +05:30
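
Manual optimization, which this fix enables on TPUs, is driven entirely from the LightningModule and looks the same on any accelerator. A minimal sketch:

```python
import torch
import torch.nn as nn
from pytorch_lightning import LightningModule


class ManualOptModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt in to manual optimization
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.layer(batch).sum()
        opt.zero_grad()
        self.manual_backward(loss)  # replaces loss.backward() so precision/plugins still apply
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```
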
Kaushik B 556879e5cf
Add support for devices flag to Trainer (#8440)
* Support devices flag to Trainer
2021-07-20 04:33:12 +00:00
Kaushik B 825c5dbe8c
Add support for (accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto') (#7808)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-07-09 15:28:54 +00:00
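
Together with the `devices` flag added in #8440 above, this gives the accelerator-agnostic spelling of the Trainer arguments; a minimal sketch (values are illustrative):

```python
from pytorch_lightning import Trainer

# "auto" picks the best available backend; "cpu"/"gpu"/"tpu"/"ipu" force one.
# `devices` is interpreted per accelerator (process count, GPU count, TPU cores, ...).
trainer = Trainer(accelerator="auto", devices=2)
```
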
Sean Naren 6d558961e3
[IPU] Allow poptorch.Options to override Trainer (#8233)
* Add test for poptorch Options

* Hacks to get manual plugin support

* Revert changes

* Fix tests + ensure logic follow suit

* Update pytorch_lightning/plugins/training_type/ipu.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Cleaner

* Cleaner

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-05 13:42:00 +00:00
Sean Naren 07b1ce227c
[IPU] Fix Custom Poptorch options to IPUPlugin (#8241)
* Fixes to ensure ipu options are respected

* Better setter

* Add test for poptorch Options

* Fix test

* fix ipu test

* Update pytorch_lightning/plugins/training_type/ipu.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-02 11:23:57 +00:00
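
The override this fixes is passed through the plugin; a rough sketch, assuming the `IPUPlugin(training_opts=..., inference_opts=...)` signature of this release and illustrative poptorch settings:

```python
import poptorch
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import IPUPlugin

# Custom poptorch.Options take precedence over what the Trainer would otherwise
# derive from its own flags (device iterations, replication, ...).
training_opts = poptorch.Options()
training_opts.deviceIterations(20)

inference_opts = poptorch.Options()
inference_opts.deviceIterations(4)

trainer = Trainer(
    ipus=8,
    plugins=IPUPlugin(training_opts=training_opts, inference_opts=inference_opts),
)
```
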
Adrian Wälchli e7139ab9f7
Support `DDPPlugin` to be used on CPU (#6208)
* Skip test due to 'Python bus error'

* Debug NCCL

* Remove NCCL_DEBUG statement

* Revert "Skip test due to 'Python bus error'"

This reverts commit e0a3e8785d.

* fix

* add test

* changelog

* yapf

* patch os environ

* make a special test

* destroy pg

* debug

* revert

* revert

* problematic test

* skip

* try the fixture

* test

* update sensitive test

* update changelog

* remove comment

* update wrong test

* update test name

* parameterization

* Revert "parameterization"

This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.

* remove conftest

* ignore test

* teardown

* fix merge

* deep speed parameterization

* uncomment test

* update chlog

* update changelog

* split tests

* update test


* update test

* update test

* update test

* update test comments

* unroll test

* unroll test

* unroll test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* increase shm

* sudo

* unroll ipu

* Revert "sudo"

This reverts commit 6cc68c1478.

* Revert "increase shm"

This reverts commit 8c27163483.

* x

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* find guilty test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* POPTORCH_WAIT_FOR_IPU=1

* move test

* redo parameterize for ipu

* de-comment test

* move chlog

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-07-02 12:00:24 +01:00
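
A sketch of the combination this enables, assuming the Trainer flags of this release (`accelerator="ddp_cpu"`, `num_processes`; newer versions spell this via `strategy`/`devices`):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# Run true multi-process DDP between CPU processes instead of falling back
# to the spawn-based plugin.
trainer = Trainer(
    accelerator="ddp_cpu",
    num_processes=2,
    plugins=DDPPlugin(find_unused_parameters=False),
)
```
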
nisheethlahoti 06f8349291
Support calling fit and test scripts using "python -m" module syntax with DDP (#8073)
Co-authored-by: Nisheeth Lahoti <nisheeth@rephrase.ai>
2021-06-23 02:42:04 +00:00
Andrew Tritt e808f9fb28
Use DistributedSampler when running with custom accelerator (#7814)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-06-18 14:34:05 +02:00
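
For reference, this is the sampler Lightning now injects when a custom accelerator runs distributed training and no sampler was set; a plain-torch sketch (iterating it requires an initialized process group):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.randn(100, 16))

# Each rank sees a disjoint shard of the dataset; Lightning wires this up
# automatically unless a custom sampler is already attached to the DataLoader.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
```
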
Sean Naren 024cf23c67
Remove convert_to_half, suggest using `model.half` (#7974) 2021-06-14 18:48:02 +01:00
Sean Naren 96433d03ea
IPU Integration 5/5 (#7867)
* Initial changes

* Add broken example for now

* Fix reference

* Fix format

* Code runs

* Fixes

* Clear up files

* Add tests, helpers, fixes

* Small cleanups

* Refactors based on review

* Swap to special tests

* Add special tests

* Add source

* Cleanups

* Add logic to attach/detach model from devices

* Fixes for tests

* Fixes for tests

* Move earlier

* Cleanups

* Add check for nvcc

* Add tests, cleanups

* Fix errors

* fix

* Try condition

* Add missing annotation

* Clearer

* Clearer message

* Fix variable

* Cleanups

* Add comment

* CHANGELOG.md

* Add simple selection test

* Remove special=True to see what happens

* Fix test

* Update tests/accelerators/test_ipu.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Convert ipu_cores -> ipus

* Add typing, fail earlier

* simplify precision

* Add test, add helper

* fix accum

* Update pytorch_lightning/plugins/training_type/ipu.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* Use stages

* Make sure warning message returned

* throw error

* Add more tests, use fs

* add comment

* Clean

* Address feedback, add IPU tests

* Fixes

* Fix signature

* Add types

* Remove autoround

* Add docstring

* ipu_cores -> ipus

* Add test, remove unnecessary precision set

* Add optimizer test

* Add precision back with test

* Address code review

* Change to probs

* Move some of the asserts earlier

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-06-11 15:07:04 +00:00
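
The user-facing result of the integration (and of the `ipu_cores -> ipus` rename in the bullets above) is a single Trainer flag; a minimal sketch with an illustrative core count:

```python
from pytorch_lightning import Trainer

# Select the IPU accelerator by requesting IPUs directly; precision=16 uses
# the IPU's native half-precision support.
trainer = Trainer(ipus=8, precision=16)
```
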
Sean Naren 6388c29e87
[IPU] Add reset dataloader hooks to training type plugin 3/n (#7861)
* Add hooks

* Add tests for hooks

* Add changelog

* Test changes, add typing
2021-06-07 10:37:09 +00:00
Ethan Harris 03bb389b21
Fix double precision + ddp_spawn (#6924)
* Initial fix

* Initial fix

* Initial fix

* Updates

* Updates

* Update typing and docs

* Undo accidental refactor

* Remove unused imports

* Add DDP double precision test

* Remove unused variable

* Update CHANGELOG.md

* Fix test

* Update tests

* Formatting

* Revert bad change

* Add back changes

* Correct wrapping order

* Improve unwrapping

* Correct wrapping order

* Fix... finally

* Respond to comments

* Drop ddp test

* Simplify ddp spawn test

* Simplify ddp spawn test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-06-01 15:21:17 +00:00
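
The combination being fixed, spelled with the flags of this release (`accelerator="ddp_spawn"`; newer versions use `strategy`), as a sketch:

```python
from pytorch_lightning import Trainer

# Full float64 training together with the spawn-based DDP plugin, the case
# this PR fixes (model, inputs and the wrapped forward stay in double precision).
trainer = Trainer(precision=64, accelerator="ddp_spawn", num_processes=2)
```
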
Carlos Mocholí d47173bb72
Use typing forward references (#7770)
* Use typing forward references

* Update pytorch_lightning/core/lightning.py
2021-05-31 09:54:28 +02:00
Carlos Mocholí a69beab499
Clean existing logging tests (#7760)
* Remove dev debugger metric tracking

* Fix tests

* Fix test

* Import

* Clean logging tests

* flake8

* Docstring
2021-05-30 16:36:52 +02:00
Carlos Mocholí bc3238be8c
Remove metric tracking from dev debugger (#7759)
* Remove dev debugger metric tracking

* Fix tests

* Fix test

* Import

* Fix tests

* Fix test

* flake8

* Fix tests
2021-05-30 12:03:42 +02:00
Kaushik B 3f460b150a
Move parameter validation specific to TPU Training plugins (#7415)
* Move parameter validation specific to TPU Training plugins

* update docstring
2021-05-24 16:02:01 +00:00
Adrian Wälchli 502adbced3
refactor optimizer loop logic for manual and automatic optimization (#7526)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-05-17 14:42:01 +02:00
Kaushik B bf46730d92
Support TPU Pod Training (n/n) (#7296) 2021-05-17 11:33:44 +00:00
Nic Eggert f4f51e0dcf
Add kubeflow cluster environment (#7300)
* Add kubeflow cluster environment

* Add KubeflowEnvironment to docs

* Add KubeflowEnvironment to the changelog

* break up a long line

* Add method to detect kubeflow environment

* Select Kubeflow environment when available

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Run pre-commit

* task_idx == 0

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 09:05:24 +01:00
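
The new environment is selected automatically when detection succeeds; it can also be pinned explicitly. A sketch, assuming the cluster environment is passed via `plugins` as other environments are:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import KubeflowEnvironment

# Force the Kubeflow environment instead of relying on auto-detection;
# rank and world size are then read from the variables a PyTorchJob sets.
trainer = Trainer(accelerator="ddp", gpus=1, num_nodes=2, plugins=[KubeflowEnvironment()])
```
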
Alan Du 6ac16ff348
Fix DistribType for `ddp_cpu` (spawn) (#7492) 2021-05-14 20:53:26 +01:00
Jirka Borovec d4ec75164c
Prune deprecated trainer attributes (#7501)
* use_single_gpu

* use_horovod

* use_ddp2

* use_ddp

* use_dp

* on_gpu

* use_tpu

* on_tpu

* on_cpu

* cleaning

* chlog

* Apply suggestions from code review

* Apply suggestions from code review
2021-05-12 20:10:15 +00:00
Ethan Harris 45143fd825
Improve val step logging (#7351)
* Fix val step logging

* Add a type

* Fix

* Update CHANGELOG.md
2021-05-07 22:58:03 +00:00
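
The logging call affected here is `self.log` inside `validation_step`; a minimal sketch of the intended usage:

```python
import torch.nn as nn
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # aggregated over the validation epoch before being logged
        self.log("val_loss", loss, on_step=False, on_epoch=True, prog_bar=True)
```
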
Martin Kristiansen c3fc0313ef
Updating docs and error message: half precision not available on CPU (#7384)
* Updating docs and error message to specify that half precision is not available on CPU

* update messages

Co-authored-by: Martin Kristiansen <martinkristiansen@sixgill.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-06 09:05:50 +00:00
Carlos Mocholí 8c0ea92af2
`TrainerState` refactor [5/5] (#7173)
* `TrainerState` refactor

* flake8

* Update finished check

* Test cleanup

* Fix tests

* Fixes

* Reorder

* flake8

* Update CHANGELOG

* Better docs

* Better docs

* Remove default

* Update tests

* Bad merge
2021-05-04 12:50:56 +02:00
Carlos Mocholí 40f80230fe
Remove `trainer.fit` return value [2/n] (#7237)
* `_fit_impl` refactor and types

* Fix return

* Remove return docstring

* Fixes

* Fixes

* Remove `trainer.fit` return value

* Update CHANGELOG

* flake8

* Undo results change

* Fix test

* Revert changes for a separate PR

* flake8
2021-04-28 19:11:32 +01:00
thomas chaton 5a113a2f05
[bug/feat] Support parameters_to_ignore in DDP (#7239)
* update

* update

* update

* update on comments

* update
2021-04-27 17:49:32 +00:00
Kaushik B f168a535ca
Add MpModelWrapper in TPU Spawn (#7045)
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-20 13:05:27 +00:00
thomas chaton 3cc0b2c063
[test] Add checks for gpus=1 (#7105)
* update

* remove cluster env
2021-04-19 20:39:28 +02:00
Adrian Wälchli e9fca760ac
Set `DistributedSampler` seed if `seed_everything` was called (#7024)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-19 14:50:31 +01:00
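
The behaviour added here ties two existing APIs together; a sketch of the roughly equivalent manual wiring (explicit `num_replicas`/`rank` so it runs without a process group):

```python
import torch
from pytorch_lightning import seed_everything
from torch.utils.data import DistributedSampler, TensorDataset

seed = seed_everything(42)  # returns the seed that was set
dataset = TensorDataset(torch.randn(100, 16))

# After this change, Lightning builds its samplers with that seed, roughly:
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=seed)
```
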
Adrian Wälchli 33cc9fe138
Clean up environment access in plugins (#6941)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-13 20:07:40 +02:00
Kaushik B f79a13e495
[Model Parallel] Add configure sharded model hook (#6679)
* Add base hook for model parallel

* fix callback signature

* Simplify hook

* Add hook logic

* add tests

* add property setter

* add logic for being called once

* Update changelog

* Fix

* fix return type

* fix lambda callback test

* Fix tests

* Apply code suggestions

* add logic for setup_optimizers_predispatch

* add common dummy model

* Swap call order

* Remove test that isn't needed anymore

* Update tests

* Add a bit more doc

* Few code review fixes

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Change hook name

* Fix test

* Test setup hook, refactor names

* Swap call order of callbacks and model initialization

* Change name of context manager

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-29 14:50:51 -06:00
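
The hook added here is overridden on the LightningModule so that large layers can be instantiated once the model-parallel context is active; a minimal sketch:

```python
import torch.nn as nn
from pytorch_lightning import LightningModule


class ShardedModel(LightningModule):
    def configure_sharded_model(self):
        # Called by the accelerator once the sharding context is set up, so
        # these layers can be created directly sharded / on the target device.
        self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
```
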
Carlos Mocholí 21fc5eb21e
Automatically find and run special tests (#6669) 2021-03-26 17:04:59 +00:00
thomas chaton fd5cb7fcc3
Add PyTorch 1.8 Profiler 5/5 (#6618)
* Refactor profilers

* Update PassThrough

* WIP - This is broken and will change

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* resolve tests

* resolve tests

* find output

* try something

* update

* add support for test and predict

* update

* update

* use getattr

* test

* test

* update

* tests

* update

* update

* update

* update

* update

* remove file

* update

* update

* update

* update

* update

* test

* update

* update

* update tests

* update

* add support for 1.8

* rename records

* add support for 1.8

* update

* resolve flake8

* resolve test

* Refactor basic profilers

* Fixes

* Unused import

* Introduce setup

* Profile on all ranks. Print to stdout on 0

* Introduce dirpath + filename

* CHANGELOG

* Add tests. Address comments

* add `on_run_stage_setup`

* add on_run_stage_setup function

* update

* add test for RegisterRecordFunction

* update lightning flow direction

* move variable to private

* remove trace

* Undo code that should be in 3/4

* Multi-stage multi-rank

* 2/5 changes

* Pass stage in __del__

* Remove TODOs

* Describe on_evaluation_end. Add tests

* Typo

* Address comments

* deepcopy tests

* Advanced teardown

* Fix teardown test

* Fix tests

* Minor change

* Update CHANGELOG.md

* Fix test

* Quick fixes

* Fix 6522

* resolve ddp tests

* resolve tests

* resolve some tests

* update tests

* resolve tests

* update

* resolve tests

* resolve some tests

* Missed fixes from 3/5

* Fixes

* resolve some tests

* resolve test for 1.7.1

* Broken refactor

* Missed stage

* Minor changes

* resolve tests

* Update CHANGELOG

* resolve bug

* remove print

* Typo

* Cleanup

* resolve ddp test

* remove barrier

* update profiler

* update

* Smaller model

* update

* resolve tests

* update

* Minor changes. CHANGELOG

* Minimize diff

* update to 1.8.1

* RunIf. Extra code. Check segfault

* resolve tests

* Typo. Bad merge

* Fixing a bad merge

* replace for kineto

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Minor changes

* Bad merge

* Use lists for flexibility

* Use sets

* predict_step

* Ananth's suggestion

* update

* Docs

* Update pl_examples/basic_examples/profiler_example.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update example

* update example

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-23 20:43:21 +00:00
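
The `dirpath` + `filename` arguments mentioned in the bullets are how the refactored profiler is configured; a sketch (the string shortcut `profiler="pytorch"` also works):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import PyTorchProfiler

# Profile on every rank and write one report per rank into ./profiling/.
profiler = PyTorchProfiler(dirpath="profiling", filename="perf_logs")
trainer = Trainer(profiler=profiler)
```
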
Jirka Borovec efce2b7777
Prune metrics: regression 8/n (#6636)
* explained_variance

* tests

* mean_absolute_error

* mean_squared_error

* mean_relative_error

* mean_squared_log_error

* chlog
2021-03-23 09:35:51 +01:00
Sean Naren 58c9fa7edb
Allow training type plugin to delay optimizer creation (FSDP 2/n) (#6331)
* Allow training_type_plugin to delay optimizer configuration

* Add missing references to trainer, add a CPU accelerator based test
2021-03-22 11:43:53 +00:00
Sean Naren 4e9b453854
[Fix] Move init dist connection into the setup function (#6506)
* Move connection setup into the setup function. Call setup hook after we set up the accelerator

* Added CHANGELOG.md

* fix setup order in callback test

* fix input arguments in test

* Mock distributed function, remove protection to turn into training type hook

* Remove import

* Add missing mock, ensure custom plugin does not create children process

* Skip test on windows

* Update deepspeed to init connection in setup

* Do not initialize distributed module

* Move DeepSpeed tests to special tests since dist communication is being set up

* Special the test to see if this fixes CI

* Delete accelerator connector test to see if its causing build to fail

* Delete deepspeed test

* Revert "Delete accelerator connector test to see if its causing build to fail"

This reverts commit edde60b8

* Revert "Delete deepspeed test"

This reverts commit 9d317429

* Reverse hook

* Reverse setup hooks to debug again

* Add todo so I know where I left off

* For single device move in pre_dispatch after setup function

* Add additional model to device hook if any additional parameters have been set

* See if we can enable deepspeed tests

* Revert "See if we can enable deepspeed tests"

This reverts commit b5450def

* See if this hook approach works

* Introduce new granular hooks

* Remove import, fix tpu spawn by moving the function to setup

* Added missing special test

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-18 14:33:39 -07:00
Jirka Borovec b341b53f70
deprecate metrics pkg (#6505)
* deprecate metrics

* examples

* req

* docs

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* pep8

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2021-03-15 14:39:38 +00:00
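
The deprecation points users at the standalone torchmetrics package; a minimal sketch of the replacement import (newer torchmetrics releases additionally require a `task=...` argument on `Accuracy`):

```python
import torch
import torchmetrics

# was: from pytorch_lightning import metrics; metrics.Accuracy()
accuracy = torchmetrics.Accuracy()

preds = torch.tensor([0, 1, 1, 0])
target = torch.tensor([0, 1, 0, 0])
print(accuracy(preds, target))  # tensor(0.7500)
```
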
Rohit Gupta c53edce1a1
Disable batch transfer in DP mode (#6098)
* add exceptions and test

* hook

* fix

* clean up

* clean up

* regex

* regex

* docs

* rev

* comment and docs

* chlog

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Apply suggestions from code review

Co-authored-by: chaton <thomas@grid.ai>

* Monkey-patch device count

* docs

* pep

* api_change

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2021-03-11 10:51:10 -05:00
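
For context, these are batch-transfer hooks that now raise an error when overridden together with DP; a sketch, with signatures as I recall them from this release:

```python
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def on_before_batch_transfer(self, batch, dataloader_idx):
        # runs on the host before the batch is moved; not supported with DP
        return batch

    def on_after_batch_transfer(self, batch, dataloader_idx):
        # runs once the batch is on the target device; not supported with DP
        return batch
```
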