Commit Graph

18 Commits

Author SHA1 Message Date
shuyingsunshine21 2242423b75
refactor accelerator teardown -> training type plugin teardown (#7579)
2021-05-22 13:19:24 -07:00
Adrian Wälchli a1a655d006
Reduce log output size in special tests (#7481)
2021-05-11 17:36:20 +02:00
Leonard Lausen 98b94b810c
Fix DeepSpeedPlugin with IterableDataset (#7362)
* deepspeed add train_micro_batch_size_per_gpu argument

* Update naming and doc

* Modify to use auto naming convention, add test

* Add iterable tests

* Fix tests, attempt by mocking

* Import correct package

* Fix comparison

* Set as special test

* Remove import

* Add Changelog

Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-07 10:46:03 +01:00
Adrian Wälchli e9fca760ac
Set `DistributedSampler` seed if `seed_everything` was called (#7024)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-19 14:50:31 +01:00
Sean Naren b46cc557ef
[Feat] DeepSpeed single file saving (#6900)
* Add single checkpoint capability

* Fix checkpointing in test, few cleanups

* Add comment

* Change restore logic

* Move vars around, add better explanation, make todo align with DeepSpeed team

* Fix checkpointing

* Remove deepspeed from extra, install in Dockerfile

* push

* pull

* Split to two tests to see if it fixes Deepspeed error

* Add comment
2021-04-12 22:44:09 +00:00
thomas chaton 1302766f83
DeepSpeed ZeRO Update (#6546)
* Add context to call hook to handle all modules defined within the hook

* Expose some additional parameters

* Added docs, exposed parameters

* Make sure we only configure if necessary

* Setup activation checkpointing regardless, saves the user having to do it manually

* Add some tests that fail currently

* update

* update

* update

* add tests

* change docstring

* resolve accumulate_grad_batches

* resolve flake8

* Update DeepSpeed to use latest version, add some comments

* add metrics

* update

* Small formatting fixes, clean up some code

* Few cleanups

* No need for default state

* Fix tests, add some boilerplate that should move eventually

* Add hook removal

* Add a context manager to handle hook

* Small naming cleanup

* wip

* move save_checkpoint responsibility to accelerator

* resolve flake8

* add BC

* Change recommended scale to 16

* resolve flake8

* update test

* update install

* update

* update test

* update

* update

* update test

* resolve flake8

* update

* update

* update on comments

* Push

* pull

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* Apply suggestions from code review

* Swap to using world size defined by plugin

* update

* update todo

* Remove deepspeed from extra, keep it in the base cuda docker install

* Push

* pull

* update

* update

* update

* update

* Minor changes

* duplicate

* format

* format2

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 13:39:02 -04:00
Sean Naren 4e9b453854
[Fix] Move init dist connection into the setup function (#6506)
* Move connection setup into the setup function. Call setup hook after we set up the accelerator

* Added CHANGELOG.md

* fix setup order in callback test

* fix input arguments in test

* Mock distributed function, remove protection to turn into training type hook

* Remove import

* Add missing mock, ensure custom plugin does not create children process

* Skip test on windows

* Update deepspeed to init connection in setup

* Do not initialize distributed module

* Move DeepSpeed tests to special tests since dist communication is being set up

* Special the test to see if this fixes CI

* Delete accelerator connector test to see if it's causing build to fail

* Delete deepspeed test

* Revert "Delete accelerator connector test to see if it's causing build to fail"

This reverts commit edde60b8

* Revert "Delete deepspeed test"

This reverts commit 9d317429

* Reverse hook

* Reverse setup hooks to debug again

* Add todo so I know where I left off

* For single device move in pre_dispatch after setup function

* Add additional model to device hook if any additional parameters have been set

* See if we can enable deepspeed tests

* Revert "See if we can enable deepspeed tests"

This reverts commit b5450def

* See if this hook approach works

* Introduce new granular hooks

* Remove import, fix tpu spawn by moving the function to setup

* Added missing special test

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-18 14:33:39 -07:00
Jirka Borovec b9cf1223b9
missing tests default_root_dir=tmpdir (#6314)
* default_root_dir=tmpdir

* miss
2021-03-04 19:23:12 +00:00
Kaushik B 4157b35062
Add fairscale & deepspeed to skipif 4/n (#6281)
* add fairscale & windows to skipif

* add deepspeed to runif

* fairscale

* deepspeed

* flake8

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-02 19:45:13 +00:00
Jirka Borovec d1a03153f3
Refactor: runif for spec 6/6 (#6307)
* special

* rpc
2021-03-02 18:57:13 +00:00
Jirka Borovec b46d22197d
Refactor: skipif for AMPs 3/n (#6293)
* args

* native

* apex

* isort
2021-03-02 18:13:53 +05:30
Jirka Borovec 0f9134e043
Refactor: skipif for Windows 2/n (#6268)
* win

* isort

* flake8
2021-03-02 09:36:01 +00:00
Jirka Borovec eb815000f6
Refactor: skipif for multi - gpus 1/n (#6266)
* ngpus

* gpu

* isort

* pt

* flake8
2021-03-02 09:03:32 +01:00
Carlos Mocholí 97b4b3ee68
Collapse 2 DeepSpeed tests (#6108)
2021-02-21 21:15:37 +00:00
Sean Naren 432e5637d6
Expose DeepSpeed FP16 parameters due to loss instability (#6115)
* Expose deepspeed config parameters to init function due to instability in parameters

* See if tests can run on normal CI, without special tests

* Add changelog

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-02-21 21:43:11 +01:00
Sean Naren 3b0e4e0b2b
Enable ZeRO tests for CI, fix to/half function calls (#6070)
* Enable ZeRO optimization, and make sure that the lightning module hook is called when we move to half precision

* Added test, update to function
2021-02-21 00:24:44 +00:00
Adrian Wälchli 6cc1a06078
rename accelerator_backend -> accelerator (#6034)
* rename accelerator backend

* rename new additions from master

* add proper deprecation

* pep8

* warning match

* add missing warning type
2021-02-18 15:54:12 +00:00
Sean Naren 7189d673f6
DeepSpeed Integration (#5954)
* Add initial deepspeed changes

* Address code review

* Move static method outside of function

* Fixes

* Add missing annotation

* Remove seed setting

* Doc changes

* Doc changes, add address reviews

* Fix docs

* Try fixing issue by moving to torch adam

* Clean up check

* Changes, better APIs!

* Add wrapper, swap to git install revision

* Add special test

* Add warning

* Address review

* Add better disclaimer

* Turn off ZeRO for testing due to compilation

* Add description on modifying parameters via the plugin

* Doc strings clear

* Small doc fixes

* Fix hash, reduce test

* Added CI change

* Move to azure pipeline

* Fix test name

* Add missing flag

* Remove sudo...

* Try conda instead

* Swap to conda base

* Try suggested install

* Apply suggestions from code review

* Apply suggestions from code review

* Revert "Apply suggestions from code review"

This reverts commit 41cca05a

* Revert "Apply suggestions from code review"

This reverts commit e06ec29e

* Remove setter

* Address most review

* Move out function, remove DeepSpeed from requirements

* Install deepspeed/mpi4py within container

* Use special tests, move to master commit for deepspeed

* Export path

* Force compile to happen first

* Remove!

* Debugging ninja

* Fix error in optimizer step logic

* Attempt to fix symbolic link

* Reverse to aid debugging

* Export path again

* Clean up mess

* var

* Revert "var"

This reverts commit 3450eaca

* Address review, add todo

* Add note about unsupported functionality

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-17 15:23:42 -05:00