* Ensure we move the model to eval mode before running evaluation
* Ensure we set the flag appropriately across all stages
* Add test, move hooks logic
* Apply same fix to the validate loop
* Update pytorch_lightning/trainer/trainer.py
* Fix function name
* Fix order, add predict
* Shorten the name
* Fix input dm, drop duplicate on predict start hook call, as it's called in the setup function
* Use hook, remove double call
* Fix some test errors
* checkpoint consolidation
* Update ddp_spawn.py
* Update test_metric_result_integration.py
* Update test_results.py
* Update utils.py
* Update utils.py
* Update test_all_gather_grad.py
* Update test_all_gather_grad.py
* Update test_results.py
* Revert "Update test_results.py"
This reverts commit 9d4a2b891d.
* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"
This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.
* Revert "Update test_all_gather_grad.py"
This reverts commit 0d23d75bc9.
* Revert "Update utils.py"
This reverts commit 70fe5da9c6.
* Revert "Update utils.py"
This reverts commit a9aae99f6e.
* Revert "Update test_results.py"
This reverts commit ea74906878.
* Revert "Update test_metric_result_integration.py"
This reverts commit bf70e431b3.
* Revert "Update ddp_spawn.py"
This reverts commit f17210183b.
* Revert "checkpoint consolidation"
This reverts commit 536c1323b0.
* Revert "Revert "checkpoint consolidation""
This reverts commit 3a9fde915a.
* Revert "Revert "Revert "checkpoint consolidation"""
This reverts commit 7a369f47e1.
* Revert "Revert "Update ddp_spawn.py""
This reverts commit 8222dc98ea.
* Revert "Revert "Update test_metric_result_integration.py""
This reverts commit 6c095b2370.
* Revert "Revert "Update test_results.py""
This reverts commit 250d0aaaa2.
* Revert "Revert "Update utils.py""
This reverts commit 8651d54d79.
* Revert "Revert "Update test_all_gather_grad.py""
This reverts commit dcdcd29731.
* modify distributed environment to make test pass
* add DDP communication hook
* remove test related setting
* remove more test related setting
* fix ddp comm hook util import issue
* comments
* one more fix for test_custom_plugin
* fix ddp spawn
* fix sgd
* address comments and add tests
* add GPU check, modify test a bit, formatting
* formatting nit
* fix conda 3.7 / 1.7 issue: no torch.distributed.algorithms module
* need at least 1.8.0
* minor fix
* modify changelog
* changelog should link to PR number instead of issue number
* refine the doc for the register_ddp_comm_hook function: explain ddp_comm_wrapper and add hyperparameters for PowerSGD states in the example usage
* move single device check before the call to register_ddp_comm_hook
* formatting
* comments
* typo
* pre-commit formatting
* Add test for symlink support and initial fix
* Respond to comment and add docstring
* Update CHANGELOG.md
* Simplify
* Update pytorch_lightning/utilities/cloud_io.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Make `LightningLocalFileSystem` protected
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Add context to call hook to handle all modules defined within the hook
* Expose some additional parameters
* Added docs, exposed parameters
* Make sure we only configure if necessary
* Setup activation checkpointing regardless, saves the user having to do it manually
* Add some tests that fail currently
* update
* update
* update
* add tests
* change docstring
* resolve accumulate_grad_batches
* resolve flake8
* Update DeepSpeed to use latest version, add some comments
* add metrics
* update
* Small formatting fixes, clean up some code
* Few cleanups
* No need for default state
* Fix tests, add some boilerplate that should move eventually
* Add hook removal
* Add a context manager to handle hook
* Small naming cleanup
* wip
* move save_checkpoint responsibility to accelerator
* resolve flake8
* add BC
* Change recommended scale to 16
* resolve flake8
* update test
* update install
* update
* update test
* update
* update
* update test
* resolve flake8
* update
* update
* update on comments
* Push
* pull
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update
* Apply suggestions from code review
* Swap to using world size defined by plugin
* update
* update todo
* Remove deepspeed from extra, keep it in the base cuda docker install
* Push
* pull
* update
* update
* update
* update
* Minor changes
* duplicate
* format
* format2
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
* Add base hook for model parallel
* fix callback signature
* Simplify hook
* Add hook logic
* add tests
* add property setter
* add logic for being called once
* Update changelog
* Fix
* fix return type
* fix lambda callback test
* Fix tests
* Apply code suggestions
* add logic for setup_optimizers_predispatch
* add common dummy model
* Swap call order
* Remove test that isn't needed anymore
* Update tests
* Add a bit more doc
* Few code review fixes
* Update pytorch_lightning/accelerators/accelerator.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Change hook name
* Fix test
* Test setup hook, refactor names
* Swap call order of callbacks and model initialization
* Change name of context manager
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>