The speedup is achieved by:
- Moving the `where` out of the loop (and replacing it with a `min` for simplicity).
- Replacing the manual sum and `pow` with `torch.norm`. Even though this does some unnecessary work (`torch.norm` also computes the root), it is still a lot faster.
- Preallocating the output, which gives a slight speedup.

Note that calling `.to` for all parameters incurs a small speed penalty (~4 ms in my case) but allows parameters to live on different devices.
Overall this reduces the time spent on gradient clipping from 206 ms to 74 ms for my model (ResNet-50 plus a few additional variables, all on GPU). A sketch of the resulting routine follows.
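For context, here is a minimal sketch of a clipping routine along these lines. The function name and the exact accumulation details are illustrative, not the verbatim change:

```python
import torch

def clip_grad_norm(parameters, max_norm: float, norm_type: float = 2.0) -> torch.Tensor:
    # Only parameters that actually received gradients participate.
    parameters = [p for p in parameters if p.grad is not None]
    if not parameters:
        return torch.tensor(0.0)
    device = parameters[0].grad.device

    # Preallocate the per-parameter norms instead of growing a Python list,
    # and let torch.norm replace the manual sum/pow.
    norms = torch.empty(len(parameters), device=device)
    for i, p in enumerate(parameters):
        # The .to(device) costs a few ms but allows parameters on different devices.
        norms[i] = torch.norm(p.grad.detach(), norm_type).to(device)
    total_norm = torch.norm(norms, norm_type)

    # The per-parameter `where` moves out of the loop: a single coefficient,
    # capped at 1.0 (the `min`), is applied to every gradient.
    clip_coef = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)
    for p in parameters:
        p.grad.detach().mul_(clip_coef.to(p.grad.device))
    return total_norm
```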
* refactor into gpu accelerator
* add state_dict for early stopping
* move best attr after monitor_op defined
* improve early stopping and model checkpoint callbacks
* fix formatting
* fix attr init order
* clean up setting of default_root_dir attr
* logger needs default root dir set first
* reorg trainer init
* remove direct references to checkpoint callback
* more fixes
* more bugfixes
* run callbacks at epoch end
* update tests to use on epoch end
* PR cleanup
* address failing tests
* refactor for homogeneity
* fix merge conflict
* separate tests
* tests for early stopping bug regressions
* small fixes
* revert model checkpoint change
* typo fix
* fix tests
* update train loop
* cannot pass an int as default_save_path
* refactor log message
* fix test case
* appease the linter
* fix some doctests
* move config to callback
* fixes from rebase
* chlog
* docs
* reformat
* formatting
* fix
* fixes from rebase
* add new test for patience
* Update pytorch_lightning/callbacks/model_checkpoint.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/callbacks/model_checkpoint.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update tests/callbacks/test_early_stopping.py
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* fix formatting
* remove enable_early_stop attribute
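For reference, the early-stopping persistence that the commits above point at ("add state_dict for early stopping", "move best attr after monitor_op defined") could look roughly like this; the attribute names are assumptions, not the exact Lightning code:

```python
import torch

class EarlyStopping:
    def __init__(self, monitor: str = "val_loss", patience: int = 3, mode: str = "min"):
        self.monitor = monitor
        self.patience = patience
        self.wait = 0
        self.monitor_op = torch.lt if mode == "min" else torch.gt
        # `best` is derived from monitor_op, so it must be assigned *after*
        # monitor_op is defined (the attr init order fix above).
        self.best = torch.tensor(float("inf") if self.monitor_op is torch.lt else float("-inf"))

    def state_dict(self) -> dict:
        # Persist everything needed to resume early stopping from a checkpoint.
        return {"wait": self.wait, "best": self.best, "patience": self.patience}

    def load_state_dict(self, state: dict) -> None:
        self.wait = state["wait"]
        self.best = state["best"]
        self.patience = state["patience"]

    def on_epoch_end(self, current: torch.Tensor) -> bool:
        # Returns True when training should stop.
        if self.monitor_op(current, self.best):
            self.best = current
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience
```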
* fix test with new epoch indexing
* fix progress bar totals
* fix off-by-one error (see #2289): epoch starts at 0 now
* added missing imports
* fix hpc_save folderpath
* fix formatting
* fix tests
* small fixes from a rebase
* fix
* tmpdir
* wandb
* fix merge conflict
* add back evaluation after training
* test_resume_early_stopping_from_checkpoint TODO
* undo the horovod check
* update changelog
* remove a duplicate test from merge error
* try fix dp_resume test
* add the logger fix from master
* try remove default_root_dir
* try mocking numpy
* try import numpy in docs test
* fix wandb test
* pep 8 fix
* skip if no amp
* don't mock when doctesting
* install extra
* fix the resume ES test
* undo conf.py changes
* revert remove comet pickle from test
* Update CHANGELOG.md
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update weights_loading.rst
* renamed flag
* revert the None check in logger experiment name/version
* add the old comments
* _experiment
* test checkpointing on DDP
* skip the ddp test on windows
* cloudpickle
* renamed flag
* parentheses for clarity
* apply suggestion max epochs
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jeremy Jordan <jtjordan@ncsu.edu>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
* drop train_percent_check
* chlog
* deprecated
* tests
* Apply suggestions from code review
* tests
* hydra support
* tests
* hydra support
* tests
* typo
* Update test_dataloaders.py
* docs
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
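The "drop train_percent_check" commits retire that Trainer flag; a generic sketch of the deprecation-shim pattern, assuming `limit_train_batches` as the replacement name:

```python
import warnings

def _resolve_train_limit(limit_train_batches=1.0, train_percent_check=None):
    # Keep the old flag working for one deprecation cycle while steering
    # users toward the replacement.
    if train_percent_check is not None:
        warnings.warn(
            "`train_percent_check` is deprecated; use `limit_train_batches` instead.",
            DeprecationWarning,
        )
        limit_train_batches = train_percent_check
    return limit_train_batches
```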
* Fix pyright member access errors in training module
* Fix Trainer instantiation error due to inheritance order
* Add GH workflow for pyright
* Fix more pyright errors in trainer module
* Add pyrightconfig and setup python environment in type-check workflow
* Exclude pyrightconfig.json
* suggestions
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
* check for nan values
* test nan detection on loss
* sys.exit
* whitespace
* detect nan and inf values in loss and params
* update
* added documentation
* moved detect nan to training loop, remove flag for print
* blank line
* test
* rename
* deprecate print_nan_grads
* deprecated print_nan_grads
* remove unused imports
* update changelog
* fix line too long
* correct deprecated version
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* raise exception instead of sysexit
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/training_tricks.py
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* Update pytorch_lightning/trainer/training_tricks.py
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* fix test
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
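A minimal sketch of the NaN/Inf detection these commits describe: check both the loss and the parameters, and raise an exception instead of calling sys.exit. The function name and messages are assumptions:

```python
import torch

def detect_nan_tensors(loss: torch.Tensor, model: torch.nn.Module) -> None:
    # A single non-finite loss poisons the whole backward pass.
    if not torch.isfinite(loss).all():
        raise ValueError("The loss returned in `training_step` is nan or inf.")
    # Check parameters too, so the offending weight can be named.
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            raise ValueError(f"Detected nan and/or inf values in `{name}`.")
```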
* Add callback system + associated test
* Add trainer and pl_module args to callback methods
* typing
* typo in docstring
* Switch to on_.*_start()
* fix on_test_start
* fix the mess after rebasing
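Illustratively, the callback interface these commits converge on (on_*_start hooks that receive both the trainer and the LightningModule) might be sketched as follows; the exact hook set here is an assumption:

```python
class Callback:
    # Every hook receives the trainer and the module being trained, so a
    # callback can inspect or mutate either one.
    def on_fit_start(self, trainer, pl_module):
        pass

    def on_train_start(self, trainer, pl_module):
        pass

    def on_test_start(self, trainer, pl_module):
        pass

    def on_epoch_end(self, trainer, pl_module):
        pass


class PrintingCallback(Callback):
    def on_train_start(self, trainer, pl_module):
        print(f"Training {type(pl_module).__name__} is starting")
```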