1394 Commits

Author SHA1 Message Date
ananthsub 851f9e3997
Move NaN/Inf detection to a separate utilities file (#6834)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-04-09 01:47:02 +02:00
Sean Naren 742c48e994
[Fix] Ensure we set the eval/train flag correctly on accelerator model (#6877)
* Ensure we move the model to eval mode before running evaluation

* Ensure we set the flag appropriately across all stages

* Add test, move hooks logic

* Apply same fix to the validate loop

* Update pytorch_lightning/trainer/trainer.py

* Fix function name

* Fix order, add predict

* Shorten the name

* Fix input dm, drop duplicate on predict start hook call, as it's called in the setup function

* Use hook, remove double call
2021-04-08 14:04:26 -04:00
Ethan Harris 1c2ecbf70c
TPUSpawn + IterableDataset error message (#6875)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-08 19:57:48 +05:30
scart97 eb15abcd82
Fix finetuning so complex models unfreeze correctly (#6880)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-08 12:59:06 +05:30
Kaushik B 9fbe724b2b
Update Changelog for v1.2.7 (#6874)
* Update Changelog for v1.2.7

* legacy

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-04-07 22:58:41 +00:00
shuyingsunshine21 313e81638d
Supporting Adding DDP Communication Hooks (#6736)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* add DDP communication hook

* remove test related setting

* remove more test related setting

* fix ddp comm hook util import issue

* comments

* one more fix for test_custom_plugin

* fix ddp spawn

* fix sgd

* address comments and add tests

* 1. add is gpu checking 2. modify test a bit 3. formatting

* formatting nit

* fix conda 3.7 1.7 issue for no torch.distributed.algorithms module

* need at least 1.8.0

* minor fix

* modify changelog

* changelog should link to PR number instead of issue number

* refine the doc for the register_ddp_comm_hook function a bit, e.g. the ddp_comm_wrapper explanation, and add a hyperparameter for power sgd states in the example usage

* move single device checking before call register_ddp_comm_hook

* formatting

* comments

* typo

* pre-commit formatting
2021-04-07 12:35:57 +01:00
ananthsub 86e1d9f759
[fix] Better support for rank_zero_only setting for SLURM and torchelastic (#6802)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-07 12:25:13 +01:00
Adrian Wälchli b7a22ba046
CI: fixture for global rank variable reset (#6839) 2021-04-06 09:37:17 -07:00
Anthony Kim 7f6154fcad
Add `Trainer(gradient_clip_algorithm='value'|'norm')` (#6123)
* add changelog

* add clip by value

* fix bug in training tricks.rst

* fix bug in trainer.rst

* Update trainer.rst

* Update trainer.rst

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/precision/deepspeed_precision.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/utilities/enums.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* yapf formatting

* update training tricks

* update based on comment

* update based on comment

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* update based on comment

* pep8

* mypy

* mypy

* Update docs/source/advanced/training_tricks.rst

Co-authored-by: thomas chaton <thomas@grid.ai>

* Update sharded_native_amp.py

* Update test_sharded_parity.py

* update test codes

* Update test_tpu.py

* Update pytorch_lightning/trainer/connectors/training_trick_connector.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update test_trainer.py

* Update enums.py

* Update enums.py

* add super-class initialization to precision plugins.

* add clip_grad horovod cpu test

* add clip_grad horovod cpu test

* use subprocess check_call

* change order of horovod tests

* set max_epochs 2 in horovod test

* remove clip_grad_val test from horovod-cpu

* remove "type: ignore"

* divide clip grad val test in horovod

* update based on comments

* add super-class initialization to precision plugins.

* bugfix

* bugfix

* revert some changes

* revert some changes

* Update tests/models/test_horovod.py

* merge master

* Delete signature test

No point in testing a signature

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-04-06 08:27:37 -05:00
Mauricio Villegas b7f3a3c421
Simple reproducibility with minimum boilerplate CLI training with `LightningCLI` (#4492)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-06 14:19:11 +01:00
Adrian Wälchli 127c52af74
Fix EarlyStopping logic when min_epochs not met (#6705)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-06 12:41:07 +01:00
Ethan Harris 89b5326ca5
Fix support for symlink save_dir in TensorBoardLogger (#6730)
* Add test for symlink support and initial fix

* Respond to comment and add docstring

* Update CHANGELOG.md

* Simplify

* Update pytorch_lightning/utilities/cloud_io.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Make `LightningLocalFileSystem` protected

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-06 11:36:25 +02:00
Kaushik B cf8e828559
[Fix] TPU Training Type Plugin (#6816) 2021-04-06 15:02:44 +05:30
Michael Baumgartner 6dc1078822
Enforce an epoch scheduler interval when using SWA (#6588)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-06 02:57:33 +00:00
Karthik Prasad c3da7f50bb
Sanitize `None` params during pruning (#6836)
* sanitize none params during pruning

* amend
2021-04-06 01:47:59 +02:00
Adrian Wälchli 264aa689de
fix boolean check on iterable dataset when len not defined (#6828)
* fix iterable dataset len check

* update predict and validate

* add validate to test

* add changelog

* add predict
2021-04-05 17:47:21 +01:00
Yuan-Hang Zhang 1bd5f36a5b
Fix validation progress counter with check_val_every_n_epoch > 1 (#5952)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-02 17:40:41 +09:00
Kaushik B a72a7992a2
Update clip gradients signature for precision plugins (#6764) 2021-03-31 17:06:48 +05:30
Carlos Mocholí 495c385a54
Add 1.2.6 section to CHANGELOG (#6732)
* Add 1.2.6 sections to CHANGELOG

* Update CHANGELOG.md

* legacy

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 18:25:22 -07:00
Carlos Mocholí 0dd2deebea
Remove legacy support for the magic `log`/`progress_bar` keys in dict returns (#6734) 2021-03-31 00:28:04 +02:00
thomas chaton 1302766f83
DeepSpeed ZeRO Update (#6546)
* Add context to call hook to handle all modules defined within the hook

* Expose some additional parameters

* Added docs, exposed parameters

* Make sure we only configure if necessary

* Setup activation checkpointing regardless, saves the user having to do it manually

* Add some tests that fail currently

* update

* update

* update

* add tests

* change docstring

* resolve accumulate_grad_batches

* resolve flake8

* Update DeepSpeed to use latest version, add some comments

* add metrics

* update

* Small formatting fixes, clean up some code

* Few cleanups

* No need for default state

* Fix tests, add some boilerplate that should move eventually

* Add hook removal

* Add a context manager to handle hook

* Small naming cleanup

* wip

* move save_checkpoint responsibility to accelerator

* resolve flake8

* add BC

* Change recommended scale to 16

* resolve flake8

* update test

* update install

* update

* update test

* update

* update

* update test

* resolve flake8

* update

* update

* update on comments

* Push

* pull

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* Apply suggestions from code review

* Swap to using world size defined by plugin

* update

* update todo

* Remove deepspeed from extra, keep it in the base cuda docker install

* Push

* pull

* update

* update

* update

* update

* Minor changes

* duplicate

* format

* format2

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 13:39:02 -04:00
Jirka Borovec 583fcf281c
update chlog v1.2.5 (#6742)
* update chlog v1.2.5

* legacy
2021-03-30 12:45:07 +02:00
Carlos Mocholí 90444706b2
Remove logger_connector legacy code (#6733) 2021-03-30 12:33:33 +02:00
Kaushik B f79a13e495
[Model Parallel] Add configure sharded model hook (#6679)
* Add base hook for model parallel

* fix callback signature

* Simplify hook

* Add hook logic

* add tests

* add property setter

* add logic for being called once

* Update changelog

* Fix

* fix return type

* fix lambda callback test

* Fix tests

* Apply code suggestions

* add logic for setup_optimizers_predispatch

* add common dummy model

* Swap call order

* Remove test that isn't needed anymore

* Update tests

* Add a bit more doc

* Few code review fixes

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Change hook name

* Fix test

* Test setup hook, refactor names

* Swap call order of callbacks and model initialization

* Change name of context manager

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-29 14:50:51 -06:00
thomas chaton 3a4c4246ee
[TPU] update is_tpu_exists utils internal logic to rely on xmp.spawn (#6719)
* update_logic

* update

* Update tests/utilities/test_xla_device_utils.py

* Update pytorch_lightning/utilities/xla_device.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Update pytorch_lightning/utilities/xla_device.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* update test

* Update tests/utilities/test_xla_device_utils.py

* update

* Apply fix

* Docstring

* flake8

* update

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-03-29 18:59:20 +01:00
Jirka Borovec 5b5a5cc80b
support python 3.9 (#4944)
* support python 3.9

* update CI

* onnxruntime

* .

* .

* onnxruntime

* t 55

* t 75

* add script

* use

* onnx

* onnx

* onnx

* whl

* np

* find

* 21

* Apply suggestions from code review

* Apply suggestions from code review

* onnx

* CI

* req

* ~ dockers

* min

* .

* drop horovod

* drop horovod

* drop horovod

* fix

* fix

* .
2021-03-29 12:20:13 -04:00
Łukasz Zalewski cca0eca5f3
More explicit exception message when testing with fast_dev_run=True (#6667)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-29 13:29:54 +00:00
Carlos Mocholí f0c5479de9
Remove legacy `Result` parameters (#6016) 2021-03-28 11:55:08 +02:00
thomas chaton 0e45220263
[warning] Add warning when values are not being reduced (#6417)
* add warning non reduced

* add test

* update test

* update changelog

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* update

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-03-26 18:33:11 +00:00
Carlos Mocholí 21fc5eb21e
Automatically find and run special tests (#6669) 2021-03-26 17:04:59 +00:00
Carlos Mocholí bc613611e2
Do not add return dict items to callback_metrics (#6682) 2021-03-26 14:05:20 +01:00
Ethan Harris 6b990f3fa5
Add artifact_location arg to MLFlow logger (#6677)
* Add artifact_location arg to MLFlow logger

* Add CHANGELOG URL

* Update test
2021-03-26 00:12:03 +01:00
Jirka Borovec 217c12a4e7
Simplify deprecations (#6620)
* use external deprecate

* simplify

* simplify

* simplify

* flake8

* .

* others

* .
2021-03-25 15:26:38 +01:00
Rohit Gupta 9be092dbdb
Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test (#6498)
* update docs

* add hook and update docs

* update tests

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-25 14:20:49 +01:00
ananthsub 40976e4eba
Support teardown hook on DataModule (#4673)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2021-03-25 07:51:55 -05:00
Kaushik B 2cbdc01256
Fix checkpoint callback & Trainer.test(_) issue for TPUs (#6654)
* Fix checkpoint callback issue for TPUs

* update changelog

* add barrier

* apply code suggestions

* update trainer test

* remove spaces

* fix tpu tests

* Apply suggestions from code review

* add comment

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-25 10:37:37 +00:00
Shengyao Zhuang b8ef52baa1
Match the number of outputs of backward with forward for AllGatherGrad (#6625) 2021-03-25 15:07:58 +05:30
Carlos Mocholí 2dd6f9e09d
`MetricsHolder` clean-up + typing (#6645)
* Metrics holder cleanup and better error message

* Update pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py

* _VALUE -> _METRIC_TYPE
2021-03-24 20:34:46 +01:00
Ethan Harris d02fe342c1
Feature/double precision (#6595)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-03-24 15:47:58 +05:30
Jirka Borovec 70beddfc13
Prune metrics: others 11/DoNe (#6659)
* classif

* grad_img

* nlp

* ssl

* format
2021-03-24 09:16:28 +01:00
Ethan Harris 741c452551
Fix disabled grads after call to predict (#6657) 2021-03-23 23:07:48 +01:00
Jirka Borovec 64d0fa4472
update coverage config (#6524)
* update coverage config

* parallel

* parallel

* Apply suggestions from code review

* Apply suggestions from code review

* parallel

* parallel

* parallel

* combine

* combine

* .

* ..

* ..

* ..

* rev

* cb

* cb

* drop

* drop

* .

* ..

* ...

* ...

* ...

* .
2021-03-23 23:05:04 +01:00
thomas chaton fd5cb7fcc3
Add PyTorch 1.8 Profiler 5/5 (#6618)
* Refactor profilers

* Update PassThrough

* WIP - This is broken and will change

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* resolve tests

* resolve tests

* find output

* try something

* update

* add support for test and predict

* update

* update

* use getattr

* test

* test

* update

* tests

* update

* update

* update

* update

* update

* remove file

* update

* update

* update

* update

* update

* test

* update

* update

* update tests

* update

* add support for 1.8

* rename records

* add support for 1.8

* update

* resolve flake8

* resolve test

* Refactor basic profilers

* Fixes

* Unused import

* Introduce setup

* Profile on all ranks. Print to stdout on 0

* Introduce dirpath + filename

* CHANGELOG

* Add tests. Address comments

* add `on_run_stage_setup`

* add on_run_stage_setup function

* update

* add test for RegisterRecordFunction

* update lightning flow direction

* move variable to private

* remove trace

* Undo code that should be in 3/4

* Multi-stage multi-rank

* 2/5 changes

* Pass stage in __del__

* Remove TODOs

* Describe on_evaluation_end. Add tests

* Typo

* Address comments

* deepcopy tests

* Advanced teardown

* Fix teardown test

* Fix tests

* Minor change

* Update CHANGELOG.md

* Fix test

* Quick fixes

* Fix 6522

* resolve ddp tests

* resolve tests

* resolve some tests

* update tests

* resolve tests

* update

* resolve tests

* resolve some tests

* Missed fixes from 3/5

* Fixes

* resolve some tests

* resolve test for 1.7.1

* Broken refactor

* Missed stage

* Minor changes

* resolve tests

* Update CHANGELOG

* resolve bug

* remove print

* Typo

* Cleanup

* resolve ddp test

* remove barrier

* update profiler

* update

* Smaller model

* update

* resolve tests

* update

* Minor changes. CHANGELOG

* Minimize diff

* update to 1.8.1

* RunIf. Extra code. Check segfault

* resolve tests

* Typo. Bad merge

* Fixing a bad merge

* replace for kineto

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Minor changes

* Bad merge

* Use lists for flexibility

* Use sets

* predict_step

* Ananth's suggestion

* update

* Docs

* Update pl_examples/basic_examples/profiler_example.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update example

* update example

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-23 20:43:21 +00:00
Carlos Mocholí 51b10f78f4
Refactor PyTorch profiler 4/5 (#6349)
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-03-23 18:13:29 +01:00
thomas chaton 0995d30fab
Flash predict step (#6577)
* add predict_step

* Update predict_loop.py

* Update trainer.py

* Update trainer.py

* resolve bugs

* update

* update

* update

* resolve bug

* resolve some failing tests

* update tests

* update

* resolve tests

* add a test

* remove typo

* add a test for attachment

* update

* changed to on_train_dataloader

* remove __flash_special_attr__

* resolve tests

* update

* update

* update

* update on comments

* Update pytorch_lightning/trainer/data_loading.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-23 11:13:13 -04:00
Jirka Borovec a74909affa
prune metrics: info retrieval (#6649) 2021-03-23 15:05:32 +00:00
Carlos Mocholí 36d180e532
Refactor base profilers 3/5 (#6621)
Co-authored-by: tchaton <thomas@grid.ai>
2021-03-23 10:07:35 +00:00
Jirka Borovec f93414d085
Prune metrics: regression 9/n (#6637)
* psnr

* r2score

* ssim

* chlog
2021-03-23 10:01:25 +00:00
Jirka Borovec efce2b7777
Prune metrics: regression 8/n (#6636)
* explained_variance

* tests

* mean_absolute_error

* mean_squared_error

* mean_relative_error

* mean_squared_log_error

* chlog
2021-03-23 09:35:51 +01:00
thomas chaton 2064ece582
[refactor] Add setup to profilers + _run_stage_setup to trainer 2/5 (#6633)
* add setup

* update

* updates on comment

* Minor changes

* Extra import

* Docs

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-03-22 14:32:31 -04:00