Commit Graph

4672 Commits

Author SHA1 Message Date
ananthsub 851f9e3997
Move NaN/Inf detection to a separate utilities file (#6834)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-04-09 01:47:02 +02:00
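The NaN/Inf detection that this commit factors out into a separate utilities file amounts to scanning named values for non-finite entries. A stdlib-only sketch of the idea (the real utility operates on torch tensors and raises an error naming the offending parameter; `detect_nan_inf` here is a hypothetical stand-in):

```python
import math

def detect_nan_inf(named_values):
    """Return the names of entries whose value is NaN or +/-Inf.

    Pure-Python stand-in for the tensor-based check: the real utility
    raises an error identifying the offending parameter instead.
    """
    bad = []
    for name, value in named_values.items():
        if not math.isfinite(value):
            bad.append(name)
    return bad

# One healthy value, one NaN, one Inf
losses = {"train_loss": 0.25, "val_loss": float("nan"), "grad_norm": float("inf")}
print(detect_nan_inf(losses))  # -> ['val_loss', 'grad_norm']
```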
Luis Perez 90e37ba458
fix exception raising (#6901) 2021-04-08 22:23:26 +02:00
Sean Naren 742c48e994
[Fix] Ensure we set the eval/train flag correctly on accelerator model (#6877)
* Ensure we move the model to eval mode before running evaluation

* Ensure we set the flag appropriately across all stages

* Add test, move hooks logic

* Apply same fix to the validate loop

* Update pytorch_lightning/trainer/trainer.py

* Fix function name

* Fix order, add predict

* Shorten the name

* Fix input dm, drop duplicate on predict start hook call, as it's called in the setup function

* Use hook, remove double call
2021-04-08 14:04:26 -04:00
Jirka Borovec 851fd7fae7
Merge pull request #6885 from PyTorchLightning/v1.3.0rc
prepare v1.3.0rc
2021-04-08 14:01:26 -04:00
Ethan Harris 1c2ecbf70c
TPUSpawn + IterableDataset error message (#6875)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-08 19:57:48 +05:30
Ethan Harris 87f0aeac25
Fix DDP_SPAWN compatibility with bug_report_model.py (#6892) 2021-04-08 19:57:18 +05:30
Oleg 3007872d01
Update mlflow with using resolve_tags (#6746)
* Update mlflow.py

#6745 adds additional info about the run, as in the native API

* Update mlflow.py

trying to fix some backward compatibility issues with `resolve_tags`

* wip on backward compatibility

added a default for `getattr` in case the `registry` object exists, but has no proper attribute (weird case but who knows...)

* fix pep

* import

* fix registry import

* try fix failing tests

removed the first if statement, so that `resolve_tags` would be defined in either case

* fix formatting

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-04-08 10:45:23 +01:00
scart97 eb15abcd82
Fix finetuning so that complex models unfreeze correctly. (#6880)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-08 12:59:06 +05:30
ananthsub 968ac091c0
Remove hardcoding of rank_zero_only.rank in accelerator connector (#6878) 2021-04-08 12:56:59 +05:30
Carlos Mocholí 128f6ab508
Add separators to performance docs (#6882) 2021-04-08 08:22:50 +01:00
sk 01b9cf8fdc
Fix csv extension check (#6436)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-08 01:16:31 +00:00
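A robust CSV extension check compares the split-off extension case-insensitively rather than doing a naive substring or suffix test. A hypothetical sketch of the idea behind this kind of fix (`is_csv` is not Lightning's actual helper name):

```python
import os

def is_csv(path: str) -> bool:
    """Compare the real extension, case-insensitively.

    A naive test like '"csv" in path' would also match 'csvdata.txt',
    and 'path.endswith("csv")' would match 'notacsv'; splitext avoids both.
    """
    return os.path.splitext(path)[1].lower() == ".csv"

print(is_csv("metrics.CSV"))   # True
print(is_csv("notacsv"))       # False
print(is_csv("data.csv.bak"))  # False
```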
Kaushik B 9fbe724b2b
Update Changelog for v1.2.7 (#6874)
* Update Changelog for v1.2.7

* legacy

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-04-07 22:58:41 +00:00
Carlos Mocholí 19e67d18c4
Docs fixes (#6870) 2021-04-07 16:57:22 +01:00
shuyingsunshine21 313e81638d
Support adding DDP communication hooks (#6736)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* add DDP communication hook

* remove test related setting

* remove more test related setting

* fix ddp comm hook util import issue

* comments

* one more fix for test_custom_plugin

* fix ddp spawn

* fix sgd

* address comments and add tests

* 1. add is gpu checking 2. modify test a bit 3. formatting

* formatting nit

* fix conda Python 3.7 / PyTorch 1.7 issue where the torch.distributed.algorithms module does not exist

* need at least 1.8.0

* minor fix

* modify changelog

* changelog should link to PR number instead of issue number

* refine the docs for the register_ddp_comm_hook function, e.g. explain ddp_comm_wrapper, and add a hyperparameter for the PowerSGD state in the example usage

* move the single-device check before calling register_ddp_comm_hook

* formatting

* comments

* typo

* pre-commit formatting
2021-04-07 12:35:57 +01:00
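DDP communication hooks such as the built-in fp16 compression hook cut all-reduce bandwidth by sending gradients in half precision. The compress/decompress round trip at the core of that idea can be illustrated with the stdlib alone, since struct's `'e'` format is IEEE half precision. This shows only the numeric effect, not the `torch.distributed` hook API:

```python
import struct

def fp16_round_trip(grads):
    """Simulate fp16 gradient compression: pack each value into IEEE
    half precision ('e' format, 2 bytes instead of 4) and unpack it,
    as an fp16 communication hook effectively does around the all-reduce."""
    packed = struct.pack(f"{len(grads)}e", *grads)        # compress
    return list(struct.unpack(f"{len(grads)}e", packed))  # decompress

grads = [0.12345, -3.1, 1e-4]
compressed = fp16_round_trip(grads)
# Values survive with reduced precision; the payload is half the fp32 size.
print(compressed)
```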
ananthsub 86e1d9f759
[fix] Better support for rank_zero_only setting for SLURM and torchelastic (#6802)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-04-07 12:25:13 +01:00
Roger Shieh a2c605785a
Update seed_everything() (#6843)
* Update seed.py

* Update pytorch_lightning/utilities/seed.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* Update seed.py

* Update seed.py

* Update seed.py

Co-authored-by: thomas chaton <thomas@grid.ai>
2021-04-07 13:17:48 +02:00
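`seed_everything()` seeds every RNG Lightning touches and records the seed in the `PL_GLOBAL_SEED` environment variable so spawned workers can re-use it. A stdlib-only sketch of that behavior (the real function also seeds numpy and torch and validates the seed range, which is omitted here):

```python
import os
import random

def seed_everything(seed: int = 42) -> int:
    """Seed the stdlib RNG and export the seed so child processes
    (e.g. ddp_spawn workers) can pick it up. The real utility also
    seeds numpy and torch."""
    os.environ["PL_GLOBAL_SEED"] = str(seed)
    random.seed(seed)
    return seed

seed_everything(1234)
first = [random.random() for _ in range(3)]
seed_everything(1234)
assert first == [random.random() for _ in range(3)]  # reproducible draws
```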
Adrian Wälchli b7a22ba046
CI: fixture for global rank variable reset (#6839) 2021-04-06 09:37:17 -07:00
Kaushik B a17c027ea1
Update sync_dist warning for multiple processes (#6790) 2021-04-06 16:57:43 +02:00
Anthony Kim 7f6154fcad
Add `Trainer(gradient_clip_algorithm='value'|'norm')` (#6123)
* add changelog

* add clip by value

* fix bug in training tricks.rst

* fix bug in trainer.rst

* Update trainer.rst

* Update trainer.rst

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/precision/deepspeed_precision.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/utilities/enums.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* yapf formatting

* update training tricks

* update based on comment

* update based on comment

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* update based on comment

* pep8

* mypy

* mypy

* Update docs/source/advanced/training_tricks.rst

Co-authored-by: thomas chaton <thomas@grid.ai>

* Update sharded_native_amp.py

* Update test_sharded_parity.py

* update test codes

* Update test_tpu.py

* Update pytorch_lightning/trainer/connectors/training_trick_connector.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update test_trainer.py

* Update enums.py

* Update enums.py

* add super-class initialization to precision plugins.

* add clip_grad horovod cpu test

* add clip_grad horovod cpu test

* use subprocess check_call

* change order of horovod tests

* set max_epochs 2 in horovod test

* remove clip_grad_val test from horovod-cpu

* remove "type: ignore"

* divide clip grad val test in horovod

* update based on comments

* add super-class initialization to precision plugins.

* bugfix

* bugfix

* revert some changes

* revert some changes

* Update tests/models/test_horovod.py

* merge master

* Delete signature test

No point in testing a signature

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-04-06 08:27:37 -05:00
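The new `gradient_clip_algorithm` option selects between the two standard clipping strategies: clamp each component ("value") or rescale the whole vector ("norm"). In pure-Python form (the real code applies the same math to tensors via `torch.nn.utils`):

```python
import math

def clip_by_value(grads, clip_val):
    """Clamp each gradient component into [-clip_val, clip_val]."""
    return [max(-clip_val, min(clip_val, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [3.0, -4.0]               # L2 norm = 5
print(clip_by_value(grads, 1.0))  # [1.0, -1.0]
print(clip_by_norm(grads, 1.0))   # scaled to norm 1 (approx. [0.6, -0.8])
```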
Mauricio Villegas b7f3a3c421
Simple reproducibility with minimum boilerplate CLI training with `LightningCLI` (#4492)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-06 14:19:11 +01:00
Adrian Wälchli 127c52af74
Fix EarlyStopping logic when min_epochs not met (#6705)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-06 12:41:07 +01:00
Tharindu Hasthika f581411210
Fixed missing arguments in `lr_find` call (#6784)
There seem to be three arguments missing in the `lr_find` call in the tuning.py file.
2021-04-06 11:37:15 +02:00
Ethan Harris 89b5326ca5
Fix support for symlink save_dir in TensorBoardLogger (#6730)
* Add test for symlink support and initial fix

* Respond to comment and add docstring

* Update CHANGELOG.md

* Simplify

* Update pytorch_lightning/utilities/cloud_io.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Make `LightningLocalFileSystem` protected

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-06 11:36:25 +02:00
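The symlink fix hinges on resolving the real path of `save_dir` before scanning for existing version folders, so a symlinked log directory behaves like the directory it points to. A minimal sketch of that resolve step (`resolve_save_dir` is a hypothetical name, not the logger's actual API):

```python
import os
import tempfile

def resolve_save_dir(save_dir: str) -> str:
    """Follow symlinks so version counting happens in the real directory."""
    return os.path.realpath(save_dir)

# Demo: a symlink to a log dir resolves to its target
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "real_logs")
    os.makedirs(target)
    link = os.path.join(root, "logs_link")
    os.symlink(target, link)
    assert resolve_save_dir(link) == os.path.realpath(target)
```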
Kaushik B cf8e828559
[Fix] TPU Training Type Plugin (#6816) 2021-04-06 15:02:44 +05:30
Eugene Khvedchenya eafec7d425
Fix DDP + SyncBN (#6838)
* Fix DDP + SyncBN

Ensure that model is already on correct GPU before applying SyncBN conversion

* Fix order of SyncBN for ddp_spawn
2021-04-06 08:40:29 +01:00
Michael Baumgartner 6dc1078822
Enforce an epoch scheduler interval when using SWA (#6588)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-06 02:57:33 +00:00
Sadiq Jaffer 7f91c5ebbc
Fix: `unfreeze_and_add_param_group` expects `modules` rather than `module` (#6822) 2021-04-06 01:50:42 +02:00
Karthik Prasad c3da7f50bb
Sanitize `None` params during pruning (#6836)
* sanitize none params during pruning

* amend
2021-04-06 01:47:59 +02:00
Adrian Wälchli 264aa689de
fix boolean check on iterable dataset when len not defined (#6828)
* fix iterable dataset len check

* update predict and validate

* add validate to test

* add changelog

* add predict
2021-04-05 17:47:21 +01:00
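Boolean checks on an iterable dataset break when `__len__` is not defined, so the fix probes for a usable length instead of relying on truthiness. A duck-typed sketch of Lightning's `has_len`-style check (class names here are illustrative):

```python
def has_len(dataset) -> bool:
    """Return True only if the dataset exposes a working __len__.

    Iterable-style datasets may omit __len__ or raise from it, so never
    test truthiness (`if dataset:`) directly."""
    try:
        len(dataset)
        return True
    except (TypeError, NotImplementedError):
        return False

class MapStyle:
    def __len__(self):
        return 10

class IterableStyle:
    def __iter__(self):
        return iter(range(10))

print(has_len(MapStyle()))       # True
print(has_len(IterableStyle()))  # False
```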
Kaushik B 22a266d8b8
Update TPU docs for installation (#6794) 2021-04-04 00:19:43 +05:30
ananthsub bb9ace4333
[typing] Add typehint for broadcast in training type plugin (#6777)
* Update training_type_plugin.py

* Update accelerator.py

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2021-04-02 20:55:34 +02:00
Elizaveta Logacheva f8a379830d
Remove extinct parameters from lightning_module.rst (#6801)
Fixes #6800
2021-04-02 20:49:20 +02:00
Yuan-Hang Zhang 1bd5f36a5b
Fix validation progress counter with check_val_every_n_epoch > 1 (#5952)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-02 17:40:41 +09:00
Jirka Borovec 0b843848b6
less IDE complain about unused args (#6786)
* less IDE complain about unused args

* ...
2021-04-01 18:19:00 +02:00
thomas chaton 3e3175d074
resolve bug (#6781) 2021-04-01 11:43:23 +01:00
Kaushik B 13f67ad313
Update logic for checking TPUs availability (#6767)
* Update logic for checking TPUs availability

* fix flake8

* add fix
2021-04-01 03:04:33 +05:30
Kaushik B a72a7992a2
Update clip gradients signature for precision plugins (#6764) 2021-03-31 17:06:48 +05:30
Carlos Mocholí 495c385a54
Add 1.2.6 section to CHANGELOG (#6732)
* Add 1.2.6 sections to CHANGELOG

* Update CHANGELOG.md

* legacy

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 18:25:22 -07:00
Carlos Mocholí 0dd2deebea
Remove legacy support for the magic `log`/`progress_bar` keys in dict returns (#6734) 2021-03-31 00:28:04 +02:00
Sean Naren f9bb7c641a
DeepSpeed ZeRO Docs update (#6752)
* Added base docs

* Add more information

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-30 21:52:02 +00:00
thomas chaton 1302766f83
DeepSpeed ZeRO Update (#6546)
* Add context to call hook to handle all modules defined within the hook

* Expose some additional parameters

* Added docs, exposed parameters

* Make sure we only configure if necessary

* Setup activation checkpointing regardless, saves the user having to do it manually

* Add some tests that fail currently

* update

* update

* update

* add tests

* change docstring

* resolve accumulate_grad_batches

* resolve flake8

* Update DeepSpeed to use latest version, add some comments

* add metrics

* update

* Small formatting fixes, clean up some code

* Few cleanups

* No need for default state

* Fix tests, add some boilerplate that should move eventually

* Add hook removal

* Add a context manager to handle hook

* Small naming cleanup

* wip

* move save_checkpoint responsibility to accelerator

* resolve flake8

* add BC

* Change recommended scale to 16

* resolve flake8

* update test

* update install

* update

* update test

* update

* update

* update test

* resolve flake8

* update

* update

* update on comments

* Push

* pull

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* Apply suggestions from code review

* Swap to using world size defined by plugin

* update

* update todo

* Remove deepspeed from extra, keep it in the base cuda docker install

* Push

* pull

* update

* update

* update

* update

* Minor changes

* duplicate

* format

* format2

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 13:39:02 -04:00
Akihiro Nitta 9876df16a2
[docs] Update Bolts link (#6743)
* Update Bolts link

* Update Bolts link

* format

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-30 22:52:59 +05:30
thomas chaton bb92754119
[bugfix] Add support for omegaconf and tpu (#6741)
* fix_hydra

* update changelog

Co-authored-by: Your Name <you@example.com>
2021-03-30 16:21:25 +01:00
Jirka Borovec 583fcf281c
update chlog v1.2.5 (#6742)
* update chlog v1.2.5

* legacy
2021-03-30 12:45:07 +02:00
Carlos Mocholí 90444706b2
Remove logger_connector legacy code (#6733) 2021-03-30 12:33:33 +02:00
Jirka Borovec 3c86193de0
update readme by v1.2.x (#6728) 2021-03-29 18:06:24 -04:00
Kaushik B f79a13e495
[Model Parallel] Add configure sharded model hook (#6679)
* Add base hook for model parallel

* fix callback signature

* Simplify hook

* Add hook logic

* add tests

* add property setter

* add logic for being called once

* Update changelog

* Fix

* fix return type

* fix lambda callback test

* Fix tests

* Apply code suggestions

* add logic for setup_optimizers_predispatch

* add common dummy model

* Swap call order

* Remove test that isn't needed anymore

* Update tests

* Add a bit more doc

* Few code review fixes

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Change hook name

* Fix test

* Test setup hook, refactor names

* Swap call order of callbacks and model initialization

* Change name of context manager

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-29 14:50:51 -06:00
thomas chaton 646cf2f7d4
[refactor] Move save_function to accelerator 1/n [DeepSpeed] (#6689)
* move save_checkpoint responsibility to accelerator

* update
2021-03-29 21:02:37 +02:00
thomas chaton 3a4c4246ee
[TPU] update is_tpu_exists utils internal logic to rely on xmp.spawn (#6719)
* update_logic

* update

* Update tests/utilities/test_xla_device_utils.py

* Update pytorch_lightning/utilities/xla_device.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Update pytorch_lightning/utilities/xla_device.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* update test

* Update tests/utilities/test_xla_device_utils.py

* update

* Apply fix

* Docstring

* flake8

* update

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-03-29 18:59:20 +01:00
Jirka Borovec 5b5a5cc80b
support python 3.9 (#4944)
* support python 3.9

* update CI

* onnxruntime

* .

* .

* onnxruntime

* t 55

* t 75

* add script

* use

* onnx

* onnx

* onnx

* whl

* np

* find

* 21

* Apply suggestions from code review

* Apply suggestions from code review

* onnx

* CI

* req

* ~ dockers

* min

* .

* drop horovod

* drop horovod

* drop horovod

* fix

* fix

* .
2021-03-29 12:20:13 -04:00