Commit Graph

4112 Commits

Author SHA1 Message Date
Boris Dayma 1e36cffbca
feat(wandb): support distributed modes (#11650)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2022-02-09 19:53:21 +01:00
Carlos Mocholí 8394770d4a
Move data fetcher ownership to the loops (#11621) 2022-02-09 20:04:24 +05:30
Biho-Kim 24de29974c
bug fix #10872 (#10965)
Co-authored-by: louie.kim <louie.kim@kakaocorp.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-09 14:15:49 +00:00
Carlos Mocholí 8822117200
Return the output of the optimizer step (#11711)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2022-02-09 09:37:13 +00:00
Danielle Pintz 9e63281a4c
remove todos (#11804) 2022-02-09 08:30:27 +00:00
ananthsub 9d4de3a863
Faster callback configuration validator checks (#11785)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
2022-02-09 08:24:14 +00:00
Rohit Gupta 182c18d319
Configure native deepspeed schedulers with interval='step' (#11788) 2022-02-09 08:20:50 +00:00
jjenniferdai 1203094a20
Introduce `Stateful` DataModule (#11637) 2022-02-07 21:13:24 +01:00
circlecrystal 43a89eb132
bug fix: restore_optimizers correctly handles non-mapping values in optimizer.state.values() (#11757)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-07 14:55:06 +00:00
Rohit Gupta 9ed44dee0d
Fix to avoid moving batch to device for DataParallel (#11780)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2022-02-07 14:26:18 +00:00
Rohit Gupta 581bf7f2f2
Deprecate `on_epoch_start/on_epoch_end` hook (#11578) 2022-02-07 14:15:27 +00:00
ananthsub bbf27ed09a
Use fsspec in checkpoint connector for fault-tolerant training (#11776) 2022-02-07 13:29:41 +01:00
ananthsub 0ba25d3cac
Update DDPStrategy to use optimizers property from within class (#11777) 2022-02-07 13:28:37 +01:00
Rohit Gupta 7ec1e66e17
reduce only loss with dp (#11594)
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-02-07 17:00:29 +05:30
Krishna Kalyan f509e40ae3
Deprecate `on_before_accelerator_backend_setup` callback hook (#11655)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
2022-02-07 11:07:21 +00:00
ananthsub a64438c897
Centralize rank_zero_only utilities into their own module (#11747)
* Centralize rank_zero_only utilities into their own module

Fixes #11746

* PossibleUserWarning

* Update test_warnings.py

* update imports

* more imports

* Update CHANGELOG.md

* Update mlflow.py

* Update cli.py

* Update api_references.rst

* Update meta.py

* add deprecation tests

* debug standalone

* fix standalone tests

* Update CHANGELOG.md
2022-02-07 08:09:55 +00:00
Danielle Pintz 34c454c756
Small improvements to TB and CSV loggers (#11764)
* small improvements to TB and CSV loggers
* addr comments
* remove redundant lines and update tests

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
2022-02-07 14:59:39 +09:00
ananthsub 7900aabe62
Keep `is_global_zero` definitions in sync across strategy and trainer (#11761) 2022-02-07 01:33:32 +05:30
ananthsub dfda970572
Update TPU Spawn to use root_device instead of LightningModule's device (#11750) 2022-02-06 06:26:38 +00:00
Dan Dale 9d8faecdb2
Allow Horovod `teardown()` to complete gracefully if exception thrown in callback setup (#11752) 2022-02-05 11:13:21 -08:00
ananthsub 819a747031
Use `root_device` in XLAStatsMonitor callback (#11749) 2022-02-05 10:09:08 -08:00
ananthsub 7d9454a3e9
Use `root_device` in DeviceStatsMonitor callback (#11748)
* Use trainer.strategy.root_device in favor of LightningModule.device in DeviceStatsMonitor

Minor refactor to use the strategy's own `root_device` instead of the LightningModule's device property.

Attempts at manual model parallelization by extending this plugin will face difficulties with the assumption that the LightningModule has all of its parameters on the same device. 

For those use cases, it is critical to remove the assumption that the module has a device property. A module-level device property in general goes against PyTorch module design principles:
- https://github.com/pytorch/pytorch/issues/7460
- https://github.com/PyTorchLightning/pytorch-lightning/pull/1790#discussion_r423459412
2022-02-05 11:20:15 +01:00
ananthsub 241c97e6eb
Update HorovodStrategy to use optimizers property from within class (#11728) 2022-02-05 10:04:55 +01:00
Adrian Wälchli cc43d07db1
Remove legacy dead code in DDP script launch (#11678)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-05 11:40:16 +05:30
Dan Dale 3bc2407239
Allow access to ckpt_path within context of fit() (#11696)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-05 05:23:16 +01:00
Carlos Mocholí 7da931d1ca
Support no pre-fetching (#11606) 2022-02-05 03:59:46 +00:00
Danielle Pintz c71a1d7ea2
Remove `self._log_dir` from `BaseProfiler` (#11740) 2022-02-05 04:45:48 +01:00
ananthsub 72db64d294
Use the strategy's `root_device` instead of the LightningModule's device property (#11734) 2022-02-05 04:33:25 +01:00
Andres Algaba 58324b5197
Improve the result printing at the end of evaluation (#11332)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-02-05 03:03:22 +01:00
NathanGodey 8a1b1eeef8
WandbLogger's log_image can use step argument (#11716)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-05 01:02:41 +00:00
wangraying 8c07d8bf90
Add `Trainer(strategy="bagua")` (#11146)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Sean Naren <sean@grid.ai>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2022-02-04 17:02:09 +00:00
ananthsub 2eca957b29
Minor refactors to `init_dist_connection` (#11733) 2022-02-04 13:33:49 +01:00
Rohit Gupta 4d72110b51
Deprecate `on_batch_start/on_batch_end` callback hooks (#11577) 2022-02-03 19:51:56 +00:00
Rohit Gupta 400201712f
added warning for distributedsampler in case of evaluation (#11479) 2022-02-03 18:42:13 +00:00
Rohit Gupta 01abe72278
Fix to avoid val progress bar disappear after validate (#11700)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-03 13:35:38 +00:00
Rohit Gupta e9065e9d42
Fix rich with uneven refresh rate tracking (#11668)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-03 10:27:05 +00:00
Rohit Gupta 7948ed703d
Avoid enforcing `shuffle=False` for eval dataloaders (#11575) 2022-02-03 09:35:31 +00:00
Danielle Pintz 9ebd7df22a
Move progress bar disabling out of the Trainer (#11377)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-02-03 06:29:32 +00:00
Rohit Gupta 0cb64fb8ba
Fix mid-epoch warning call while resuming (#11556)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-03 05:42:31 +00:00
four4fish d43fd0d4d6
Lazy initialize Strategy.parallel_devices (#11572)
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-03 04:25:16 +00:00
Rohit Gupta eceefdc602
Fix rich progress bar render only on main pbar (#11690) 2022-02-03 04:18:07 +00:00
Krishna Kalyan 6291af5c19
Replace occurrences of `on_before_accelerator_backend_setup_called` with `setup` (#11568)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-03 04:14:33 +00:00
Peter Franek ed8a5dadce
Improving instructions in finetuning docstring (#10484) 2022-02-03 04:13:06 +00:00
Anton Schwaighofer f935319622
Allow a `CombinedLoader` as the training data in DDP (#11648)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-03 04:01:20 +00:00
Sebastian Raschka 0e17f16438
Clarify what the default values for log are based on hooks (#11611) 2022-02-03 03:55:42 +00:00
Jirka Borovec c5de105276
fix available modules (#11526) 2022-02-03 03:38:16 +00:00
Sebastian Raschka 9934569373
Fix typo in `TensorBoardLogger.log_metrics` error message (#11595) 2022-02-03 03:18:54 +00:00
Carlos Mocholí 3d3172d3da
[CLI] Support shorthand for loggers (#11533) 2022-02-03 02:58:14 +00:00
Bhadresh Savani 0ea48416cd
Removed subsection in `LightningDataModule` (#11675) 2022-02-03 02:53:43 +00:00
DuYicong515 0816a1997e
Add typing for utilities/memory.py (#11545)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-03 02:34:05 +00:00
Piyush Hirapara 72f0e5bfae
Deprecate `on_configure_sharded_model` callback hook for v1.6 (#11627)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Danielle Pintz <38207072+daniellepintz@users.noreply.github.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-02-03 02:29:26 +00:00
Krishna Kalyan 6586dd23b7
Mark `CheckpointConnector` as protected (#11550)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-03 02:26:08 +00:00
DuYicong515 06e2635c71
Refactor get_filesystem to use native fsspec API (#11708) 2022-02-03 01:55:24 +00:00
Akash Kwatra d5aa7717aa
Remove experiment property from abstract class (#11603)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-03 01:51:34 +00:00
Rohit Gupta ee049e123d
Fix rich progress bar metric render on epoch end (#11689)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-03 01:43:48 +00:00
jjenniferdai ec1379da2c
Rename `_SupportsStateDict` --> `_Stateful` Protocol (#11469) 2022-02-02 23:45:59 +01:00
Carlos Mocholí b8e360dafa
[CLI] Fix bug that forces overriding `configure_optimizers` (#11672) 2022-02-02 22:44:00 +00:00
Akash Kwatra 115a5d08e8
Decouple utilities from `LightningLoggerBase` (#11484)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-02 23:29:01 +01:00
Aki Nitta fbc1f9f1d9
Rename `Strategy.lr_schedulers` to `Strategy.lr_scheduler_configs` (#11549) 2022-02-02 22:10:01 +00:00
Nithin Rao b8d2c65a37
Set the state before saving "last" or "none" checkpoints (#11481)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-02 23:07:05 +01:00
Carlos Mocholí d7944a13cd
Teardown all internal components on exception (#11620) 2022-02-02 21:10:19 +00:00
Rohit Gupta 3eee8f18cf
Sort simple profiler summary based on mean duration (#11671) 2022-02-02 20:44:42 +00:00
Rohit Gupta 76175217e4
Fix val_loop run on restart (#11552)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-02 20:19:34 +00:00
Carlos Mocholí a44881cd90
Changes in preparation to #8578 (#11562) 2022-02-02 19:57:08 +00:00
Carlos Mocholí 79a3ff690b
Add typing to data fetching (#11515) 2022-02-02 20:53:50 +01:00
Chunyang Wen fe34bf2a65
Remove useless pass and abc (#11522) 2022-01-24 08:19:57 +00:00
Chunyang Wen 350c88e621
Let Accelerator inherit from ABC to make sure abstractmethod takes effect (#11521) 2022-01-23 20:47:43 +01:00
Carlos Mocholí 623dc974f5
Construct the hook kwargs inside each loop (#11511) 2022-01-22 15:57:12 +00:00
Carlos Mocholí 5ad5ba54c0
Refactor fetching function (#11516) 2022-01-20 20:06:58 +01:00
Carlos Mocholí 075b8801c9
Fix checkpoint values when saving and resetting the tuner state (#11518) 2022-01-20 18:54:40 +00:00
Carlos Mocholí 7295457a7b
[CLI] Save only the configuration used (#11532) 2022-01-20 12:35:43 +00:00
Rafał Jankowski e78d658c8d
Remove access to `_short_id` in NeptuneLogger (#11517) 2022-01-20 12:07:42 +00:00
Maaz Karim 16a04b29eb
Mark SignalConnector as protected (#11513)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
2022-01-20 08:39:59 +01:00
ananthsub 1bd6fc979e
Remove `Strategy.on_tpu` property (#11536) 2022-01-20 08:25:26 +01:00
ananthsub f41d1e5e5e
Remove `Strategy.on_gpu` (#11537) 2022-01-19 21:27:12 +00:00
Rohit Gupta f7f835fa0e
improve simple profiler output (#11414) 2022-01-18 19:58:34 +00:00
Carlos Mocholí 62818dbace
Use a dataclass as the scheduler config (#11443) 2022-01-18 20:23:32 +01:00
Carlos Mocholí 344ab1e0a5
Move the `lightning_optimizers` ownership to the `Strategy` (#11444) 2022-01-18 12:58:56 +01:00
Rohit Gupta 033dba1494
Disable attaching samplers when using `IterableDataset` (#11507) 2022-01-17 23:33:57 +01:00
Gautam R Gare ef4677ae7b
Change the default `prog_bar=False` to `True` in `LightningModule.log_grad_norm` (#11472)
* Reset on_step flag to True in log_grad_norm
* updated change log

Co-authored-by: Aki Nitta <nitta@akihironitta.com>
2022-01-18 02:34:50 +09:00
Carlos Mocholí 9cf9ded73b
Simplify data fetching (#11466) 2022-01-17 14:46:55 +00:00
Rohit Gupta cad604211b
update load_from_checkpoint docstrings (#11467) 2022-01-16 20:48:27 +00:00
Carlos Mocholí 18bbb39eef
Set `Loop.restarting` recursively (#11442)
* Set `Loop.restarting` recursively
* Docs
* CHANGELOG
* Update pytorch_lightning/loops/epoch/training_epoch_loop.py
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
2022-01-14 19:25:23 +09:00
Rohit Gupta 9771e7dff6
Update introduction docs (#11140) 2022-01-13 21:11:43 +00:00
Carlos Mocholí a80da35d5d
Fix compatibility with old checkpoints and fault-tolerance enabled (#11439) 2022-01-13 14:53:17 +01:00
Rohit Gupta 96a53382ac
Update utilities API references (#11450) 2022-01-13 13:22:58 +00:00
Carlos Mocholí 5914fb748f
Add typing to accelerators/gpu.py (#11333) 2022-01-12 19:44:51 +00:00
Rohit Gupta 00d1758bac
Update training tricks docs (#11169) 2022-01-12 16:26:03 +00:00
Carlos Mocholí f5bbc2cf17
Avoid in-place ops during logging result updates (#11401)
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
2022-01-12 09:09:36 +01:00
Rohit Gupta 221091afc4
move profiler docs (#11431) 2022-01-12 05:56:16 +00:00
Aki Nitta 8dc36c3745
Fix inconsistent exceptions raised with no `rich` installed (#11360)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-01-12 03:55:51 +00:00
Rohit Gupta 82c8875f33
Add `LightningModule.lr_scheduler_step` (#10249)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-01-12 03:53:49 +00:00
Carlos Mocholí 9771040621
Add typing to `TQDMProgressBar` (#11369) 2022-01-12 01:07:30 +00:00
edward-io 6107ce8e0d
Add DETAIL logs for batch use cases (#11008) 2022-01-12 01:22:48 +01:00
Rohit Gupta 06b8f82b8a
Update API references in doc (#11357) 2022-01-07 15:56:17 +01:00
Carlos Mocholí 59a7ba7605
Move `epoch_{start,end}` hooks from `TrainingEpochLoop` to `FitLoop` (#11201) 2022-01-06 15:13:18 +00:00
Danielle Pintz 57567edeab
Move newly added Trainer methods to be with other methods (#11335) 2022-01-06 14:10:21 +00:00
Kaushik B 42a1c72660
Add Accelerators section to Lightning docs (#10755) 2022-01-06 19:12:44 +05:30
Carlos Mocholí 8a549a550c
Integrate progress tracking into the progress bar (#11213) 2022-01-06 14:29:48 +01:00
Adrian Wälchli 3a2df4f75d
Fix typing in `pl.callbacks.xla_stats_monitor` (#11219)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-01-06 12:51:02 +00:00
NathanGodey 9b873dcfcc
Changed hook docstring (#11345) 2022-01-06 12:37:11 +00:00
Adrian Wälchli 9c8f52ccd1
Fix restoring lr scheduler states with deepspeed strategy (#11322)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2022-01-06 12:34:16 +00:00
Carlos Mocholí 5693a94c32
Extend the deprecation of `Trainer(resume_from_checkpoint)` (#11334) 2022-01-06 13:18:37 +01:00
Kaushik B e15579a4f3
Rename `_distrib_type` to `_strategy_type` (#11328)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-01-06 06:32:50 +00:00
Abhishek Saroha 43c140c8e5
Fix frozen dataclass instance error in `apply_to_collection` (#10927)
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-01-05 23:29:03 +01:00
Danielle Pintz 5b59c951e2
Deprecate `TrainerDataLoadingMixin` and move logic to `DataConnector` (#11282)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-01-05 21:23:57 +01:00
Carlos Mocholí c0726bacbe
Update `LightningCLI(trainer_defaults=...)` doc (#11309)
Co-authored-by: Mauricio Villegas <mauricio_ville@yahoo.com>
2022-01-05 19:43:35 +00:00
Adrian Wälchli 9906a1a54d
Update optimizer configuration info message in `DeepSpeedStrategy` (#11327) 2022-01-05 18:20:06 +00:00
Carlos Mocholí 1b6f851880
Add typing to some utility files (#11316) 2022-01-05 17:14:22 +00:00
Kaushik B 70c975a9f3
Fix exception message for FSDP running on CPU (#11325) 2022-01-05 18:02:31 +01:00
Rohit Gupta 8955081aaf
Update precision docs (#11010)
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2022-01-05 14:58:04 +00:00
Andrew Tritt dbf1acd5a5
Modify LSFEnvironment to use more reliable environment variable (#10825)
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-01-05 12:45:25 +00:00
Kaushik B 93223ff5ce
Introduce StrategyRegistry (#11233)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-01-05 17:14:18 +05:30
Carlos Mocholí 5ac129e95a
Rename ttp -> strategy (#11312) 2022-01-05 12:12:25 +01:00
Carlos Mocholí 33c3490685
Fix min/max logging default value (#11310) 2022-01-05 11:42:03 +01:00
Adrian Wälchli a8bd7ac73f
Fix lr scheduler state not being dumped to checkpoint in deepspeed strategy (#11307) 2022-01-05 08:38:08 +00:00
Rohit Gupta 7eab379da2
Raise a warning if evaluation is triggered with best ckpt in case of multiple checkpoint callbacks (#11274)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-01-04 17:22:32 +00:00
Carlos Mocholí a610e043d7
Add typing for utilities/enums.py (#11298) 2022-01-04 13:30:56 +01:00
Carlos Mocholí e9009d6058
Reset the total fit-validation batch progress on epoch (#11244) 2022-01-04 12:04:20 +01:00
Danielle Pintz 7fa1aebcc9
Remove `profile("training_step_and_backward")` (#11222) 2022-01-04 11:50:11 +01:00
Rohit Gupta 997da52f73
Update logic to make sure logged_metrics always contain tensors (#11270) 2022-01-04 10:32:44 +00:00
Rohit Gupta 98ea79b8b0
Add `opt_idx` to scheduler config if not assigned by user (#11247)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-01-04 14:57:15 +09:00
Ed Pizzi cf32127e7e
Avoid non-blocking GPU->CPU copies. (#11288)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-01-03 22:17:50 +00:00
Kaushik B 5a4df4ec7d
Update strategy import statements (#11238)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-01-03 15:54:46 +00:00
ananthsub 05ed9a201c
Group metrics generated by `DeviceStatsMonitor` for better visualization (#11254)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2022-01-03 13:26:17 +00:00
Adrian Wälchli 17cb3c70f7
Fix data fetcher selection (#11294) 2022-01-03 13:49:17 +01:00
Danielle Pintz b082715103
Remove `Strategy.optimizer_zero_grad` (#11246) 2022-01-03 13:46:57 +01:00
Adrian Wälchli 4eede7c30b
Add deprecation path for renamed training type plugins (#11227)
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-01-03 13:41:05 +01:00
jjenniferdai 4b5761539e
Remove `hpc_save` (#11101) 2022-01-03 12:23:13 +00:00
Adam Viola 1fc046cde2
Fix `_should_reload_dl_epoch` causing inconsistent validation dataloader reloading (#11036)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-28 02:20:57 +01:00
Danielle Pintz ca9b25db80
Remove `Strategy.init_optimizers` (#11236) 2021-12-23 18:48:21 +00:00
Danielle Pintz ba6a8ddcad
refactor _configure_schedulers (#11245) 2021-12-23 10:03:28 -08:00
Carlos Mocholí f44b209e72
Fix CLI race condition saving the config (#11199) 2021-12-23 16:45:06 +00:00
Carlos Mocholí 30236c837f
Reset the progress tracking state after sanity checking (#11218) 2021-12-23 16:36:03 +00:00
Kaushik B 0adcd6a048
Rename training_type_plugin file to strategy (#11239)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-23 14:01:23 +00:00
Adrian Wälchli c210e338ef
Update strategy import statements (#11231) 2021-12-23 08:26:28 +01:00
Danielle Pintz a6a28e08d2
Deprecate `TrainerOptimizersMixin` and move functionality to `core/optimizer.py` (#11155) 2021-12-22 17:56:37 -08:00
four4fish 81301dbba7
Rename `AcceleratorConnector.training_type_plugin` to `AcceleratorConnector.strategy` (#11212)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-23 01:36:23 +00:00
twsl 0b9034baef
Return only unique names/versions for LoggerCollection (#10976)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2021-12-23 00:35:38 +00:00
Kaushik B 576a5d62a0
Introduce strategies directory for Training Strategies (#11226)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-22 20:23:30 +00:00
Carlos Mocholí eb5b350f9a
Remove explicit isinstance checks in strategies for checkpoint io (#11177)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-22 04:41:45 +00:00
Adrian Wälchli b6dd1a3878
Fix typing in `pl.callbacks.lr_monitor` (#10802)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-12-22 03:50:00 +00:00
Adrian Wälchli ba8e7cd787
Fix BF16 teardown for TPU precision plugin (#10990)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-12-22 03:47:14 +00:00
four4fish cf5ef32f7b
Deprecate Trainer.training_type_plugin in favor of trainer.strategy (#11141)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-22 02:11:43 +00:00
Adrian Wälchli 17ad1a4c00
Rename `ParallelPlugin` to `ParallelStrategy` (#11123) 2021-12-22 01:09:17 +00:00
four4fish 4bfe5bda0f
Rename the DDPSpawnShardedPlugin to DDPSpawnShardedStrategy (#11210)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-22 00:27:36 +00:00
Aki Nitta 28ce9105e4
Rename `SingleDevicePlugin` to `SingleDeviceStrategy` (#11181)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-21 23:56:14 +00:00
four4fish f98cd78e9e
Renamed the `DDPSpawnPlugin` to `DDPSpawnStrategy` (#11145) 2021-12-21 23:06:14 +00:00
four4fish 0c69c757d4
Rename the `DataParallelPlugin` to `DataParallelStrategy` (#11183) 2021-12-21 22:00:24 +00:00
Aki Nitta c3cd4d050f
Rename `SingleTPUPlugin` to `SingleTPUStrategy` (#11182) 2021-12-21 20:09:30 +00:00
four4fish 1c5a5c3dfe
Renamed the DDP2Plugin to DDP2Strategy (#11185)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-21 19:21:00 +00:00
Carlos Mocholí b2c3d01b3e
Fix master import conflict (#11203) 2021-12-21 18:47:56 +00:00
Danielle Pintz ac8dc2c2f3
Deprecate `TrainerCallbackHookMixin` (#11148) 2021-12-21 09:47:08 -08:00
four4fish caab69aabb
Renamed DDPShardPlugin to DDPShardStrategy (#11187)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-21 17:18:25 +00:00
Carlos Mocholí f696326060
Remove `should_rank_save_checkpoint` property from TTP (#11070) 2021-12-21 18:11:20 +01:00
Carlos Mocholí 3692eba807
Drop Python 3.6 support (#11117) 2021-12-21 17:06:15 +00:00
Aki Nitta 9da78a94bd
Rename `TPUSpawnPlugin` to `TPUSpawnStrategy` (#11190)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-21 16:36:16 +00:00
Danielle Pintz 1177389d5a
Move `TrainerCallbackHookMixin.on_save/load_checkpoint` to `Trainer` and rename for clarity (#11179) 2021-12-21 17:30:01 +01:00
Kaushik B 2e947a88e0
Rename IPUPlugin to IPUStrategy (#11193) 2021-12-21 15:55:41 +00:00
Kaushik B 283bdece0a
Rename DeepSpeedPlugin to DeepSpeedStrategy (#11194) 2021-12-21 15:18:01 +00:00
Oliver Borchert 17aceafa80
Suppress Warning in `PredictionEpochLoop` (#11189)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-12-21 14:40:41 +00:00
Kaushik B ba0c901395
Rename HorovodPlugin to HorovodStrategy (#11195) 2021-12-21 14:31:41 +01:00
Rohit Gupta 93ce2d7cc9
Avoid torch amp cuda warning with bf16 on cpu (#11161)
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-21 18:24:26 +05:30
four4fish b64dea9dc3
Rename `DDPPlugin` to `DDPStrategy` (#11142)
* Rename DDPPlugin to DDPStrategy

* Change ddp_plugin to ddp_strategy

* update changelog

* rename occurrences in docs

* rename more occurrences

* fix line too long

* more fixes

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-21 08:55:51 +00:00
jjenniferdai 31f39c9578
Move `CheckpointConnector.fault_tolerant_auto_save_path` out of `CheckpointConnector.hpc_resume_path` (#11092) 2021-12-21 02:24:01 +01:00
Rohit Gupta 787f41eff6
update optimizer_step example in docs (#10420) 2021-12-21 08:19:40 +09:00
Adrian Wälchli 08e661ff72
Rename `restore_checkpoint_after_pre_dispatch` to `restore_checkpoint_after_setup` (#11166)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-12-20 17:16:52 +00:00
Carlos Mocholí e8169bbd46
Fix setter usage for checkpoint io and precision in TTP (#11071)
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2021-12-20 17:45:32 +01:00
Adrian Wälchli f5c2881b68
3/n Simplify spawn plugins: Merge `pre_dispatch` and `setup` logic (#11137) 2021-12-20 17:41:22 +01:00
Adrian Wälchli 2e47e2f4ae
Set spawn_method on initialization (#11162) 2021-12-20 17:39:54 +01:00
four4fish 0ee78e96ef
Rename `DDPFullyShardedPlugin` to `DDPFullyShardedStrategy` (#11143)
* Rename DDPFullyShardedPlugin to DDPFullyShardedStrategy

* update fsdp_plugin to fsdp_strategy

* update changelog

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-12-20 17:11:20 +01:00
ORippler 86a3c5e2a3
Add required states for resumed ModelCheckpoint GC (#10995)
* Add required states for resumed ModelCheckpoint GC

* Add backwards compatibility with legacy ckpts

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* Add test to check if attrs are written to ckpt

Note that we do not yet check for proper loading/reinstantiation of
ModelCheckpoint based on the ckpt written to disk

* Test if attributes are restored properly from ckpt

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix broken `test_callbacks_state_fit_ckpt_path`

`ModelCheckpoint` is configured to save after every epoch,
but `trainer.fit` is called with `max_steps = 1`

Note there may be a better way of doing this, where `ModelCheckpoint`
is called after `training_step`

* Update test_restore.py

* Update test_restore.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Check that all attributes are restored properly

* revert changes, use fix on master

* Convert to proper unit test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactor `test_mode_checkpoint_saveload_ckpt`

* First save, then load ckpt.
* Instantiate ModelCheckpoint twice.

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-12-20 17:05:15 +01:00
Danielle Pintz b1baf460d9
Include hook's object name when profiling (#11026) 2021-12-20 15:18:24 +01:00
Adrian Wälchli 29eb9cccf2
Rename the `TrainingTypePlugin` base to `Strategy` (#11120)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: four4fish <88516121+four4fish@users.noreply.github.com>
2021-12-20 12:50:11 +00:00
guyang3532 cc4a978bf6
Safely disable profiler (#11167) 2021-12-20 11:51:46 +00:00
Carlos Mocholí 7ed3dbf191
Fix evaluation logging on epoch end with multiple dataloaders (#11132) 2021-12-19 15:51:01 +01:00
Danielle Pintz f95976d602
rename _call_ttp_hook to _call_strategy_hook (#11150) 2021-12-18 17:53:03 -08:00
Rohit Gupta 3461af0ddb
Add support for returning callback from `LightningModule.configure_callbacks` (#11060) 2021-12-18 10:46:35 +00:00
Rafał Jankowski 3cc69f992b
Fixed NeptuneLogger when using DDP (#11030)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2021-12-18 01:40:13 +00:00
Carlos Mocholí 62f1e82e03
Fix CVE-2020-1747 and CVE-2020-14343 (#11099) 2021-12-17 20:27:15 +00:00
Carlos Mocholí 8508cce37d
Mark all result classes as protected (#11130) 2021-12-17 19:35:17 +00:00
Rohit Gupta 860959fb3f
Enable logging hparams only if there are any (#11105) 2021-12-17 19:40:56 +01:00
Carlos Mocholí dbb7f56b35
Deprecate `Trainer.verbose_evaluate` (#10931) 2021-12-17 19:26:32 +01:00
Carlos Mocholí 75d96d9897
Reset the current progress tracking state during double evaluation (#11119) 2021-12-17 19:20:11 +01:00
Adrian Wälchli 978f5e6ad6
Fix AttributeError when using CombinedLoader in prediction (#11111)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-12-17 18:02:25 +00:00
quancs 179b4dd415
remove redundant methods in RichProgressBar (#11100)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-12-17 17:40:31 +00:00
Carlos Mocholí 7e10f6d41f
Save the loop progress state by default (#10784) 2021-12-17 16:00:27 +00:00
Carlos Mocholí fa6d17c96f
Fix typing for utilities.warnings (#11115) 2021-12-17 15:07:27 +01:00
Adrian Wälchli 6582249a0c
Fix signal teardown outside main thread (#11124) 2021-12-17 14:12:02 +01:00
Carlos Mocholí 5956a0716b
Track the evaluation loop outputs in the loop (#10928) 2021-12-17 14:00:47 +01:00
Adrian Wälchli 210ff845c1
Mark `Trainer.run_stage` as protected (#11000)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-12-17 13:46:03 +01:00
Sean Naren c66cd12445
Remove partitioning of model in ZeRO 3 (#10655) 2021-12-17 12:36:53 +00:00
Carlos Mocholí 4415677994
Add typing for `trainer.logger` (#11114) 2021-12-17 13:34:18 +01:00
Carlos Mocholí 5932f52b2f
Avoid the deprecated `onnx.export(example_outputs=...)` in torch 1.10 (#11116) 2021-12-17 10:11:11 +01:00
Adrian Wälchli e19d93f69e
Initialize ModelCheckpoint state as early as possible (#11108) 2021-12-17 00:18:29 +01:00
Adrian Wälchli 262aefc8df
Remove obsolete `pre_dispatch` in `DDPSpawnShardedPlugin` (#10988) 2021-12-16 21:43:15 +01:00
Adrian Wälchli 2b0075a47e
Teardown sync-batchnorm after training (#11078)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-12-16 18:58:44 +00:00
Carlos Mocholí 46d6fbf11b
Add `Loop.replace` (#10324)
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2021-12-16 17:41:38 +00:00
Adrian Wälchli c335a7891d
Remove redundant special case for disabling the progress bar on TPU (#11061) 2021-12-16 18:02:50 +01:00
Carlos Mocholí f37bd4677d
Update mypy (#11096) 2021-12-16 17:53:12 +01:00