* Add property to delay checkpoint loading; move checkpoint-file loading into the run function so that the DeepSpeed engine can be loaded first
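A minimal sketch of the idea (class and property names are hypothetical, not the exact PyTorch Lightning API): the training-type plugin advertises that checkpoint loading must wait until its wrapped engine exists, and the trainer checks this flag before restoring.

```python
# Sketch only; names and surrounding API are illustrative.
class DeepSpeedLikePlugin:
    @property
    def restore_checkpoint_after_pre_dispatch(self) -> bool:
        # DeepSpeed wraps the model in its own engine, so the engine must be
        # initialized before a DeepSpeed checkpoint can be restored.
        return True
```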
* Add a small test
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/accelerators/accelerator.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Address review
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* reduce memory leak
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update changelog
* Apply suggestions from code review
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
* resolve flake8
* update on comments
* resolve bug
* update
* Undo whitespace changes
* remove bug
* resolve flake8
* revert change
* update on comments
* delete the DDP wrapper as it holds memory
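An illustrative teardown along these lines, assuming the plugin keeps the wrapped model on `self.model` (the attribute name is an assumption):

```python
import gc

import torch
from torch.nn.parallel import DistributedDataParallel


class PluginLike:
    def teardown(self) -> None:
        # Drop the DDP wrapper (its reducer holds gradient buckets) and keep
        # only the bare module, then release cached CUDA memory.
        if isinstance(self.model, DistributedDataParallel):
            self.model = self.model.module
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```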
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolve flake8
* update on comments
* update changelog
* resolve test
* Update CHANGELOG
* Refactor teardown
* Fix comment
* Do it for non-gpu too
* remove reference when the model is not a LightningModule
* Fix import error
* move down
* resolve bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolve assignment
* update
* move above
* Fix device calls to support TPU training
* Update todo
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
These reports can be quite large and take some processing to produce, so on larger models there is a noticeable performance hit when generating the cycles/memory reports.
* [DDP] Remove the outdated limitations on DDP communication hooks since 1.9
1. DDP communication hooks work with multiple backends since 1.9.
2. SPMD mode in DDP is completely retired in 1.9; SPSD (single-process single-device) is the only option.
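For context, this is how a built-in communication hook is registered on a `DistributedDataParallel` model (here the fp16 compression hook, which since PyTorch 1.9 is no longer restricted to the NCCL backend):

```python
import torch
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a process group has already been initialized.
ddp_model = DDP(torch.nn.Linear(8, 8))
# Compress gradients to fp16 before the all-reduce; works across backends
# since PyTorch 1.9.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```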
* Update ddp.py
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Remove error, add 'mixed' to the precision check
* Add test
* Remove test
* Add changelog
* Add test for mixed
* Update tests/plugins/test_deepspeed_plugin.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add special
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add ClusterEnvironment for LSF systems
* update init file
* add available cluster environments
* clean up LSFEnvironment
* add ddp_hpc as a distributed backend
* clean up SLURMEnvironment
* remove extra blank line
* init device for DDPHPCAccelerator
We need to do this so that we don't send the model to the same device from multiple ranks.
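A sketch of that device initialization (function name illustrative):

```python
import torch


def init_device(local_rank: int) -> None:
    # Pin this process to its own GPU before the model is moved; without
    # this, every rank would place its copy on the default device (cuda:0).
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
```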
* committing current state
* add additional methods to ClusterEnvironments
* add NVIDIA mixin for setting up CUDA env vars
* remove troubleshooting prints
* cleanup SLURMEnvironment
* fix docstring
* cleanup TorchElasticEnvironment and add documentation
* PEP8 puts a cork in it
* add set_ranks_to_trainer
* remove unused import
* move to new location
* update LSF environment
* remove mixin
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* changelog
* reset SLURM env
* add tests
* add licence
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* test node_rank
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add LSF env to docs
* add auto-detection for the LSF environment
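A hypothetical sketch of such auto-detection; the exact variables PyTorch Lightning checks may differ, but LSF batch jobs export identifiers such as `LSB_JOBID`:

```python
import os


def is_using_lsf() -> bool:
    # Hypothetical: treat the presence of LSF's job variables as evidence
    # that we are running inside an LSF-managed job.
    required = ("LSB_JOBID", "LSB_HOSTS")
    return all(var in os.environ for var in required)
```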
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix is_using_lsf() and test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add callback to hook tests and add predict test
* Fix lambda callback test
* Simplify lambda call test
* Use LambdaCallback
* Dynamically append to `called` for the model
* Remove print
* Consistency
* Consistency
* Prepare args/kwargs testing
* yapf doesn't like dict literals
* Add arguments for fit no val test
* Add arguments for fit no val test
* add before_backward_hook
* add test
* resolve flake8
* resolve tests
* update changelog
* add on_before_backward to LightningModule
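With the hook in place, a user can override it on their `LightningModule` to inspect the (possibly precision-scaled) loss just before `backward()` runs. A minimal sketch, assuming the hook signature `on_before_backward(self, loss)`:

```python
import torch
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def on_before_backward(self, loss: torch.Tensor) -> None:
        # Called right before loss.backward(); the loss may already have
        # been scaled by the precision plugin at this point.
        self.log("loss_before_backward", loss.detach())
```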
* update on comments
* Test arguments
* Datamodule refactor
* Fix eval test
* remove extra file
* resolve bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* move to hooks
* update
* resolve flake8
* update on comments
* Update full fit + val test
* Update test
* Remove FIXME
* Remove FIXME
* Undo change
* Fix
* Parametrize fit hook test
* Comment
* Parametrize fit hook test with different precision plugins
* Fix tests
* Parametrize fit hook test with manual optimization
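The parametrization pattern, roughly (test body elided; argument names illustrative):

```python
import pytest


@pytest.mark.parametrize("precision", [16, 32])
@pytest.mark.parametrize("automatic_optimization", [True, False])
def test_fit_hook_calls(precision, automatic_optimization):
    # One test covers every precision / optimization-mode combination,
    # asserting the same ordered list of hook calls for each.
    ...
```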
* Unnecessary parenthesis
* WIP
* Comments
* Fix message
* Test CI error
* Revert "Test CI error"
This reverts commit 39c4a85a83.
* Add ddp training type teardown
* Update CHANGELOG
* Adrian's fix
* Use destructor
* Update CHANGELOG.md
* RPC destructor
* Update pytorch_lightning/plugins/training_type/ddp.py
* Why do you not work :(
* Missing condition
* Fix deepspeed test
* GC collect in conftest
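A sketch of what a conftest-level collection step can look like (fixture name illustrative):

```python
import gc

import pytest
import torch


@pytest.fixture(autouse=True)
def collect_garbage():
    yield
    # Run after every test: break reference cycles and return cached CUDA
    # blocks, so one test's leftovers cannot starve the next of GPU memory.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```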
* Do not show warnings for special tests
* Needs to run on 1.8
To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8"
* Run torch 1.8
* Skip test due to 'Python bus error'
* Debug NCCL
* shm size
* Disable warnings for special tests
* Remove NCCL_DEBUG statement
* Try smaller shm size
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* README and adjust versions
* Avoid self.on_gpu call
* empty cache cleanup
* More garbage collection
* Unroll parametrizations
* Do not reuse mock
* Undo changes
* Undo notebooks modification
* resolve test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* delete file
* Undo
* Fix test
* Revert "WIP"
This reverts commit f5828a8c42.
* Rename
* Remove optimizers
* Fix bug with LightningOptimizer
* Add optimizers
* update
* update
* Update CHANGELOG
* On after backward refactor
* Do not call super
* Fixes
* Remove should_accumulate
* pre/post backward refactor
* Call the LM backward hook
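The counterpart hook after the refactor, sketched from the user's side: `on_after_backward` fires once gradients have been populated (the logging shown is just one illustrative use).

```python
import torch
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def on_after_backward(self) -> None:
        # Gradients exist here; e.g. log the global gradient norm.
        norms = [p.grad.norm() for p in self.parameters() if p.grad is not None]
        if norms:
            self.log("grad_norm", torch.stack(norms).norm())
```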
* Update tests
* Remove dev debug patch
* Fix test
* Remove optimizer arguments and typing
* Docs fixes
* Fix comment
* Undo changes
* Split manual and auto
* Undo change
* DeepSource
* Remove optimizers
* Undo changes
* Call the hook
* Docs
* Docs
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>