lightning

Commit Graph

Author	SHA1	Message	Date
Sean Naren	b2973a035e	Introduce CheckpointIO Plugin (#8743 )	2021-08-13 17:35:31 +01:00
Carlos Mocholí	a1264a6850	Automatic string fixes (#8886 )	2021-08-13 14:28:14 +00:00
Binh Tang	efec3d461c	Move logger and profiler finalization to trainer's teardown (#8685 ) Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-08-05 10:09:43 +02:00
Carlos Mocholí	ed13040729	Connect the model to the training type plugin at the start of run (#8536 )	2021-08-04 17:43:34 +02:00
Sean Naren	49d03f87fe	[docs] Update deepspeed docs, add some more information and link to streamlit (#8691 )	2021-08-03 16:12:36 +00:00
Sean Naren	e5d9e21dea	Fix save/load/resume from checkpoint for DeepSpeed Plugin (#8397 )	2021-08-02 22:31:05 +00:00
thomas chaton	9e61de2063	Torch Elastic DDP DeadLock bug fix (#8655 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-08-02 21:48:43 +02:00
Jirka Borovec	f67892ea96	CI: yesqa (#8564 ) * add yesqa * fix flake8 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-08-02 16:05:56 +00:00
Sean Naren	7a1e97203e	Add property to skip restoring optimizers and schedulers via plugin (#8644 )	2021-07-31 10:08:10 +02:00
Sean Naren	07b7dc9c17	[Fix] Add delay property for checkpointing, refactor loading checkpoint (DeepSpeed Checkpointing Fix 1/n) (#8627 ) * Add property to delay checkpointing, move loading checkpoint file into the run function to allow deepspeed engine to be loaded * Add a small test * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/accelerators/accelerator.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Address review * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-07-30 11:31:08 +01:00
thomas chaton	c7f8c8c3c8	[bugfix] DeepSpeed with no schedulers (#8580 ) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-07-27 15:28:10 +00:00
Carlos Mocholí	e63968ab88	Add `pyupgrade` to `pre-commit` (#8557 ) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-07-26 14:38:12 +02:00
Carlos Mocholí	a64cc37394	Replace `yapf` with `black` (#7783 ) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-07-26 13:37:35 +02:00
deepsource-autofix[bot]	2cf03af155	Remove undefined name from `__all__` (#8468 ) * Remove undefined name from `__all__` Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-07-26 10:52:35 +02:00
Kaushik B	ef7d41692c	Add `ddp_*_find_unused_parameters_false` to Plugins Registry. (#8483 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-07-24 04:02:54 +00:00
Carlos Mocholí	4a64bc3fd3	Fix DeepSpeed lr scheduler logic (#8527 ) * Fix deepspeed scheduler logic * Fix tests * Minor changes * Improve tests * inference fix * CHANGELOG Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-07-23 10:08:58 +01:00
Adrian Wälchli	0ad7f3a829	Fix log_dir tracking in case of multiple Trainer instances + DDP (#7403 ) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-07-23 09:18:23 +02:00
Carlos Mocholí	f7027a8701	Remove `torch >= 1.6` checks (#8523 )	2021-07-23 04:03:20 +00:00
Kaushik B	5452590872	fix: Enable manual optimization for TPUs (#8458 )	2021-07-22 15:33:35 +05:30
thomas chaton	c9af1a7aec	[bugfix] Reduce memory leaks (#8490 ) * reduce memory leak * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update changelog * Apply suggestions from code review Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> * resolve flake8 * update on comments * resolve bug * update * Undo whitespace changes * remove bug * resolve flake8 * revert change * update on comments * delete the ddp wrapper as it hold memory * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve flake8 * update on comments * update changelog * resolve test * Update CHANGELOG * Refactor teardown * Fix comment * Do it for non-gpu too * remove ref when the model is not a lightning_module * Fix import error * move down * resolve bug * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve assignement * update * move above * Fix device calls to support tpu training * Updat todo Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Kaushik B <kaushikbokka@gmail.com>	2021-07-21 11:37:05 +02:00
marsggbo	d0038b521c	Bugfix: horovod optimizer missing 2 required positional arguments (#7840 ) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: thomas chaton <thomas@grid.ai> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-07-21 08:11:26 +00:00
Sean Naren	8a9ee403be	Add Windows Support for DeepSpeed (#8488 ) * Modify deepspeed distributed to support windows * Add weak test * Cleanups * Capture more in tests * Add comment * Cleaner asserts	2021-07-20 13:55:52 +00:00
deepsource-autofix[bot]	3628c314e5	Merge `isinstance` calls (#8469 ) Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>	2021-07-19 14:34:37 +00:00
Stephen McGroarty	b7e5bc7a36	Only output IPU report on request (#8340 ) These reports can be quite large and involve some processing to produce. It means on larger models there's a noticeable performance hit to produce the cycles/memory reports.	2021-07-19 12:52:58 +00:00
Yi Wang	adaa32f47a	[DDP] Remove the outdated limitations of DDP communication hook since 1.9 (#8346 ) * [DDP] Remove the outdated limitations of DDP communication hook since 1.9 1. DDP communication hook can work on multiple backends since 1.9. 2. SPMD in DDP is completely retired in 1.9, and SPSD is the only option. * Update ddp.py Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-07-19 13:55:42 +02:00
Sean Naren	06ac7d9649	[Fix] Remove DeepSpeed Plugin FP16 exception (#8462 ) * Remove error, add mixed to check * Add test * Remove test * Add changelog * Add test for mixed * Update tests/plugins/test_deepspeed_plugin.py * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add special Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-07-19 11:12:31 +00:00
Adrian Wälchli	b42efa7d86	support launching Lightning ddp with traditional command (#7480 ) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-07-14 11:25:36 +00:00
deepsource-autofix[bot]	b2ba2e6333	Use literal syntax instead of function calls to create data structure (#8406 )	2021-07-14 10:32:13 +00:00
Andrew Tritt	3102922647	Add LSF support (#5102 ) * add ClusterEnvironment for LSF systems * update init file * add available cluster environments * clean up LSFEnvironment * add ddp_hpc as a distributed backend * clean up SLURMEnvironment * remove extra blank line * init device for DDPHPCAccelerator We need to do this so we don't send the model to the same device from multiple ranks * committing current state * add additional methods to ClusterEnvironments * add NVIDIA mixin for setting up CUDA envars * remove troubleshooting prints * cleanup SLURMEnvironment * fix docstring * cleanup TorchElasticEnvironment and add documentation * PEP8 puts a cork in it * add set_ranks_to_trainer * remove unused import * move to new location * update LSF environment * remove mixin * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * changelog * reset slurm env * add tests * add licence * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * test node_rank * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add lsf env to docs * add auto detection for lsf environment * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix is_using_lsf() and test * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-07-09 16:14:26 +02:00
Dusan Drevicky	1b06edf2f2	Add the `on_before_optimizer_step` hook (#8048 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-07-09 13:30:52 +02:00
thomas chaton	1c825a2a9c	Add the `on_before_backward` hook (#7865 ) * Add callback to hook tests and add predict test * Fix lambda callback test * Simplify lambda call test * Use LambdaCallback * Dynamically append to called for the model * Remove print * Consistency * Consistency * Prepare args/kwargs testing * yapf doesn't like dict literals * Add arguments for fit no val test * Add arguments for fit no val test * add before_backward_hook * add test * resolve flake8 * resolve tests * update changelog * add on_before_backward to LightningModule * update on comments * Test arguments * Datamodule refactor * Fix eval test * remove extra file * resolve bug * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * move to hooks * update * resolve flake8 * update on comments * Update full fit + val test * Update test * Remove FIXME * Remove FIXME * Undo change * Fix * Parametrize fit hook test * Comment * Parametrize fit hook test with different precision plugins * Fix tests * Parametrize fit hook test with manual optimization * Unnecessary parenthesis * WIP * Comments * Fix message * Test CI error * Revert "Test CI error" This reverts commit `39c4a85a83`. * Add ddp training type teardown * Update CHANGELOG * Adrian's fix * Use destructor * Update CHANGELOG.md * RPC destructor * Update pytorch_lightning/plugins/training_type/ddp.py * Why do you not work :( * Missing condition * Fix deepspeed test * GC collect in conftest * Do not show warnings for special tests * Needs to run on 1.8 To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8" * Run torch 1.8 * Skip test due to 'Python bus error' * Debug NCCL * shm size * Disable warnings for special tests * Remove NCCL_DEBUG statement * Try smaller shm size * Revert "Skip test due to 'Python bus error'" This reverts commit `e0a3e8785d`. * README and adjust versions * Avoid self.on_gpu call * empty cache cleanup * More garbage collection * Unroll parametrizations * Do not reuse mock * Undo changes * Undo notebooks modification * resolve test * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * delete file * Undo * Fix test * Revert "WIP" This reverts commit `f5828a8c42`. * Rename * Remove optimizers * Fix bug with LightningOptimizer * Add optimizers * update * update * Update CHANGELOG * On after backward refactor * Do not call super * Fixes * Remove should_accumulate * pre/post backward refactor * Call the LM backward hook * Update tests * Remove dev debug patch * Fix test * Remove optimizer arguments and typing * Docs fixes * Fix comment * Undo changes * Split manual and auto * Undo change * Deepsource * Remove optimizers * Undo changes * Call the hook * Docs * Docs Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-07-09 06:15:57 +00:00
Carlos Mocholí	eb6d991218	Refactor plugins backward (#8328 )	2021-07-08 16:02:09 +02:00
Carlos Mocholí	c4353ea702	Remove `dev_debugger.call_count` (#8317 )	2021-07-07 19:59:59 +02:00
Carlos Mocholí	368ac1c622	[CLI] Drop `ArgumentParser` when pickling and save before spawning (#8017 ) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-07-07 17:56:13 +00:00
Carlos Mocholí	398eed508f	Fix `self.optimizers()` not returning a single `LightningOptimizer` (#8326 )	2021-07-07 18:57:45 +02:00
Adrian Wälchli	d73c32ab51	move `torch.cuda.set_device()` to enable collective calls earlier in setup (#8312 )	2021-07-07 13:15:41 +02:00
Sean Naren	6d558961e3	[IPU] Allow poptorch.Options to override Trainer (#8233 ) * Add test for poptorch Options * Hacks to get manual plugin support * Revert changes * Fix tests + ensure logic follow suit * Update pytorch_lightning/plugins/training_type/ipu.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Cleaner * Cleaner Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-07-05 13:42:00 +00:00
Carlos Mocholí	ea88105b88	Parametrize fit hook test with different precision plugins (#8070 ) Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-07-05 10:50:01 +00:00
Kaushik B	7b6d0a842c	Fix progress bar updates for Pod Training (#8258 ) * Fix progress bar updates for Pod Training * Fix progress bar updates for Pod Training * Add _pod_progress_bar_force_stdout	2021-07-05 10:38:38 +01:00
Sean Naren	07b1ce227c	[IPU] Fix Custom Poptorch options to IPUPlugin (#8241 ) * Fixes to ensure ipu options are respected * Better setter * Add test for poptorch Options * Fix test * fix ipu test * Update pytorch_lightning/plugins/training_type/ipu.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-07-02 11:23:57 +00:00
Adrian Wälchli	e7139ab9f7	Support `DDPPlugin` to be used on CPU (#6208 ) * Skip test due to 'Python bus error' * Debug NCCL * Remove NCCL_DEBUG statement * Revert "Skip test due to 'Python bus error'" This reverts commit `e0a3e8785d`. * fix * add test * changelog * yapf * patch os environ * make a special test * destroy pg * debug * revert * revert * problematic test * skip * try the fixture * test * update sensitive test * update changelog * remove comment * update wrong test * update test name * parameterization * Revert "parameterization" This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc. * remove conftest * ignore test * teardown * fix merge * deep speed parameterization * uncomment test * update chlog * update changelog * split tests * update test update test update test update test * update test comments * unroll test * unroll test * unroll test * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * increase shm * sudo * unroll ipu * Revert "sudo" This reverts commit `6cc68c1478`. * Revert "increase shm" This reverts commit `8c27163483`. * x * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * find guilty test * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * POPTORCH_WAIT_FOR_IPU=1 * move test * redo parameterize for ipu * de-comment test * move chlog * Update tests/accelerators/test_accelerator_connector.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> * Update tests/accelerators/test_accelerator_connector.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>	2021-07-02 12:00:24 +01:00
deepsource-autofix[bot]	7e2f84e050	Remove methods with unnecessary super delegation. (#8148 ) * Remove methods with unnecessary super delegation. * Update fully_sharded.py * replace init in test Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Ethan Harris <ethanwharris@gmail.com>	2021-07-02 08:00:55 +00:00
Carlos Mocholí	74eb6cc7e9	Clean `cuda.empty_cache` usage (#8199 )	2021-06-30 13:04:24 +02:00
Ethan Harris	57dce7244c	Fix double precision casting complex buffers (#8208 ) * Fix double precision casting complex buffers * Update CHANGELOG.md * Fixes * Fixes * Fix Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-06-30 10:57:42 +01:00
Carlos Mocholí	2e537b75e3	Deprecate `DDPPlugin.task_idx` (#8203 )	2021-06-30 01:02:55 +02:00
Carlos Mocholí	df601405d9	Use full `torch.distributed` import (#8200 )	2021-06-29 22:44:10 +00:00
Kaushik B	9444a08d56	Fix Deprecation warning in DDPSpawn (#8193 )	2021-06-29 09:29:51 -07:00
Kaushik B	2a7fad92b9	Avoid passing unnecessary params from TPUSpawn to DDPSpawn (#8192 )	2021-06-29 14:30:54 +02:00
Adrian Wälchli	bf54ac1cad	fix NCCL error with non-consecutive trainer gpus (#8165 ) * device ids in barrier x x s same fix for spawn fix non-nccl x * add changelog * get nccl backend * get backend Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>	2021-06-28 22:08:10 +02:00
thomas chaton	c521624a92	[bugfix] Add mechanism to prevent deadlock for DDP on Exception Trigger (#8167 ) * add mechanism to prevent deadlock * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve flake8 + update changelog * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * remove space * resolve bugs * overwrite config * update on comments * update on comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update * update * update test with comments * Update pytorch_lightning/plugins/training_type/parallel.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * update on comments Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-06-28 19:26:03 +00:00

277 Commits