Commit Graph

126 Commits

Author SHA1 Message Date
Yi Wang 366fb39d2e
Support post-localSGD in Lightning DDP plugin (#8967)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-08-26 08:24:49 +01:00
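A minimal sketch of how post-localSGD is wired up after this commit, assuming the `DDPPlugin` arguments `ddp_comm_state`, `ddp_comm_hook`, and the new `model_averaging_period`; the state object and hook come from PyTorch itself (torch >= 1.9):

```python
from torch.distributed.algorithms.ddp_comm_hooks import post_localSGD_hook as post_localSGD

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

trainer = Trainer(
    gpus=4,
    plugins=DDPPlugin(
        ddp_comm_state=post_localSGD.PostLocalSGDState(
            process_group=None,     # use the default process group
            subgroup=None,          # let PyTorch build the averaging subgroup
            start_localSGD_iter=8,  # run plain DDP allreduce for the first 8 steps
        ),
        ddp_comm_hook=post_localSGD.post_localSGD_hook,
        model_averaging_period=4,   # average model params every 4 steps afterwards
    ),
)
```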
Sean Naren bac8b1be81
Add support for CPU AMP autocast (#9084) 2021-08-25 12:18:00 +00:00
Sean Naren 1feec8c601
Add bfloat16 support to Lightning Trainer (#9049) 2021-08-24 09:47:21 +00:00
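Together, these two commits enable the following precision configurations; a sketch, assuming the `precision="bf16"` flag they introduce (GPU bfloat16 requires a sufficiently recent PyTorch build):

```python
from pytorch_lightning import Trainer

# bfloat16 mixed precision on GPU (#9049)
trainer = Trainer(gpus=1, precision="bf16")

# native AMP autocast on CPU (#9084); on CPU, autocast runs in bfloat16
trainer = Trainer(precision="bf16")
```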
ananthsub 1e4d8929fb
Simplify checkpoint connector loading after Checkpoint IO plugin's introduction (#9045)
* Simplify checkpoint connector loading after the Checkpoint IO plugin's introduction

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-08-23 13:12:18 -07:00
Adrian Wälchli 49c52b0d4b
update an outdated error message in DDPPlugin (#9005) 2021-08-23 15:29:07 +00:00
Ning 2481816490
Deprecate `prepare_data_per_node` flag on Trainer and set it as a property for DataHooks (#8958)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-23 12:43:45 +00:00
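The migration path after this deprecation, sketched under the assumption that the flag becomes a settable attribute on any `DataHooks` implementor (`LightningModule` or `LightningDataModule`):

```python
import pytorch_lightning as pl


class MyDataModule(pl.LightningDataModule):
    def __init__(self):
        super().__init__()
        # previously: Trainer(prepare_data_per_node=False)
        self.prepare_data_per_node = False

    def prepare_data(self):
        # with the flag set to False, this runs only on the global rank-0
        # process instead of once per node
        ...
```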
Sean Naren c6b6888387
Add DeepSpeed Stage 1 + doc improvements for model parallel (#8974)
* Add stage 1 support + small doc improvements

* Add CHANGELOG.md
2021-08-18 19:40:19 +05:30
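With Stage 1 added, selecting a ZeRO stage through the plugins registry looks roughly like this (registry name assumed from the existing `deepspeed_stage_*` convention):

```python
from pytorch_lightning import Trainer

# ZeRO Stage 1: shard optimizer states across GPUs
trainer = Trainer(gpus=4, precision=16, plugins="deepspeed_stage_1")
```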
Sean Naren b2973a035e
Introduce CheckpointIO Plugin (#8743) 2021-08-13 17:35:31 +01:00
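A minimal sketch of the new plugin's intended use; the exact abstract signatures are an assumption here, but the shape is a pair of save/load hooks that the training type plugin delegates checkpointing to:

```python
from typing import Any, Dict

import torch

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import CheckpointIO


class MyCheckpointIO(CheckpointIO):
    """Toy implementation that simply delegates to torch.save/torch.load."""

    def save_checkpoint(self, checkpoint: Dict[str, Any], path: str) -> None:
        # `checkpoint` is the fully assembled state dict produced by Lightning
        torch.save(checkpoint, path)

    def load_checkpoint(self, path: str) -> Dict[str, Any]:
        return torch.load(path)


trainer = Trainer(plugins=[MyCheckpointIO()])
```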
Sean Naren e5d9e21dea
Fix save/load/resume from checkpoint for DeepSpeed Plugin (#8397) 2021-08-02 22:31:05 +00:00
thomas chaton 9e61de2063
Torch Elastic DDP deadlock bug fix (#8655)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-02 21:48:43 +02:00
Sean Naren 7a1e97203e
Add property to skip restoring optimizers and schedulers via plugin (#8644) 2021-07-31 10:08:10 +02:00
Jirka Borovec 0c0b24c031
Prune deprecated metrics (#8586)
* drop metrics

* drop tests

* fix imports

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-28 16:57:31 +00:00
thomas chaton c7f8c8c3c8
[bugfix] DeepSpeed with no schedulers (#8580)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-27 15:28:10 +00:00
Carlos Mocholí a64cc37394
Replace `yapf` with `black` (#7783)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 13:37:35 +02:00
Kaushik B ef7d41692c
Add `ddp_*_find_unused_parameters_false` to Plugins Registry. (#8483)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-24 04:02:54 +00:00
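These registry entries give a string shorthand for disabling DDP's unused-parameter scan; a sketch:

```python
from pytorch_lightning import Trainer

# equivalent to plugins=DDPPlugin(find_unused_parameters=False), skipping
# the graph scan DDP otherwise performs on every iteration
trainer = Trainer(gpus=2, plugins="ddp_find_unused_parameters_false")
```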
Carlos Mocholí 4a64bc3fd3
Fix DeepSpeed lr scheduler logic (#8527)
* Fix deepspeed scheduler logic

* Fix tests

* Minor changes

* Improve tests

* inference fix

* CHANGELOG

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-07-23 10:08:58 +01:00
Sean Naren 8a9ee403be
Add Windows Support for DeepSpeed (#8488)
* Modify deepspeed distributed to support windows

* Add weak test

* Cleanups

* Capture more in tests

* Add comment

* Cleaner asserts
2021-07-20 13:55:52 +00:00
deepsource-autofix[bot] ddf4a0213d
Use `is` to compare type of objects (#8404) 2021-07-19 11:17:45 +00:00
Sean Naren 06ac7d9649
[Fix] Remove DeepSpeed Plugin FP16 exception (#8462)
* Remove error, add mixed to check

* Add test

* Remove test

* Add changelog

* Add test for mixed

* Update tests/plugins/test_deepspeed_plugin.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add special

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-19 11:12:31 +00:00
Adrian Wälchli b42efa7d86
Support launching Lightning DDP with the traditional command (#7480)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-07-14 11:25:36 +00:00
Andrew Tritt 3102922647
Add LSF support (#5102)
* add ClusterEnvironment for LSF systems

* update init file

* add available cluster environments

* clean up LSFEnvironment

* add ddp_hpc as a distributed backend

* clean up SLURMEnvironment

* remove extra blank line

* init device for DDPHPCAccelerator

We need to do this so we don't send the model to the same device from multiple ranks

* committing current state

* add additional methods to ClusterEnvironments

* add NVIDIA mixin for setting up CUDA envars

* remove troubleshooting prints

* cleanup SLURMEnvironment

* fix docstring

* cleanup TorchElasticEnvironment and add documentation

* PEP8 puts a cork in it

* add set_ranks_to_trainer

* remove unused import

* move to new location

* update LSF environment

* remove mixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changelog

* reset slurm env

* add tests

* add licence

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test node_rank

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add lsf env to docs

* add auto detection for lsf environment

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix is_using_lsf() and test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-09 16:14:26 +02:00
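The new environment is auto-detected from LSF's job environment variables (`is_using_lsf()` above); passing it explicitly would look roughly like:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import LSFEnvironment

# normally selected automatically inside an LSF job; shown explicitly for clarity
trainer = Trainer(num_nodes=2, gpus=4, plugins=[LSFEnvironment()])
```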
Dusan Drevicky 1b06edf2f2
Add the `on_before_optimizer_step` hook (#8048)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-07-09 13:30:52 +02:00
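A sketch of the new hook: it fires after gradients are computed (and unscaled under native AMP) but before `optimizer.step()`, which makes it a natural place to inspect gradients:

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def on_before_optimizer_step(self, optimizer, optimizer_idx):
        # gradients are populated at this point; log their global norm
        norms = [p.grad.norm() for p in self.parameters() if p.grad is not None]
        if norms:
            self.log("grad_norm", torch.stack(norms).norm())
```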
thomas chaton 1c825a2a9c
Add the `on_before_backward` hook (#7865)
* Add callback to hook tests and add predict test

* Fix lambda callback test

* Simplify lambda call test

* Use LambdaCallback

* Dynamically append to called for the model

* Remove print

* Consistency

* Consistency

* Prepare args/kwargs testing

* yapf doesn't like dict literals

* Add arguments for fit no val test

* Add arguments for fit no val test

* add before_backward_hook

* add test

* resolve flake8

* resolve tests

* update changelog

* add on_before_backward to LightningModule

* update on comments

* Test arguments

* Datamodule refactor

* Fix eval test

* remove extra file

* resolve bug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move to hooks

* update

* resolve flake8

* update on comments

* Update full fit + val test

* Update test

* Remove FIXME

* Remove FIXME

* Undo change

* Fix

* Parametrize fit hook test

* Comment

* Parametrize fit hook test with different precision plugins

* Fix tests

* Parametrize fit hook test with manual optimization

* Unnecessary parenthesis

* WIP

* Comments

* Fix message

* Test CI error

* Revert "Test CI error"

This reverts commit 39c4a85a83.

* Add ddp training type teardown

* Update CHANGELOG

* Adrian's fix

* Use destructor

* Update CHANGELOG.md

* RPC destructor

* Update pytorch_lightning/plugins/training_type/ddp.py

* Why do you not work :(

* Missing condition

* Fix deepspeed test

* GC collect in conftest

* Do not show warnings for special tests

* Needs to run on 1.8

To avoid: "RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8"

* Run torch 1.8

* Skip test due to 'Python bus error'

* Debug NCCL

* shm size

* Disable warnings for special tests

* Remove NCCL_DEBUG statement

* Try smaller shm size

* Revert "Skip test due to 'Python bus error'"

This reverts commit e0a3e8785d.

* README and adjust versions

* Avoid self.on_gpu call

* empty cache cleanup

* More garbage collection

* Unroll parametrizations

* Do not reuse mock

* Undo changes

* Undo notebooks modification

* resolve test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* delete file

* Undo

* Fix test

* Revert "WIP"

This reverts commit f5828a8c42.

* Rename

* Remove optimizers

* Fix bug with LightningOptimizer

* Add optimizers

* update

* update

* Update CHANGELOG

* On after backward refactor

* Do not call super

* Fixes

* Remove should_accumulate

* pre/post backward refactor

* Call the LM backward hook

* Update tests

* Remove dev debug patch

* Fix test

* Remove optimizer arguments and typing

* Docs fixes

* Fix comment

* Undo changes

* Split manual and auto

* Undo change

* Deepsource

* Remove optimizers

* Undo changes

* Call the hook

* Docs

* Docs

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-09 06:15:57 +00:00
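The counterpart hook on the backward side; a sketch of its signature, which receives the loss tensor just before `loss.backward()` runs:

```python
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def on_before_backward(self, loss):
        # runs immediately before backward; the pre-existing on_after_backward
        # hook then runs once gradients have been populated
        self.log("loss_before_backward", loss)
```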
Carlos Mocholí eb6d991218
Refactor plugins backward (#8328) 2021-07-08 16:02:09 +02:00
Carlos Mocholí 398eed508f
Fix `self.optimizers()` not returning a single `LightningOptimizer` (#8326) 2021-07-07 18:57:45 +02:00
Carlos Mocholí ea88105b88
Parametrize fit hook test with different precision plugins (#8070)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-07-05 10:50:01 +00:00
Adrian Wälchli e7139ab9f7
Support `DDPPlugin` to be used on CPU (#6208)
* Skip test due to 'Python bus error'

* Debug NCCL

* Remove NCCL_DEBUG statement

* Revert "Skip test due to 'Python bus error'"

This reverts commit e0a3e8785d.

* fix

* add test

* changelog

* yapf

* patch os environ

* make a special test

* destroy pg

* debug

* revert

* revert

* problematic test

* skip

* try the fixture

* test

* update sensitive test

* update changelog

* remove comment

* update wrong test

* update test name

* parameterization

* Revert "parameterization"

This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.

* remove conftest

* ignore test

* teardown

* fix merge

* deep speed parameterization

* uncomment test

* update chlog

* update changelog

* split tests

* update test

* update test comments

* unroll test

* unroll test

* unroll test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* increase shm

* sudo

* unroll ipu

* Revert "sudo"

This reverts commit 6cc68c1478.

* Revert "increase shm"

This reverts commit 8c27163483.

* x

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* find guilty test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* POPTORCH_WAIT_FOR_IPU=1

* move test

* redo parameterize for ipu

* de-comment test

* move chlog

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-07-02 12:00:24 +01:00
Ethan Harris 57dce7244c
Fix double precision casting complex buffers (#8208)
* Fix double precision casting complex buffers

* Update CHANGELOG.md

* Fixes

* Fixes

* Fix

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-06-30 10:57:42 +01:00
Adrian Wälchli bf54ac1cad
fix NCCL error with non-consecutive trainer gpus (#8165)
* device ids in barrier

* same fix for spawn

* fix non-nccl

* add changelog

* get nccl backend

* get backend

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-06-28 22:08:10 +02:00
Carlos Mocholí 4d9b72b8a9
Nuke RPC (#8101) 2021-06-23 18:31:13 +00:00
Edgar Riba b378806b6c
Add `add_to_queue`/`get_from_queue` for DDP spawn (#7916)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-06-23 03:19:37 +02:00
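A sketch of the two new `LightningModule` hooks; under DDP spawn they let the spawned rank-0 process hand extra picklable state back to the main process (the attribute name below is hypothetical):

```python
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def add_to_queue(self, queue):
        # called in the spawned worker before it exits
        queue.put({"best_threshold": self.best_threshold})  # hypothetical attribute

    def get_from_queue(self, queue):
        # called back in the main process after the workers join
        self.best_threshold = queue.get()["best_threshold"]
```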
Sean Naren 55494e8745
Fix Special Tests (#7841)
* Remove port setting

* Drop one of the params to see what happens

* Split tests into two

* Try using port setting
2021-06-16 19:39:03 +02:00
thomas chaton d2983c7c51
[fix] Enable manual optimization DeepSpeed (#7970)
* resolve manual optimization

* resolve manual optimization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update changelog

* Simplify message

* Move from deprecated

* Split model parallel/manual model

* Use property

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-06-16 09:25:41 +00:00
Yifu Wang b71aa55b9e
Make optimizers skippable when using amp (#7975)
Co-authored-by: Yifu Wang <yifuwang@2012@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-06-16 00:23:30 +00:00
Kaushik B 78a14a3f56
Add `tpu_spawn_debug` to plugin registry (#7933) 2021-06-15 22:32:51 +00:00
Carlos Mocholí 560b1970af
Standardize positional datamodule and argument names (#7431)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-06-15 11:50:13 +00:00
Sean Naren f7459f5328
DeepSpeed Infinity Update (#7234)
* Update configs to match latest API

* Ensure we move the entire model to device before configure optimizer is called

* Add missing param

* Expose parameters

* Update references, drop local rank as it's now inferred from the environment variable

* Fix ref

* Force install deepspeed 0.3.16

* Add guard for init

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Revert type checking

* Install master for CI for testing purposes

* Update CI

* Fix tests

* Add check

* Update versions

* Set precision

* Fix

* See if i can force upgrade

* Attempt to fix

* Drop

* Add changelog

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-06-14 16:38:28 +00:00
Carlos Mocholí 5593b6f772
Merge pull request #7872 from PyTorchLightning/refactor/logger-poc-changes
Random fixes for logger connector PoC
2021-06-08 09:04:16 -04:00
Ethan Harris 03bb389b21
Fix double precision + ddp_spawn (#6924)
* Initial fix

* Initial fix

* Initial fix

* Updates

* Updates

* Update typing and docs

* Undo accidental refactor

* Remove unused imports

* Add DDP double precision test

* Remove unused variable

* Update CHANGELOG.md

* Fix test

* Update tests

* Formatting

* Revert bad change

* Add back changes

* Correct wrapping order

* Improve unwrapping

* Correct wrapping order

* Fix... finally

* Respond to comments

* Drop ddp test

* Simplify ddp spawn test

* Simplify ddp spawn test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-06-01 15:21:17 +00:00
shuyingsunshine21 299f2c481b
FSDP with full state dict (#7487)
* Fix some test errors

* checkpoint consolidation

* Update ddp_spawn.py

* Update test_metric_result_integration.py

* Update test_results.py

* Update utils.py

* Update utils.py

* Update test_all_gather_grad.py

* Update test_all_gather_grad.py

* Update test_results.py

* Revert "Update test_results.py"

This reverts commit 9d4a2b891d.

* Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate"

This reverts commit c5053da789, reversing
changes made to 0d23d75bc9.

* Revert "Update test_all_gather_grad.py"

This reverts commit 0d23d75bc9.

* Revert "Update utils.py"

This reverts commit 70fe5da9c6.

* Revert "Update utils.py"

This reverts commit a9aae99f6e.

* Revert "Update test_results.py"

This reverts commit ea74906878.

* Revert "Update test_metric_result_integration.py"

This reverts commit bf70e431b3.

* Revert "Update ddp_spawn.py"

This reverts commit f17210183b.

* Revert "checkpoint consolidation"

This reverts commit 536c1323b0.

* Revert "Revert "checkpoint consolidation""

This reverts commit 3a9fde915a.

* Revert "Revert "Revert "checkpoint consolidation"""

This reverts commit 7a369f47e1.

* Revert "Revert "Update ddp_spawn.py""

This reverts commit 8222dc98ea.

* Revert "Revert "Update test_metric_result_integration.py""

This reverts commit 6c095b2370.

* Revert "Revert "Update test_results.py""

This reverts commit 250d0aaaa2.

* Revert "Revert "Update utils.py""

This reverts commit 8651d54d79.

* Revert "Revert "Update test_all_gather_grad.py""

This reverts commit dcdcd29731.

* modify distributed environment to make test pass

* fix version for ddp plugin test

* fix

* fix

* changelog

* Update CHANGELOG.md

* fsdp with full state dict

* fix missing import

* modify unit test

* fix

* fix

* fix typo

* modify test and add changelog

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* limit max_epoch to 1 for testing

* test

* fix

* update

* testing remove special for multi gpu

* assert gpu

* add assertion for gpu

* fix

* Re-enable special test, use ModelCheckpoint

* Fix paths

* Fix path passing

* test

* test

* fix test

* fix

* pre-commit format

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-24 08:11:45 +01:00
shuyingsunshine21 2242423b75
refactor accelerator teardown -> training type plugin teardown (#7579) 2021-05-22 13:19:24 -07:00
Nic Eggert f4f51e0dcf
Add kubeflow cluster environment (#7300)
* Add kubeflow cluster environment

* Add KubeflowEnvironment to docs

* Add KubeflowEnvironment to the changelog

* break up a long line

* Add method to detect kubeflow environment

* Select Kubeflow environment when available

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Run pre-commit

* task_idx == 0

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 09:05:24 +01:00
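The environment is selected automatically when its detection method matches the PyTorchJob pod's environment variables; forcing it explicitly is a one-liner:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import KubeflowEnvironment

# normally auto-detected inside a PyTorchJob pod; shown explicitly for clarity
trainer = Trainer(num_nodes=4, gpus=1, plugins=[KubeflowEnvironment()])
```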
Adrian Wälchli a1a655d006
Reduce log output size in special tests (#7481) 2021-05-11 17:36:20 +02:00
shuyingsunshine21 987530cd38
Set `num_nodes` and `sync_batchnorm` From Trainer for Manually Passed Training Type Plugin (#7026)
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-08 11:25:51 +00:00
Leonard Lausen 98b94b810c
Fix DeepSpeedPlugin with IterableDataset (#7362)
* deepspeed add train_micro_batch_size_per_gpu argument

* Update naming and doc

* Modify to use auto naming convention, add test

* Add iterable tests

* Fix tests, attempt by mocking

* Import correct package

* Fix comparison

* Set as special test

* Remove import

* Add Changelog

Co-authored-by: SeanNaren <sean@grid.ai>
2021-05-07 10:46:03 +01:00
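With an `IterableDataset`, DeepSpeed cannot infer the batch size from the `DataLoader`, so this fix lets it be passed explicitly; a sketch (the parameter name follows this PR's "auto naming convention" and should be treated as an assumption):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin

trainer = Trainer(
    gpus=2,
    precision=16,
    plugins=DeepSpeedPlugin(logging_batch_size_per_gpu=32),  # assumed parameter name
)
```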
Kaushik B e21b7a62d7
Add ddp_find_unused_parameters_false to Registry (#7224) 2021-05-04 22:40:00 +00:00
Carlos Mocholí 8c0ea92af2
`TrainerState` refactor [5/5] (#7173)
* `TrainerState` refactor

* flake8

* Update finished check

* Test cleanup

* Fix tests

* Fixes

* Reorder

* flake8

* Update CHANGELOG

* Better docs

* Better docs

* Remove default

* Update tests

* Bad merge
2021-05-04 12:50:56 +02:00
Kaushik B 6d7c6d6403
Update Accelerator Connector for Registry (#7214) 2021-05-03 21:03:21 +00:00
thomas chaton 16d6c9828d
[bugfix] Apex never instantiated. (#7274)
* update

* update

* update apex

* update

* update

* update

* remove test.py

* update

* update

* update on comments

* update changelog

* update

* update

* typo
2021-04-30 13:16:28 -04:00
thomas chaton 013756404b
[bugfix] Add set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
* update

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update pytorch_lightning/plugins/precision/double.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* resolve tests

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-20 15:25:37 +00:00