Commit Graph

34 Commits

Author SHA1 Message Date
Adrian Wälchli 8943d8bca0
add missing logic to new plugins and accelerator (#5734)
* add missing logic

* missed imports

* import fixes

* isort

* mv f

* changelog

* format

* move helper to parallel plugin

* d
2021-02-01 13:23:53 -05:00
Justus Schock b3ebc18bcb
Hardware specific parts of Accelerator Refactoring (#5719)
* add basic accelerator class.
Co-Authored with @awaelchi

* pep8

Co-authored-by: @awaelchi

* add cpu accelerator

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add gpu accelerator

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add tpu accelerator

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add accelerator connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add single device training

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add single tpu

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add tpu spawn

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* make on_colab_kaggle utility func

* fixes

* move

* yapf

* .

* .

* .

* flake8

* sync accelerator connector changes from dev1.2

* changelog

* fix tpu handling

* tpu

* aval

* yapf

* Update pytorch_lightning/plugins/training_type/tpu_spawn.py

Co-authored-by: chaton <thomas@grid.ai>

* Update pytorch_lightning/accelerators/accelerator_connector.py

Co-authored-by: chaton <thomas@grid.ai>

* Update pytorch_lightning/plugins/training_type/tpu_spawn.py

Co-authored-by: chaton <thomas@grid.ai>

* Update tpu_spawn.py

* Update pytorch_lightning/accelerators/accelerator_connector.py

Co-authored-by: chaton <thomas@grid.ai>

* indentation

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: chaton <thomas@grid.ai>
2021-02-01 08:34:59 -05:00
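
A minimal sketch, assuming nothing about the library's exact class layout, of what the per-hardware split in #5719 above amounts to: one small accelerator class per device type behind a shared interface. The names `Accelerator`, `CPUAccelerator`, and `GPUAccelerator` are illustrative only.

```python
# Hypothetical sketch only -- not the library's exact classes.
import torch
import torch.nn as nn


class Accelerator:
    """Owns device placement for a model; one subclass per hardware type."""

    def setup(self, model: nn.Module) -> nn.Module:
        raise NotImplementedError

    def to_device(self, batch):
        return batch


class CPUAccelerator(Accelerator):
    def setup(self, model: nn.Module) -> nn.Module:
        return model.to(torch.device("cpu"))


class GPUAccelerator(Accelerator):
    def __init__(self, device_index: int = 0) -> None:
        self.device = torch.device("cuda", device_index)

    def setup(self, model: nn.Module) -> nn.Module:
        torch.cuda.set_device(self.device)
        return model.to(self.device)

    def to_device(self, batch):
        # Move a tensor batch onto the selected GPU; leave other types untouched.
        return batch.to(self.device) if torch.is_tensor(batch) else batch
```
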
Justus Schock 069ae27cef
Accelerator Refactor: Precision Plugins (#5718)
* add basic accelerator class.
Co-Authored with @awaelchi

* add basic training type plugin.
Co-Authored with @awaelchi

* pep8

Co-authored-by: @awaelchi

* update copyright

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add apex_amp

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add mixed base class

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add native amp

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add native amp sharded

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add tpu bfloat

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add inits

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update precision_plugin.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-01-31 13:12:02 -05:00
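
A hedged sketch of the precision-plugin idea from #5718 above, using only standard `torch.cuda.amp` calls: the plugin owns loss scaling and the optimizer step so the training loop stays precision-agnostic. Class names are illustrative, not the library's exact API.

```python
# Hypothetical sketch only; uses standard torch.cuda.amp primitives.
import torch
from torch.cuda.amp import GradScaler, autocast


class PrecisionPlugin:
    """Full-precision default: plain backward and optimizer step."""

    def backward(self, loss: torch.Tensor) -> None:
        loss.backward()

    def step(self, optimizer: torch.optim.Optimizer) -> None:
        optimizer.step()


class NativeMixedPrecisionPlugin(PrecisionPlugin):
    def __init__(self) -> None:
        self.scaler = GradScaler()

    def forward_context(self):
        # Run the forward pass under autocast so fp16 kernels are used.
        return autocast()

    def backward(self, loss: torch.Tensor) -> None:
        # Scale the loss to avoid gradient underflow in fp16.
        self.scaler.scale(loss).backward()

    def step(self, optimizer: torch.optim.Optimizer) -> None:
        self.scaler.step(optimizer)
        self.scaler.update()
```
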
Justus Schock 5d239ccd70
Base classes for accelerator refactoring (#5715)
* add basic accelerator class.
Co-Authored with @awaelchi

* Add base plugin class.
Co-authored with @awaelchi

* add basic training type plugin.
Co-Authored with @awaelchi

* add basic precision plugin.
Co-Authored with @awaelchi

* Add missing inits.
Co-authored with @awaelchi

* pep8

Co-authored-by: @awaelchi

* ignore flake8

* coverage omit

* imports in init

* lost

* imports

* flake8

* .

* .

* chlog

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/plugins/training_type/training_type_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-01-30 14:55:28 -05:00
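
A minimal sketch of how the three base classes introduced in #5715 above might compose: the accelerator delegates where/how the model runs to a training-type plugin and the numerics to a precision plugin. All names are illustrative, not the library's exact API.

```python
# Hypothetical composition of the base classes; all names illustrative.
import torch.nn as nn


class TrainingTypePlugin:
    """Where/how the model runs: single device, DDP, TPU spawn, ..."""

    def setup(self, model: nn.Module) -> nn.Module:
        return model


class PrecisionPlugin:
    """How numerics are handled: fp32, native AMP, apex, bf16."""

    def backward(self, loss) -> None:
        loss.backward()


class Accelerator:
    """Thin coordinator that delegates to the two plugins."""

    def __init__(self, training_type: TrainingTypePlugin, precision: PrecisionPlugin) -> None:
        self.training_type = training_type
        self.precision = precision

    def setup(self, model: nn.Module) -> nn.Module:
        return self.training_type.setup(model)

    def backward(self, loss) -> None:
        self.precision.backward(loss)
```
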
Jirka Borovec 7e2e874d95
Refactor: legacy accelerators and plugins (#5645)
* tests: legacy

* legacy: accel

* legacy: plug

* fix imports

* mypy

* flake8
2021-01-26 20:04:36 -05:00
Jirka Borovec 2846322f60
fix docs render (#5610) 2021-01-25 20:21:00 -05:00
Adrian Wälchli e806bb77fa
Refactor LightningDistributedDataParallel (#5185)
* add wrapper

* add squeeze

* replace LightningDistributedDP

* update import

* module access

* inputs

* refactor warning

* update

* resolve flake8

* remove old class

* set find unused params to False

* update docstrings

* update docs

* update docs

* add changelog

* deprecation

* rename wrapper -> module

* rename pl_module

* add unit tests

* Revert "add changelog"

This reverts commit 02ec0a6864f4ba2ace3bb6fc6ebc364e1a80ffd7.

* Revert "set find unused params to False"

This reverts commit 8e451515e6ba3227d00f4a5cb63f332cfedb7b30.

Co-authored-by: Ubuntu <thomas@grid.ai>
2021-01-13 14:35:42 -05:00
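
The refactor in #5185 above replaces the `LightningDistributedDataParallel` subclass with a small wrapper module handed to stock DDP. A hypothetical sketch of that wrapper, assuming the wrapped object exposes `training_step`/`validation_step` and that a distributed process group is initialized before wrapping:

```python
# Hypothetical wrapper; assumes `module` exposes training_step/validation_step
# and that torch.distributed is already initialized before wrap_in_ddp is called.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


class _StepDispatchWrapper(nn.Module):
    def __init__(self, module: nn.Module) -> None:
        super().__init__()
        self.module = module  # the user's module

    def forward(self, *args, **kwargs):
        # DDP only hooks forward(), so dispatching here keeps gradient sync working.
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)


def wrap_in_ddp(module: nn.Module, device_ids=None) -> DistributedDataParallel:
    return DistributedDataParallel(_StepDispatchWrapper(module), device_ids=device_ids)
```
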
Jirka Borovec 0f36525e8f
fix/enable - check F401 (#5201)
* refactor - check F401

* missed

* fix
2020-12-21 10:15:04 +01:00
chaton 2c3d43dcb5
Initialize trainer with None in DDPAccelerator (#4915)
* Initialize trainer with None

* add typing to all accelerators

* resolve imports

* update

* add typing

* removed typo

* update

* Fix formatting and imports in accelerator

Co-authored-by: maxjeblick <maxjeblick@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-12-10 15:24:44 +01:00
Jirka Borovec cdbddbe99f
release 1.1.0 (#5048)
* release 1.1.0

* pep8
2020-12-10 00:52:39 +00:00
Jirka Borovec ce9179591d
ref: clean config [1/n] add intermediate setters (#4990)
* add intermediate setters

* show inputs

* fix options

* move

* fix

* less talk

* fix

* talk less

* str

* cases

* rename

Co-authored-by: chaton <thomas@grid.ai>
2020-12-09 14:13:57 -05:00
Ananya Harsh Jha 127454ade2
All gather with grads (#5012)
* all_gather

* ddp

* horovod

* grad tests

* fixed ddp

* ddp fixed, removed tpu, horovod for now

* changelog

* windows fix

* windows fix

* removed batch from ctx

* removed code duplication

* merge

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-08 23:20:01 +00:00
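
A sketch of the gradient-preserving `all_gather` that #5012 above describes, using only standard `torch.distributed` calls: gather in the forward pass, reduce and slice the gradient in the backward pass. It assumes an initialized process group; the function names are illustrative.

```python
# Sketch of an autograd-aware all_gather; assumes torch.distributed is initialized.
import torch
import torch.distributed as dist


class AllGatherGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor: torch.Tensor) -> torch.Tensor:
        gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, tensor)
        ctx.rank = dist.get_rank()
        return torch.stack(gathered)  # shape: (world_size, *tensor.shape)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        # Sum gradients across ranks, then hand back this rank's slice.
        grad_output = grad_output.contiguous()
        dist.all_reduce(grad_output, op=dist.ReduceOp.SUM)
        return grad_output[ctx.rank]


def all_gather_with_grads(tensor: torch.Tensor) -> torch.Tensor:
    return AllGatherGrad.apply(tensor)
```
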
Sean Naren ee9b3fe574
[feat] pp 1/n (#5016)
* Added changes for RPC plugin

* Add missing kwargs

* Fix code format

* Loading refactors by introducing is_distributed var, fix optimizer step flow

* Add rpc guard

* Added docstrings and typing

* resolve comments

* Add additional rpc hook, refactor name of exit process hook for clarity

* remove annotation

* Modify behaviour to allow optional return, add test for rpc plugin

* resolve tests

* rename is_ddp_based

* update

* update for windows

* update

* resolve test

* code smell

* Revert back to init_ddp_connection for backwards compat

* Swap to explicit name for property

* Add missing speed parity increase for CI variability, fix call counts for child process

Co-authored-by: tchaton <thomas@grid.ai>
2020-12-08 22:02:10 +00:00
chaton 2393474350
[hotfix] ddp + manual_optimisation (#4976)
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization

* debug

* Revert "debug"

This reverts commit ccca6b6b

* Expose manual reduce for automatic optimization

* Add input arguments

* Enable parity test

* clean imports

* Expose hook after to ensure we reset

* Fix naming

* add

* fix test

* resolve on comments

* typo

* Update tests/trainer/optimization/test_manual_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/optimization/test_manual_optimization.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* resolve comments

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-12-07 19:31:54 +00:00
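
The first bullet of #4976 above (rely on the DDP plugin for blocking sync behaviour) boils down to wrapping backward in `DistributedDataParallel.no_sync()` when gradient all-reduce should be skipped. A minimal sketch:

```python
# Minimal sketch: suppress DDP gradient all-reduce for a backward pass on demand.
from contextlib import contextmanager

from torch.nn.parallel import DistributedDataParallel


@contextmanager
def block_backward_sync(model, should_block: bool):
    """Skip gradient synchronization for backward passes run inside this context."""
    if should_block and isinstance(model, DistributedDataParallel):
        with model.no_sync():
            yield
    else:
        yield
```

A backward pass run inside this context skips the all-reduce; running the final backward outside it synchronizes the accumulated gradients.
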
chaton 02152c1729
Simplify optimization Logic (#4984)
* Rely on ddp plugin for blocking sync behaviour, and skip if we're using manual optimization

* debug

* Revert "debug"

This reverts commit ccca6b6b

* Expose manual reduce for automatic optimization

* Add input arguments

* Enable parity test

* clean imports

* Expose hook after to ensure we reset

* Fix naming

* add

* fix test

* uniformize optimizer logic

* resolve test

* resolve flake8

* resolve amp bug

* update tests

* remove bug

* remove optimizer_step in accelerators

* typo

* update lightning optimizer

* set doesn't work with ddp_spawn

* resolve flake8

* update threshold

* ignore pyright

* correct codeFactor

* remove useless if

* remove zero_grad function

* simplify step

* remove typo

* resolve bug

* Apply suggestions from code review

* update on comments

* resolve bugs

* remove tests

* Update pytorch_lightning/trainer/configuration_validator.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* simplify testing

* add more tests

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-12-07 12:55:49 +00:00
Lezwon Castelino 12cb9942a1
Tpu save (#4309)
* convert xla tensor to cpu before save

* move_to_cpu

* updated CHANGELOG.md

* added on_save to accelerators

* if accelerator is not None

* refactors

* change filename to run test

* run test_tpu_backend

* added xla_device_utils to tests

* added xla_device_utils to test

* removed tests

* Revert "added xla_device_utils to test"

This reverts commit 0c9316bb

* fixed pep

* increase timeout and print traceback

* lazy check tpu exists

* increased timeout
removed barrier for tpu during test
reduced epochs

* fixed torch_xla imports

* fix tests

* define xla utils

* fix test

* aval

* chlog

* docs

* aval

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-02 13:05:11 +00:00
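
The core of Tpu save (#4309) above is moving XLA tensors to CPU before writing the checkpoint. A hedged sketch of a `move_to_cpu` helper in plain PyTorch terms (the real helper name and location are not guaranteed):

```python
# Hypothetical helper: recursively move tensors (including XLA tensors) to CPU
# so the checkpoint can be written with torch.save and loaded anywhere.
import torch


def move_to_cpu(obj):
    if torch.is_tensor(obj):
        return obj.cpu()
    if isinstance(obj, dict):
        return {k: move_to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [move_to_cpu(v) for v in obj]
    if isinstance(obj, tuple):
        return tuple(move_to_cpu(v) for v in obj)
    return obj


def save_checkpoint(checkpoint: dict, path: str) -> None:
    torch.save(move_to_cpu(checkpoint), path)
```
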
Sean Naren 404af43cde
5/n: Extract reference model call to plugins/accelerators (#4773)
* Encapsulate extracting reference model within the plugin to allow custom wrapper logic to live within the plugin/accelerators

* Add missing new lines

* Fix call to accelerator

* Removed double blank

* Use accelerator backend

* Handle case where wrapper has not been initialized within the plugin

* Added basic get model tests, add better typing

* Change model name

* Split GPU/DDP test

* Add stronger typing, skip ddp test on windows

* Fix import

* Fix import in dp

* Fixed PEP8 definition

* Add ddp launcher for ddp testing

* Modify accelerator reference model to property, change name to reflect func

* Revert property as this is incorrect.

* Revert across accelerators

* Modified name to get_model_from_plugin

* Code review changes, fix issue with dp

* Add verb to function getter

Co-authored-by: chaton <thomas@grid.ai>
2020-11-23 17:21:47 +00:00
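
A minimal sketch of the "get model from plugin" idea in #4773 above: unwrap whatever (data-)parallel wrapper is in use so callers always receive the raw user module. The function name is illustrative.

```python
# Hypothetical helper: always hand callers the raw user module, whatever wrapper
# the current plugin/accelerator put around it.
import torch.nn as nn
from torch.nn.parallel import DataParallel, DistributedDataParallel


def get_model_from_plugin(model: nn.Module) -> nn.Module:
    if isinstance(model, (DataParallel, DistributedDataParallel)):
        return model.module
    return model
```
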
ananthsub 45c57600af
Move init_ddp_connection to DDP Plugin (#4407)
* Move init_ddp_connection to DDP Plugin

* cluster-env

* trainer?

* imports

* Update ddp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-18 15:49:22 -05:00
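
A hedged sketch of an `init_ddp_connection` hook like the one moved in #4407 above, built on the standard `torch.distributed` env:// rendezvous; the cluster environment is assumed to provide `MASTER_ADDR`/`MASTER_PORT`, and the defaults below are placeholders.

```python
# Hypothetical hook: standard env:// rendezvous; MASTER_ADDR/MASTER_PORT normally
# come from the cluster environment, the defaults below are placeholders.
import os

import torch.distributed as dist


def init_ddp_connection(global_rank: int, world_size: int, backend: str = "nccl") -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    if not dist.is_initialized():
        dist.init_process_group(backend, rank=global_rank, world_size=world_size)
```
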
Sean Naren e7134a9135
Sharded Plugin 2/n: Allow ddp plugin to modify optimizer state saving (#4675)
* Allow ddp plugin to modify optimizer state saving

* Rely on the accelerator for optimizer states

* Ensure we init the accelerator for the saving function

* Better comment for optim state dump

* Revert "Ensure we init the accelerator for the saving function"

This reverts commit af65effa

* Added accelerator check to initialize tuner before saving model checkpoint

* Simplify comment

* Revert "Added accelerator check to initialize tuner before saving model checkpoint"

This reverts commit f9929c0c

* Return single optimizer state to reduce duplication

* Fixed docstring

* Fixed typing

* Fixed comment

* Added CHANGELOG.md

Co-authored-by: chaton <thomas@grid.ai>
2020-11-18 16:38:35 +00:00
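
A minimal sketch of the hook described in #4675 above: the plugin is given each optimizer and returns the state dict to checkpoint, so a sharded plugin can consolidate state across ranks by overriding a single method. Names are illustrative.

```python
# Hypothetical hook: the plugin decides what optimizer state gets checkpointed.
from typing import Any, Dict, List

from torch.optim import Optimizer


class DDPPluginSketch:
    def optimizer_state(self, optimizer: Optimizer) -> Dict[str, Any]:
        # Default: plain state dict. A sharded plugin would consolidate
        # state from all ranks here before returning it.
        return optimizer.state_dict()


def dump_optimizer_states(plugin: DDPPluginSketch, optimizers: List[Optimizer]) -> List[Dict[str, Any]]:
    return [plugin.optimizer_state(opt) for opt in optimizers]
```
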
chaton 4018237c30
[FEAT] Add lambda closure to manual_optimizer_step (#4618)
* added lambda_closure

* move to types

* add 2 new tests

* make example more complex

* add complex example to doc

* added more tests

* resolve doc

* typo

* update

* update tpu optimizer_step

* Apply suggestions from code review

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-12 19:22:06 +00:00
Sean Naren bacabaebaf
Sharded Accelerator 1/n: Expose clip gradients to plugins via abstract class (#4639)
* Added abstract precision plugin to expose clip_gradients function, use within accelerator to clip gradients

* Exclude model from override, keep optimizer (needed for sharded clip gradients), add override for apex O2 support

* Fix doc

* Applied codereview changes

* Refactored clip function to encapsulate tpu changes with tpu accelerator. Default to standard clip function for vanilla torch

* Pass correct grad clip val

* Moved var to property

* Apply code review suggestions
2020-11-12 17:18:09 +00:00
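
A hedged sketch of the abstract `clip_gradients` hook from #4639 above: the base implementation uses vanilla torch clipping over the optimizer's parameters, and specialised plugins (apex O2, sharded optimizers, TPU) override it. The class name is illustrative.

```python
# Hypothetical base hook: vanilla global-norm clipping over the optimizer's params.
import torch
from torch.optim import Optimizer


class PrecisionPluginSketch:
    def clip_gradients(self, optimizer: Optimizer, clip_val: float) -> None:
        if clip_val <= 0:
            return
        params = [p for group in optimizer.param_groups for p in group["params"]]
        torch.nn.utils.clip_grad_norm_(params, max_norm=clip_val)
```
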
chaton 7e08b0d710
[bug-fix] DDP and automatic_optimization=False (#4485)
* resolve bug

* add self._running_manual_optim

* update

* update tests

* update lightning module

* resolve bug

* update tests

* update

* resolve pep8

* update

* replace by `ddp_spawn`

* temporary fix

* update

* update

* move update to training_loop

* make both ddp_spawn

* introduce `manual_optimizer_step`

* update changelog

* added changelog wrong place

* add force_optimizer_step

* update docstring for tests

* update optimizer_step

* update zero_grad

* resolve flake8

* move update into manual_optimizer_step

* add zero_grad

* remove zero_grad tests

* remove manual_backward in AMP, it doesn't help

* update

* loosen tests

* update

* update doc

* add TODO

* Removed unnecessary get model from native amp

* Remove try except with pytest raise

* Add seed, clean up imports, remove try catch to reproduce error

* update code

* update test

* revert back

* formatting

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-10 19:44:51 +00:00
William Falcon ee35907170
Accelerator docs (#4583)
* accelerator docs

* accelerator docs
2020-11-08 17:24:41 -05:00
Travis Addair 51cc7a89ee
Horovod: fixed early stopping and added metrics aggregation (#3775)
* Fixed early stopping for Horovod

* Refactored to sync_dist_if_available

* Bump min Horovod version to support hvd.is_initialized

* Changelog

* Added back change for Horovod

* Removed redundant checks for initialization

* Implement metrics gathering for Horovod

* Added test for EvalResult

* Renamed ddp_sync_on_step -> dist_sync_on_step

* Added metric test for Horovod

* Added option to pass callable allgather function to metric base class

* Added dist_sync_fn

* Fixed calls to private _sync_dist

* Fixed Horovod test

* Added sync_tensor to the distributed backend

* Skip Windows

* Insert test path

* Removed redundant import

* Updated drone

* Unset HOROVOD_GPU_ALLREDUCE

* Unset

* No cache dir

* No uninstall

* Unset variables

* Uninstall Horovod during initialization

* Replaced more references to ddp_sync_on_step

* Fixed imports

* Fixed attribute

* Added back default

* Lint

* Added back docstring

* Made gather_all_tensors default

* Added whitespace

* Update tests/models/test_horovod.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/metrics/metric.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update CHANGELOG.md

Co-authored-by: Teddy Koker <teddy.koker@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-05 12:52:02 -05:00
Ananya Harsh Jha 01ab2a933d
[bug] [docs] Clearer optimizer_step override instructions (#4455)
* fix

* flags

* remove defaults
2020-11-02 22:13:34 +00:00
chaton 102fa9ee7d
[BUGFIX] AMP + Precision unscale grad (#4441)
* move unscale within Native plugin

* remove gradient tracking from lightning backward

* forgot trainer.fit

* typo

* update

* cleanup

* set to 1.6

* typo

* skip if below 1.6 strict

* update changelog

* remove useless code

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* update changelog

* Update CHANGELOG.md

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-11-02 16:36:48 +00:00
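
The fix in #4441 above moves gradient unscaling into the native AMP plugin so it happens exactly once, before clipping. A sketch using only `torch.cuda.amp.GradScaler` calls:

```python
# Sketch: unscale exactly once inside the plugin, clip, then step/update the scaler.
import torch
from torch.cuda.amp import GradScaler


def amp_optimizer_step(scaler: GradScaler, optimizer: torch.optim.Optimizer,
                       parameters, clip_val: float) -> None:
    scaler.unscale_(optimizer)            # bring grads back to fp32 scale
    if clip_val > 0:
        torch.nn.utils.clip_grad_norm_(parameters, max_norm=clip_val)
    scaler.step(optimizer)                # skips the step if infs/NaNs were found
    scaler.update()
```
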
Sean Naren 5641b266d5
Bug/4319 ddp checkpoint (#4323)
* Broadcast best model path to ensure we sync with main process + wait for main process to save

* Add barrier call to ensure all processes are in sync

* Added changelog commit

* Move sync of best model path/score to model checkpoint, keep barrier to ensure all processes complete

* Ensure we broadcast as tuple

* Add init check

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Removed model checkpoint code, added barrier to trainer to ensure we synchronize and wait for all processes to finish before completing training

* Add barrier within teardown call, removed horovod teardown to inherit from base accelerator

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-10-24 16:55:49 -04:00
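
A sketch of the synchronization described in #4323 above: rank 0's best checkpoint path/score are broadcast to every process and a barrier keeps ranks from exiting early. `broadcast_object_list` requires a reasonably recent PyTorch and an initialized process group; the function name below is illustrative.

```python
# Sketch: rank 0's best path/score are broadcast as a tuple-like list, then all
# ranks hit a barrier. Assumes an initialized process group.
import torch.distributed as dist


def sync_best_checkpoint(best_model_path: str, best_model_score: float):
    payload = [best_model_path, best_model_score]
    dist.broadcast_object_list(payload, src=0)  # every rank receives rank 0's values
    dist.barrier()                              # wait until all ranks are in sync
    return payload[0], payload[1]
```
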
William Falcon 753362d0a4
enable ddp as a plugin (#4285)
* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

* enable custom ddp plugin

Co-authored-by: chaton <thomas@grid.ai>
2020-10-22 05:15:51 -04:00
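
A usage sketch for the custom DDP plugin enabled by #4285 above. The import path, hook name, and Trainer arguments reflect that era of the library as best remembered and should be treated as assumptions rather than a verified API.

```python
# Assumed-era usage sketch; import path and hook signature are best-effort, not verified.
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.ddp_plugin import DDPPlugin


class MyDDPPlugin(DDPPlugin):
    def configure_ddp(self, model, device_ids):
        # Customize how DistributedDataParallel wraps the model here.
        return super().configure_ddp(model, device_ids)


trainer = Trainer(gpus=2, accelerator="ddp", plugins=[MyDDPPlugin()])
```
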
Justus Schock 0ec4107697
Optimizer closure (#4190)
* closure for all optimizers

* rename hook and take care of alternating backwards

* add comment

* training_loop_fix

* closure whenever possible

* training_loop

* simple tests that count backward calls

* fix test to work with closure

* remove debugging statement

* better place

* check grads after backward

* start fixing manual optimization

* skip step when result returned by closure was None

* fix gradient clipping test to work with closure

* attribute dict result only for automatic optimization

* adjust backward calls in accelerator

* adjust where to call gradient clipping

* adjust backward calls in tests

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* pass kwargs to xla optimizer

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-21 19:34:29 +01:00
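
A minimal sketch of the "closure whenever possible" change in #4190 above, in plain PyTorch terms: forward, zero_grad and backward are packed into a closure and `optimizer.step()` drives it, which also keeps optimizers such as LBFGS working.

```python
# Plain-PyTorch sketch of a training step driven by a closure.
import torch


def run_training_batch(model, batch, loss_fn, optimizer: torch.optim.Optimizer):
    def train_step_closure():
        optimizer.zero_grad()
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()
        return loss

    # Every built-in torch optimizer accepts an optional closure argument;
    # LBFGS may evaluate it several times per step.
    return optimizer.step(train_step_closure)
```
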
Akihiro Nitta b45b57cc58
Use `Optional` for arguments set to `None` by default (#4164)
* Use `Optional` for variables set to `None` by default

* Use `Optional` instead of `Union[None, ...]` for consistency
2020-10-15 23:02:50 +02:00
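
A one-line illustration of the typing convention adopted in #4164 above.

```python
from typing import Optional


def configure(device: Optional[str] = None) -> str:
    # Prefer Optional[str] over Union[None, str] when the default is None.
    return device or "cpu"
```
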
Sean Naren 98eb736496
Added getstate/setstate method for torch.save serialization (#4127)
* Added getstate/setstate method for torch.save serialization, added additional Optional Typing to results object

* Added tests to ensure torch.save does not fail

* Added flags to ensure compatible ddp cpu environment

* Removed torch version check due to minimum already being 1.3, reduced epochs for speed

* Moved tests to separate file

* Update to accelerator, move to ddp_spawn to prevent hanging ddp
2020-10-13 16:47:23 -04:00
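
A hedged sketch of the `__getstate__`/`__setstate__` approach from #4127 above: drop attributes that pickle cannot handle and restore them on load, so `torch.save` on the results object does not fail. The class here is a stand-in, not the library's actual results object.

```python
# Stand-in class showing the pattern; a lock is a typical unpicklable attribute.
import threading

import torch


class ResultsLike:
    def __init__(self) -> None:
        self.value = 0.0
        self._lock = threading.Lock()      # cannot be pickled

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_lock", None)           # exclude what pickle cannot handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()      # recreate on load


if __name__ == "__main__":
    torch.save(ResultsLike(), "results.pt")  # now serializes without error
```
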
William Falcon 09c2020a93
notices (#4118) 2020-10-13 07:18:07 -04:00
William Falcon b9f2682b7d
clean docs, enable grad clip in manual mode (#4078)
* docs

* docs
2020-10-11 13:12:35 -04:00
William Falcon 7ffe05a3d1
ref: accelerator names (#4066)
* ref: accelerator names

* docs
2020-10-11 01:05:14 -04:00