Commit Graph

81 Commits

Author SHA1 Message Date
Sean Naren bac8b1be81
Add support for CPU AMP autocast (#9084) 2021-08-25 12:18:00 +00:00
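
For reference, the mechanism behind this change is PyTorch's CPU autocast context. A minimal plain-torch sketch (assuming a PyTorch build that ships `torch.autocast`, i.e. 1.10+), not the Lightning plugin itself:

```python
import torch

model = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)

# Run the forward pass under bfloat16 autocast on CPU; the native AMP plugin
# wraps the training/eval step in a context of this kind when the Trainer
# runs on the CPU accelerator with mixed precision enabled.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16 for autocast-eligible ops such as nn.Linear
```
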
Sean Naren e9f4bffe0a
Add validate logic for precision (#9080) 2021-08-24 20:00:09 +00:00
ananthsub 1e4d8929fb
Simplify checkpoint connector loading after Checkpoint IO plugin's introduction (#9045)
* Simplify checkpoint connector loading after Checkpoint IO plugin's introduction

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-08-23 13:12:18 -07:00
Carlos Mocholí 93ab24d1ee
Replace DataLoader sampler once for IPUs (#8858) 2021-08-16 11:28:05 +02:00
Sean Naren b2973a035e
Introduce CheckpointIO Plugin (#8743) 2021-08-13 17:35:31 +01:00
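
For context, the plugin introduced here separates how checkpoints are serialized from the training type plugin that decides when and where to write them; custom behaviour comes from subclassing `CheckpointIO` and overriding `save_checkpoint`/`load_checkpoint`. A minimal sketch using the built-in torch-based implementation, assuming it is handed to the Trainer through `plugins` as in this release:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import TorchCheckpointIO

# TorchCheckpointIO is the default torch.save/torch.load implementation;
# a custom CheckpointIO subclass can be swapped in the same way.
trainer = Trainer(plugins=[TorchCheckpointIO()])
```
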
Carlos Mocholí ed13040729
Connect the model to the training type plugin at the start of run (#8536) 2021-08-04 17:43:34 +02:00
samlurye f90849cc95
Deprecate LightningModule.summarize() in favor of pl.utilities.model_summary.summarize() (#8513)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-08-03 22:08:51 +00:00
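
The replacement named in the title lives in `pytorch_lightning.utilities.model_summary`; a minimal sketch:

```python
import torch.nn as nn
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.model_summary import summarize


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)


# replaces the deprecated LitModel().summarize()
print(summarize(LitModel()))
```
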
Kaushik B d01d8334b5
Fix `ddp` accelerator choice for CPU (#8645)
* Fix ddp accelerator choice for CPU
2021-08-02 21:24:07 +00:00
Kaushik B 850416f0a0
Fix distributed types support for CPUs (#8667) 2021-08-02 16:42:28 +05:30
Sean Naren 07b7dc9c17
[Fix] Add delay property for checkpointing, refactor loading checkpoint (DeepSpeed Checkpointing Fix 1/n) (#8627)
* Add property to delay checkpointing, move loading checkpoint file into the run function to allow deepspeed engine to be loaded

* Add a small test

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Address review

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-30 11:31:08 +01:00
Kaushik B f457f97d97
Raise exception for ddp_cpu not supported for TPUs (#8530) 2021-07-26 13:52:34 +00:00
Carlos Mocholí a64cc37394
Replace `yapf` with `black` (#7783)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 13:37:35 +02:00
Kaushik B 5452590872
fix: Enable manual optimization for TPUs (#8458) 2021-07-22 15:33:35 +05:30
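
Manual optimization, which this fix enables on TPUs, is driven entirely from the LightningModule and looks the same on any accelerator. A minimal sketch:

```python
import torch
import torch.nn as nn
from pytorch_lightning import LightningModule


class ManualOptModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt in to manual optimization
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.layer(batch).sum()
        opt.zero_grad()
        self.manual_backward(loss)  # replaces loss.backward() so precision/plugins still apply
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```
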
Kaushik B 556879e5cf
Add support for devices flag to Trainer (#8440)
* Support devices flag to Trainer
2021-07-20 04:33:12 +00:00
Kaushik B 825c5dbe8c
Add support for (accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto') (#7808)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-07-09 15:28:54 +00:00
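
Together with the `devices` flag added in #8440 above, this gives the accelerator-agnostic spelling of the Trainer arguments; a minimal sketch (values are illustrative):

```python
from pytorch_lightning import Trainer

# "auto" picks the best available backend; "cpu"/"gpu"/"tpu"/"ipu" force one.
# `devices` is interpreted per accelerator (process count, GPU count, TPU cores, ...).
trainer = Trainer(accelerator="auto", devices=2)
```
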
Sean Naren 6d558961e3
[IPU] Allow poptorch.Options to override Trainer (#8233)
* Add test for poptorch Options

* Hacks to get manual plugin support

* Revert changes

* Fix tests + ensure logic follow suit

* Update pytorch_lightning/plugins/training_type/ipu.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Cleaner

* Cleaner

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-05 13:42:00 +00:00
Sean Naren 07b1ce227c
[IPU] Fix Custom Poptorch options to IPUPlugin (#8241)
* Fixes to ensure ipu options are respected

* Better setter

* Add test for poptorch Options

* Fix test

* fix ipu test

* Update pytorch_lightning/plugins/training_type/ipu.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-02 11:23:57 +00:00
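
The override this fixes is passed through the plugin; a rough sketch, assuming the `IPUPlugin(training_opts=..., inference_opts=...)` signature of this release and illustrative poptorch settings:

```python
import poptorch
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import IPUPlugin

# Custom poptorch.Options take precedence over what the Trainer would otherwise
# derive from its own flags (device iterations, replication, ...).
training_opts = poptorch.Options()
training_opts.deviceIterations(20)

inference_opts = poptorch.Options()
inference_opts.deviceIterations(4)

trainer = Trainer(
    ipus=8,
    plugins=IPUPlugin(training_opts=training_opts, inference_opts=inference_opts),
)
```
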
Adrian Wälchli e7139ab9f7
Support `DDPPlugin` to be used on CPU (#6208)
* Skip test due to 'Python bus error'

* Debug NCCL

* Remove NCCL_DEBUG statement

* Revert "Skip test due to 'Python bus error'"

This reverts commit e0a3e8785d.

* fix

* add test

* changelog

* yapf

* patch os environ

* make a special test

* destroy pg

* debug

* revert

* revert

* problematic test

* skip

* try the fixture

* test

* update sensitive test

* update changelog

* remove comment

* update wrong test

* update test name

* parameterization

* Revert "parameterization"

This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.

* remove conftest

* ignore test

* teardown

* fix merge

* deep speed parameterization

* uncomment test

* update chlog

* update changelog

* split tests

* update test


* update test

* update test

* update test

* update test comments

* unroll test

* unroll test

* unroll test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* increase shm

* sudo

* unroll ipu

* Revert "sudo"

This reverts commit 6cc68c1478.

* Revert "increase shm"

This reverts commit 8c27163483.

* x

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* find guilty test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* POPTORCH_WAIT_FOR_IPU=1

* move test

* redo parameterize for ipu

* de-comment test

* move chlog

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Update tests/accelerators/test_accelerator_connector.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-07-02 12:00:24 +01:00
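
A sketch of the combination this enables, assuming the Trainer flags of this release (`accelerator="ddp_cpu"`, `num_processes`; newer versions spell this via `strategy`/`devices`):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# Run true multi-process DDP between CPU processes instead of falling back
# to the spawn-based plugin.
trainer = Trainer(
    accelerator="ddp_cpu",
    num_processes=2,
    plugins=DDPPlugin(find_unused_parameters=False),
)
```
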
nisheethlahoti 06f8349291
Support calling fit and test scripts using "python -m" module syntax with DDP (#8073)
Co-authored-by: Nisheeth Lahoti <nisheeth@rephrase.ai>
2021-06-23 02:42:04 +00:00
Andrew Tritt e808f9fb28
Use DistributedSampler when running with custom accelerator (#7814)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-06-18 14:34:05 +02:00
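
For reference, this is the sampler Lightning now injects when a custom accelerator runs distributed training and no sampler was set; a plain-torch sketch (iterating it requires an initialized process group):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.randn(100, 16))

# Each rank sees a disjoint shard of the dataset; Lightning wires this up
# automatically unless a custom sampler is already attached to the DataLoader.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
```
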
Sean Naren 024cf23c67
Remove convert_to_half, suggest using `model.half` (#7974) 2021-06-14 18:48:02 +01:00
Sean Naren 96433d03ea
IPU Integration 5/5 (#7867)
* Initial changes

* Add broken example for now

* Fix reference

* Fix format

* Code runs

* Fixes

* Clear up files

* Add tests, helpers, fixes

* Small cleanups

* Refactors based on review

* Swap to special tests

* Add special tests

* Add source

* Cleanups

* Add logic to attach/detach model from devices

* Fixes for tests

* Fixes for tests

* Move earlier

* Cleanups

* Add check for nvcc

* Add tests, cleanups

* Fix errors

* fix

* Try condition

* Add missing annotation

* Clearer

* Clearer message

* Fix variable

* Cleanups

* Add comment

* CHANGELOG.md

* Add simple selection test

* Remove special=True to see what happens

* Fix test

* Update tests/accelerators/test_ipu.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Convert ipu_cores -> ipus

* Add typing, fail earlier

* simplify precision

* Add test, add helper

* fix accum

* Update pytorch_lightning/plugins/training_type/ipu.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* Use stages

* Make sure warning message returned

* throw error

* Add more tests, use fs

* add comment

* Clean

* Address feedback, add IPU tests

* Fixes

* Fix signature

* Add types

* Remove autoround

* Add docstring

* ipu_cores -> ipus

* Add test, remove unnecessary precision set

* Add optimizer test

* Add precision back with test

* Address code review

* Change to probs

* Move some of the asserts earlier

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-06-11 15:07:04 +00:00
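
The user-facing result of the integration (and of the `ipu_cores -> ipus` rename in the bullets above) is a single Trainer flag; a minimal sketch with an illustrative core count:

```python
from pytorch_lightning import Trainer

# Select the IPU accelerator by requesting IPUs directly; precision=16 uses
# the IPU's native half-precision support.
trainer = Trainer(ipus=8, precision=16)
```
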
Sean Naren 6388c29e87
[IPU] Add reset dataloader hooks to training type plugin 3/n (#7861)
* Add hooks

* Add tests for hooks

* Add changelog

* Test changes, add typing
2021-06-07 10:37:09 +00:00
Ethan Harris 03bb389b21
Fix double precision + ddp_spawn (#6924)
* Initial fix

* Initial fix

* Initial fix

* Updates

* Updates

* Update typing and docs

* Undo accidental refactor

* Remove unused imports

* Add DDP double precision test

* Remove unused variable

* Update CHANGELOG.md

* Fix test

* Update tests

* Formatting

* Revert bad change

* Add back changes

* Correct wrapping order

* Improve unwrapping

* Correct wrapping order

* Fix... finally

* Respond to comments

* Drop ddp test

* Simplify ddp spawn test

* Simplify ddp spawn test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-06-01 15:21:17 +00:00
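
The combination being fixed, spelled with the flags of this release (`accelerator="ddp_spawn"`; newer versions use `strategy`), as a sketch:

```python
from pytorch_lightning import Trainer

# Full float64 training together with the spawn-based DDP plugin, the case
# this PR fixes (model, inputs and the wrapped forward stay in double precision).
trainer = Trainer(precision=64, accelerator="ddp_spawn", num_processes=2)
```
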
Carlos Mocholí d47173bb72
Use typing forward references (#7770)
* Use typing forward references

* Update pytorch_lightning/core/lightning.py
2021-05-31 09:54:28 +02:00
Carlos Mocholí a69beab499
Clean existing logging tests (#7760)
* Remove dev debugger metric tracking

* Fix tests

* Fix test

* Import

* Clean logging tests

* flake8

* Docstring
2021-05-30 16:36:52 +02:00
Carlos Mocholí bc3238be8c
Remove metric tracking from dev debugger (#7759)
* Remove dev debugger metric tracking

* Fix tests

* Fix test

* Import

* Fix tests

* Fix test

* flake8

* Fix tests
2021-05-30 12:03:42 +02:00
Kaushik B 3f460b150a
Move parameter validation specific to TPU Training plugins (#7415)
* Move parameter validation specific to TPU Training plugins

* update docstring
2021-05-24 16:02:01 +00:00
Adrian Wälchli 502adbced3
refactor optimizer loop logic for manual and automatic optimization (#7526)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-05-17 14:42:01 +02:00
Kaushik B bf46730d92
Support TPU Pod Training (n/n) (#7296) 2021-05-17 11:33:44 +00:00
Nic Eggert f4f51e0dcf
Add kubeflow cluster environment (#7300)
* Add kubeflow cluster environment

* Add KubeflowEnvironment to docs

* Add KubeflowEnvironment to the changelog

* break up a long line

* Add method to detect kubeflow environment

* Select Kubeflow environment when available

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Run pre-commit

* task_idx == 0

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 09:05:24 +01:00
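
The new environment is selected automatically when detection succeeds; it can also be pinned explicitly. A sketch, assuming the cluster environment is passed via `plugins` as other environments are:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import KubeflowEnvironment

# Force the Kubeflow environment instead of relying on auto-detection;
# rank and world size are then read from the variables a PyTorchJob sets.
trainer = Trainer(accelerator="ddp", gpus=1, num_nodes=2, plugins=[KubeflowEnvironment()])
```
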
Alan Du 6ac16ff348
Fix DistribType for `ddp_cpu` (spawn) (#7492) 2021-05-14 20:53:26 +01:00
Jirka Borovec d4ec75164c
Prune deprecated trainer attributes (#7501)
* use_single_gpu

* use_horovod

* use_ddp2

* use_ddp

* use_dp

* on_gpu

* use_tpu

* on_tpu

* on_cpu

* cleaning

* chlog

* Apply suggestions from code review

* Apply suggestions from code review
2021-05-12 20:10:15 +00:00
Ethan Harris 45143fd825
Improve val step logging (#7351)
* Fix val step logging

* Add a type

* Fix

* Update CHANGELOG.md
2021-05-07 22:58:03 +00:00
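
The logging call affected here is `self.log` inside `validation_step`; a minimal sketch of the intended usage:

```python
import torch.nn as nn
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # aggregated over the validation epoch before being logged
        self.log("val_loss", loss, on_step=False, on_epoch=True, prog_bar=True)
```
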
Martin Kristiansen c3fc0313ef
Updating docs and error message: half precision not available on CPU (#7384)
* Updating docs and error message to specify that half precision is not available on CPU

* update messages

Co-authored-by: Martin Kristiansen <martinkristiansen@sixgill.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-06 09:05:50 +00:00
Carlos Mocholí 8c0ea92af2
`TrainerState` refactor [5/5] (#7173)
* `TrainerState` refactor

* flake8

* Update finished check

* Test cleanup

* Fix tests

* Fixes

* Reorder

* flake8

* Update CHANGELOG

* Better docs

* Better docs

* Remove default

* Update tests

* Bad merge
2021-05-04 12:50:56 +02:00
Carlos Mocholí 40f80230fe
Remove `trainer.fit` return value [2/n] (#7237)
* `_fit_impl` refactor and types

* Fix return

* Remove return docstring

* Fixes

* Fixes

* Remove `trainer.fit` return value

* Update CHANGELOG

* flake8

* Undo results change

* Fix test

* Revert changes for a separate PR

* flake8
2021-04-28 19:11:32 +01:00
thomas chaton 5a113a2f05
[bug/feat] Support parameters_to_ignore in DDP (#7239)
* update

* update

* update

* update on comments

* update
2021-04-27 17:49:32 +00:00
Kaushik B f168a535ca
Add MpModelWrapper in TPU Spawn (#7045)
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-20 13:05:27 +00:00
thomas chaton 3cc0b2c063
[test] Add checks for gpus=1 (#7105)
* update

* remove cluster env
2021-04-19 20:39:28 +02:00
Adrian Wälchli e9fca760ac
Set `DistributedSampler` seed if `seed_everything` was called (#7024)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-04-19 14:50:31 +01:00
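
The behaviour added here ties two existing APIs together; a sketch of the roughly equivalent manual wiring (explicit `num_replicas`/`rank` so it runs without a process group):

```python
import torch
from pytorch_lightning import seed_everything
from torch.utils.data import DistributedSampler, TensorDataset

seed = seed_everything(42)  # returns the seed that was set
dataset = TensorDataset(torch.randn(100, 16))

# After this change, Lightning builds its samplers with that seed, roughly:
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=seed)
```
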
Adrian Wälchli 33cc9fe138
Clean up environment access in plugins (#6941)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-13 20:07:40 +02:00
Kaushik B f79a13e495
[Model Parallel] Add configure sharded model hook (#6679)
* Add base hook for model parallel

* fix callback signature

* Simplify hook

* Add hook logic

* add tests

* add property setter

* add logic for being called once

* Update changelog

* Fix

* fix return type

* fix lambda callback test

* Fix tests

* Apply code suggestions

* add logic for setup_optimizers_predispatch

* add common dummy model

* Swap call order

* Remove test that isn't needed anymore

* Update tests

* Add a bit more doc

* Few code review fixes

* Update pytorch_lightning/accelerators/accelerator.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Change hook name

* Fix test

* Test setup hook, refactor names

* Swap call order of callbacks and model initialization

* Change name of context manager

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-29 14:50:51 -06:00
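
The hook added here is overridden on the LightningModule so that large layers can be instantiated once the model-parallel context is active; a minimal sketch:

```python
import torch.nn as nn
from pytorch_lightning import LightningModule


class ShardedModel(LightningModule):
    def configure_sharded_model(self):
        # Called by the accelerator once the sharding context is set up, so
        # these layers can be created directly sharded / on the target device.
        self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
```
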
Carlos Mocholí 21fc5eb21e
Automatically find and run special tests (#6669) 2021-03-26 17:04:59 +00:00
thomas chaton fd5cb7fcc3
Add PyTorch 1.8 Profiler 5/5 (#6618)
* Refactor profilers

* Update PassThrough

* WIP - This is broken and will change

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* resolve tests

* resolve tests

* find output

* try something

* update

* add support for test and predict

* update

* update

* use getattr

* test

* test

* update

* tests

* update

* update

* update

* update

* update

* remove file

* update

* update

* update

* update

* update

* test

* update

* update

* update tests

* update

* add support for 1.8

* rename records

* add support for 1.8

* update

* resolve flake8

* resolve test

* Refactor basic profilers

* Fixes

* Unused import

* Introduce setup

* Profile on all ranks. Print to stdout on 0

* Introduce dirpath + filename

* CHANGELOG

* Add tests. Address comments

* add `on_run_stage_setup`

* add on_run_stage_setup function

* update

* add test for RegisterRecordFunction

* update lightning flow direction

* move variable to private

* remove trace

* Undo code that should be in 3/4

* Multi-stage multi-rank

* 2/5 changes

* Pass stage in __del__

* Remove TODOs

* Describe on_evaluation_end. Add tests

* Typo

* Address comments

* deepcopy tests

* Advanced teardown

* Fix teardown test

* Fix tests

* Minor change

* Update CHANGELOG.md

* Fix test

* Quick fixes

* Fix 6522

* resolve ddp tests

* resolve tests

* resolve some tests

* update tests

* resolve tests

* update

* resolve tests

* resolve some tests

* Missed fixes from 3/5

* Fixes

* resolve some tests

* resolve test for 1.7.1

* Broken refactor

* Missed stage

* Minor changes

* resolve tests

* Update CHANGELOG

* resolve bug

* remove print

* Typo

* Cleanup

* resolve ddp test

* remove barrier

* update profiler

* update

* Smaller model

* update

* resolve tests

* update

* Minor changes. CHANGELOG

* Minimize diff

* update to 1.8.1

* RunIf. Extra code. Check segfault

* resolve tests

* Typo. Bad merge

* Fixing a bad merge

* replace for kineto

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Minor changes

* Bad merge

* Use lists for flexibility

* Use sets

* predict_step

* Ananth's suggestion

* update

* Docs

* Update pl_examples/basic_examples/profiler_example.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update example

* update example

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-23 20:43:21 +00:00
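
The `dirpath` + `filename` arguments mentioned in the bullets are how the refactored profiler is configured; a sketch (the string shortcut `profiler="pytorch"` also works):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import PyTorchProfiler

# Profile on every rank and write one report per rank into ./profiling/.
profiler = PyTorchProfiler(dirpath="profiling", filename="perf_logs")
trainer = Trainer(profiler=profiler)
```
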
Jirka Borovec efce2b7777
Prune metrics: regression 8/n (#6636)
* explained_variance

* tests

* mean_absolute_error

* mean_squared_error

* mean_relative_error

* mean_squared_log_error

* chlog
2021-03-23 09:35:51 +01:00
Sean Naren 58c9fa7edb
Allow training type plugin to delay optimizer creation (FSDP 2/n) (#6331)
* Allow training_type_plugin to delay optimizer configuration

* Add missing references to trainer, add a CPU accelerator based test
2021-03-22 11:43:53 +00:00
Sean Naren 4e9b453854
[Fix] Move init dist connection into the setup function (#6506)
* Move connection setup into the setup function. Call setup hook after we set up the accelerator

* Added CHANGELOG.md

* fix setup order in callback test

* fix input arguments in test

* Mock distributed function, remove protection to turn into training type hook

* Remove import

* Add missing mock, ensure custom plugin does not create children process

* Skip test on windows

* Update deepspeed to init connection in setup

* Do not initialize distributed module

* Move DeepSpeed tests to special tests since dist communication is being set up

* Special the test to see if this fixes CI

* Delete accelerator connector test to see if its causing build to fail

* Delete deepspeed test

* Revert "Delete accelerator connector test to see if its causing build to fail"

This reverts commit edde60b8

* Revert "Delete deepspeed test"

This reverts commit 9d317429

* Reverse hook

* Reverse setup hooks to debug again

* Add todo so I know where I left off

* For single device move in pre_dispatch after setup function

* Add additional model to device hook if any additional parameters have been set

* See if we can enable deepspeed tests

* Revert "See if we can enable deepspeed tests"

This reverts commit b5450def

* See if this hook approach works

* Introduce new granular hooks

* Remove import, fix tpu spawn by moving the function to setup

* Added missing special test

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-18 14:33:39 -07:00
Jirka Borovec b341b53f70
deprecate metrics pkg (#6505)
* deprecate metrics

* examples

* req

* docs

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* pep8

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2021-03-15 14:39:38 +00:00
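
The deprecation points users at the standalone torchmetrics package; a minimal sketch of the replacement import (newer torchmetrics releases additionally require a `task=...` argument on `Accuracy`):

```python
import torch
import torchmetrics

# was: from pytorch_lightning import metrics; metrics.Accuracy()
accuracy = torchmetrics.Accuracy()

preds = torch.tensor([0, 1, 1, 0])
target = torch.tensor([0, 1, 0, 0])
print(accuracy(preds, target))  # tensor(0.7500)
```
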
Rohit Gupta c53edce1a1
Disable batch transfer in DP mode (#6098)
* add exceptions and test

* hook

* fix

* clean up

* clean up

* regex

* regex

* docs

* rev

* comment and docs

* chlog

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Apply suggestions from code review

Co-authored-by: chaton <thomas@grid.ai>

* Monkey-patch device count

* docs

* pep

* api_change

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2021-03-11 10:51:10 -05:00
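
For context, these are batch-transfer hooks that now raise an error when overridden together with DP; a sketch, with signatures as I recall them from this release:

```python
from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def on_before_batch_transfer(self, batch, dataloader_idx):
        # runs on the host before the batch is moved; not supported with DP
        return batch

    def on_after_batch_transfer(self, batch, dataloader_idx):
        # runs once the batch is on the target device; not supported with DP
        return batch
```
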