Carlos Mocholí
152eb57def
Rename special to standalone ( #10779 )
2021-11-26 17:13:14 +00:00
Kaushik B
e0b4bb2ea3
Deprecate `DeviceType` in favor of `_AcceleratorType` ( #10503 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-11-25 16:41:03 +01:00
Adrian Wälchli
c09c9c7607
Remove redundant fit call from accelerator connector test ( #10626 )
2021-11-19 12:19:52 +05:30
Adrian Wälchli
1ff35ed0f5
Improve code quality in `AcceleratorConnector._configure_slurm_ddp` ( #10102 )
2021-11-17 23:10:47 +00:00
Carlos Mocholí
0fa07da987
Fail the test when a `DeprecationWarning` is raised ( #9940 )
2021-11-17 23:41:50 +01:00
Carlos Mocholí
af4af3d73a
Mock GPU accelerator connector tests ( #10554 )
2021-11-16 16:13:40 +00:00
Kaushik B
01cf7a2ac5
Deprecate `DistributedType` in favor of `StrategyType` ( #10505 )
2021-11-15 17:10:08 +00:00
Adrian Wälchli
a270a79ed9
Rename "master" methods to "main" in ClusterEnvironment plugins ( #10103 )
...
* rename occurrences of master port, master address, master node, master process
* rename properties
* add property decorators
* occurrences in docs
* update changelog
* update changelog
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add lost method
* create deprecation
* add changelog
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix typo (but it was already there!!!)
* Apply suggestions from code review
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
* add todo
* update more occurrences
* add types
* add missing import
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-11-08 12:32:58 +00:00
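For context, a minimal sketch of what a custom `ClusterEnvironment` looks like after this rename, assuming the 1.5-era abstract interface (the exact abstract-method set varies between releases, and the `creates_processes_externally` name comes from the rename commit further down this log; the `MY_*` variables are hypothetical):

```python
import os

from pytorch_lightning.plugins.environments import ClusterEnvironment


class MyClusterEnvironment(ClusterEnvironment):
    """Hypothetical environment wired to made-up MY_* variables."""

    @property
    def creates_processes_externally(self) -> bool:
        return True  # an external launcher starts the workers

    @property
    def main_address(self) -> str:  # formerly `master_address()`
        return os.environ["MY_MAIN_ADDR"]

    @property
    def main_port(self) -> int:  # formerly `master_port()`
        return int(os.environ["MY_MAIN_PORT"])

    def world_size(self) -> int:
        return int(os.environ["MY_WORLD_SIZE"])

    def set_world_size(self, size: int) -> None:
        pass  # size is dictated by the launcher

    def global_rank(self) -> int:
        return int(os.environ["MY_RANK"])

    def set_global_rank(self, rank: int) -> None:
        pass  # rank is dictated by the launcher

    def local_rank(self) -> int:
        return int(os.environ["MY_LOCAL_RANK"])

    def node_rank(self) -> int:
        return int(os.environ["MY_NODE_RANK"])
```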
Kaushik B
2ee6d9fbc7
Fix `distrib_type` not being set when Plugin instances are passed to Trainer ( #10251 )
2021-11-01 17:11:57 +05:30
Kaushik B
e0f7dbdd1c
Add support for `devices='auto'` ( #10264 )
2021-10-30 15:05:51 +00:00
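A hedged sketch of the new flag, assuming a 1.5-era `Trainer`:

```python
from pytorch_lightning import Trainer

# Select the best available accelerator, then use every device it exposes.
trainer = Trainer(accelerator="auto", devices="auto")
```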
Carlos Mocholí
e4eb61d812
Raise exception for `strategy=ddp_cpu|tpu_spawn` ( #10185 )
2021-10-29 16:15:24 +00:00
Carlos Mocholí
4bc73b2b76
Avoid deprecated usage in accelerator connector tests ( #10184 )
2021-10-29 12:36:21 +01:00
Adrian Wälchli
21a5867dad
Rename `ClusterEnvironment.creates_processes` ( #10106 )
...
Co-authored-by: tchaton <thomas@grid.ai>
2021-10-25 23:15:41 +00:00
Danielle Pintz
1f7bd6650c
Mark accelerator connector as protected ( #10032 )
2021-10-25 19:24:54 +00:00
Adrian Wälchli
76081fb846
Mark SLURM detection methods in `AcceleratorConnector` as protected ( #10101 )
...
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-10-25 17:52:15 +00:00
Kaushik B
56bc55db71
Update strategy flag in docs ( #10000 )
...
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-10-20 21:02:53 +05:30
Carlos Mocholí
53c62f63e8
Constrain IPU precision choices ( #10030 )
2021-10-20 00:52:01 +00:00
Carlos Mocholí
d45897d522
Rename `TPUHalfPrecisionPlugin` to `TPUBf16PrecisionPlugin` ( #10026 )
...
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-10-19 21:09:37 +00:00
Carlos Mocholí
e8beceb631
Add `TPUPrecisionPlugin` ( #10020 )
...
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-10-19 17:48:57 +00:00
Rohit Gupta
0aa220b46b
Remove deprecated `distributed_backend` from `Trainer` ( #10017 )
...
* rm distributed_backend from Trainer
* unused
* chlog
* internal distributed_backend
* Docstring
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-10-19 13:54:37 +00:00
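Migration sketch for this removal, assuming the 1.5-era flags; the old spelling is shown commented out:

```python
from pytorch_lightning import Trainer

# Removed in this commit:
#   trainer = Trainer(distributed_backend="ddp", gpus=2)

# Replacement:
trainer = Trainer(strategy="ddp", accelerator="gpu", devices=2)
```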
Carlos Mocholí
01b304ec57
Update accelerator connector messages after the addition of strategy ( #9937 )
2021-10-18 01:10:48 +00:00
Kaushik B
5e8829b97d
(1/n) tests: Use strategy flag instead of accelerator for training strategies ( #9931 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-10-16 20:40:25 +05:30
Carlos Mocholí
db4e770004
Validate the precision input earlier ( #9763 )
2021-10-15 17:30:00 +00:00
Oliver Borchert
afbf703684
Single-process multi-node CPU training ( #9603 )
...
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-10-14 22:21:41 +02:00
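A sketch of what this enables; the exact flag combination is an assumption based on the 1.5-era API, not taken from the PR:

```python
from pytorch_lightning import Trainer

# One CPU process per node, communicating over the gloo backend.
trainer = Trainer(strategy="ddp", accelerator="cpu", devices=1, num_nodes=2)
```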
Kaushik B
05b15e63f0
Add `strategy` argument to Trainer ( #8597 )
...
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-10-13 12:34:06 +00:00
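A hedged sketch of the new `strategy` flag, which accepts either a registered string or a training-type plugin instance (names per the 1.4/1.5-era API):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPSpawnPlugin

# As a string ...
trainer = Trainer(strategy="ddp_spawn", accelerator="cpu", devices=2)

# ... or as a plugin instance, for non-default constructor arguments
# (extra kwargs are forwarded to torch's DistributedDataParallel).
trainer = Trainer(
    strategy=DDPSpawnPlugin(find_unused_parameters=False),
    accelerator="cpu",
    devices=2,
)
```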
Rohit Gupta
617e798f3b
Raise an exception if using `amp_level` with native `amp_backend` ( #9755 )
...
* add exception
* chlog
* code review
* Apply suggestions from code review
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-10-01 14:27:05 +02:00
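A sketch of the new check, assuming a CUDA machine and the 1.4-era mixed-precision flags (`amp_level` is an Apex optimization level, so it is meaningless for the native `torch.cuda.amp` backend):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.utilities.exceptions import MisconfigurationException

# Valid: `amp_level` pairs with the Apex backend (requires apex installed), e.g.
#   Trainer(amp_backend="apex", amp_level="O2", precision=16, gpus=1)

# Invalid after this commit: `amp_level` with the native backend now raises.
try:
    Trainer(amp_backend="native", amp_level="O2", precision=16, gpus=1)
except MisconfigurationException as err:
    print(err)
```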
Jirka Borovec
6e124e7207
CI: precommit - docformatter ( #8584 )
...
* CI: precommit - docformatter
* fix deprecated
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-09-06 12:49:09 +00:00
Sean Naren
e9f4bffe0a
Add validate logic for precision ( #9080 )
2021-08-24 20:00:09 +00:00
Kaushik B
d01d8334b5
Fix `ddp` accelerator choice for cpu ( #8645 )
...
* Fix ddp accelerator choice for cpu
2021-08-02 21:24:07 +00:00
Kaushik B
850416f0a0
Fix distributed types support for CPUs ( #8667 )
2021-08-02 16:42:28 +05:30
Carlos Mocholí
a64cc37394
Replace `yapf` with `black` ( #7783 )
...
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-07-26 13:37:35 +02:00
Kaushik B
556879e5cf
Add support for devices flag to Trainer ( #8440 )
...
* Support devices flag to Trainer
2021-07-20 04:33:12 +00:00
Kaushik B
825c5dbe8c
Add support for (accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto') ( #7808 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2021-07-09 15:28:54 +00:00
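A hedged sketch of the widened `accelerator` values, using the device-count flag of that period:

```python
from pytorch_lightning import Trainer

trainer = Trainer(accelerator="cpu")           # force CPU
trainer = Trainer(accelerator="gpu", gpus=2)   # two GPUs
trainer = Trainer(accelerator="auto")          # pick whatever is available
```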
Adrian Wälchli
e7139ab9f7
Support `DDPPlugin` to be used on CPU ( #6208 )
...
* Skip test due to 'Python bus error'
* Debug NCCL
* Remove NCCL_DEBUG statement
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* fix
* add test
* changelog
* yapf
* patch os environ
* make a special test
* destroy pg
* debug
* revert
* revert
* problematic test
* skip
* try the fixture
* test
* update sensitive test
* update changelog
* remove comment
* update wrong test
* update test name
* parameterization
* Revert "parameterization"
This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.
* remove conftest
* ignore test
* teardown
* fix merge
* deep speed parameterization
* uncomment test
* update chlog
* update changelog
* split tests
* update test
* update test comments
* unroll test
* unroll test
* unroll test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* increase shm
* sudo
* unroll ipu
* Revert "sudo"
This reverts commit 6cc68c1478.
* Revert "increase shm"
This reverts commit 8c27163483.
* x
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* find guilty test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* POPTORCH_WAIT_FOR_IPU=1
* move test
* redo parameterize for ipu
* de-comment test
* move chlog
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-07-02 12:00:24 +01:00
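A hedged sketch of the behavior this commit enables, using the flag spelling of that release (`num_processes` plus a plugin instance; later versions express this through `strategy`):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# Previously DDPPlugin implied GPU training; now it can drive CPU processes too.
trainer = Trainer(num_processes=2, plugins=[DDPPlugin()])
```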
Andrew Tritt
e808f9fb28
Use DistributedSampler when running with custom accelerator ( #7814 )
...
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-06-18 14:34:05 +02:00
Adrian Wälchli
502adbced3
Refactor optimizer loop logic for manual and automatic optimization ( #7526 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2021-05-17 14:42:01 +02:00
Kaushik B
bf46730d92
Support TPU Pod Training (n/n) ( #7296 )
2021-05-17 11:33:44 +00:00
Nic Eggert
f4f51e0dcf
Add kubeflow cluster environment ( #7300 )
...
* Add kubeflow cluster environment
* Add KubeflowEnvironment to docs
* Add KubeflowEnvironment to the changelog
* break up a long line
* Add method to detect kubeflow environment
* Select Kubeflow environment when available
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Run pre-commit
* task_idx == 0
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-05-17 09:05:24 +01:00
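A hedged sketch of selecting the new environment explicitly (per the commit bullets above it is also auto-detected when available); flag names follow the 1.3-era API:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import KubeflowEnvironment

trainer = Trainer(
    accelerator="ddp",   # 1.3-era spelling; later versions use `strategy`
    gpus=1,
    num_nodes=2,
    plugins=[KubeflowEnvironment()],
)
```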
Alan Du
6ac16ff348
Fix DistribType for `ddp_cpu` (spawn) ( #7492 )
2021-05-14 20:53:26 +01:00
Jirka Borovec
d4ec75164c
Prune deprecated trainer attributes ( #7501 )
...
* use_single_gpu
* use_horovod
* use_ddp2
* use_ddp
* use_dp
* on_gpu
* use_tpu
* on_tpu
* on_cpu
* cleaning
* chlog
* Apply suggestions from code review
* Apply suggestions from code review
2021-05-12 20:10:15 +00:00
thomas chaton
3cc0b2c063
[test] Add checks for gpus=1 ( #7105 )
...
* update
* remove cluster env
2021-04-19 20:39:28 +02:00
Adrian Wälchli
33cc9fe138
Clean up environment access in plugins ( #6941 )
...
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-04-13 20:07:40 +02:00
Sean Naren
4e9b453854
[Fix] Move init dist connection into the setup function ( #6506 )
...
* Move connection setup into the setup function. Call setup hook after we set up the accelerator
* Added CHANGELOG.md
* fix setup order in callback test
* fix input arguments in test
* Mock distributed function, remove protection to turn into training type hook
* Remove import
* Add missing mock, ensure custom plugin does not create children process
* Skip test on windows
* Update deepspeed to init connection in setup
* Do not initialize distributed module
* Move DeepSpeed tests to special tests since dist communication is being set up
* Special the test to see if this fixes CI
* Delete accelerator connector test to see if its causing build to fail
* Delete deepspeed test
* Revert "Delete accelerator connector test to see if its causing build to fail"
This reverts commit edde60b8
* Revert "Delete deepspeed test"
This reverts commit 9d317429
* Reverse hook
* Reverse setup hooks to debug again
* Add todo so i know where i left off
* For single device move in pre_dispatch after setup function
* Add additional model to device hook if any additional parameters have been set
* See if we can enable deepspeed tests
* Revert "See if we can enable deepspeed tests"
This reverts commit b5450def
* See if this hook approach works
* Introduce new granular hooks
* Remove import, fix tpu spawn by moving the function to setup
* Added missing special test
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-18 14:33:39 -07:00
Jirka Borovec
55dd3a4c64
Typing for tests 1/n ( #6313 )
...
* typing
* yapf
* typing
2021-03-09 11:27:15 +00:00
Adrian Wälchli
ec8d46e02b
Introduce default cluster environment for Lightning-specific DDP ( #5915 )
...
* handle distributed_sampler_kwargs
* move emptying cache to accelertor
* fix a few tests
* restoring the result from subprocess
* fix queue.get() order for results
* add missing "block_backward_sync" context manager
* add missing "block_backward_sync" context manager
* fix sync_batchnorm
* fix supported gpu-ids for tuple
* fix clip gradients and inf recursion
* accelerator selection: added cluster_environment plugin
* fix torchelastic test
* fix reduce early stopping decision for DDP
* fix tests: callbacks, conversion to lightning optimizer
* fix lightning optimizer does not pickle
* fix setting benchmark and deterministic option
* fix slurm amp test
* fix prepare_data test and determine node_rank
* fix retrieving last path when testing
* remove obsolete plugin argument
* fix test: test_trainer_config
* fix torchscript tests
* fix trainer.model access
* move properties
* fix test_transfer_batch_hook
* fix auto_select_gpus
* fix omegaconf test
* fix test that needs to simulate slurm ddp
* add horovod plugin
* fix test with named arguments
* clean up whitespace
* fix datamodules test
* remove old accelerators
* fix naming
* move old plugins
* move to plugins
* create precision subpackage
* create training_type subpackage
* fix all new import errors
* fix wrong arguments order passed to test
* fix LR finder
* Added sharded training type and amp plugin
* Move clip grad to precision plugin
* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically
* Fix import issue, attempting to fix tests
* Fix initial test
* Reflect hook logic from master, should wrap model after move to device
* Optional state consolidation, since master has optimizers not wrapped
* change attribute for instance test
* reset optimizers
optimizers are not used in main process, so state would be wrong.
* legacy
* imports in accel
* legacy2
* trainer imports
* fix import errors after rebase
* move hook to new setup location
* provide unwrapping logic
* fix trainer callback system
* added ddp2 implementation
* fix imports .legacy
* move plugins
* restore legacy
* drop test.py from root
* add tpu accelerator and plugins
* fixes
* fix lightning optimizer merge
* reset bugreportmodel
* unwrapping
* step routing forward
* model access
* unwrap
* opt
* integrate distrib_type
* sync changes
* sync
* fixes
* add forgotten generators
* add missing logic
* update
* import
* missed imports
* import fixes
* isort
* mv f
* changelog
* format
* move helper to parallel plugin
* d
* add world size
* clean up
* duplicate
* activate ddp_sharded and tpu
* set nvidia flags
* remove unused colab var
* use_tpu <-> on_tpu attrs
* make some ddp_cpu and clusterplugin tests pass
* Ref/accelerator connector (#5742 )
* final cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* connector cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* trainer cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* accelerator cleanup + missing logic in accelerator connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add missing changes to callbacks
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* reflect accelerator changes to lightning module
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* clean cluster envs
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* cleanup plugins
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add broadcasting
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* yapf
* remove plugin connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* plugins
* manual optimization
* update optimizer routing
* add rank to torchelastic
* fix memory mixed precision
* setstate on trainer for pickling in ddp spawn
* add predict method
* add back commented accelerator code
* adapt test for sync_batch_norm to new plugin
* fix deprecated tests
* fix ddp cpu choice when no num_processes are given
* yapf format
* skip a memory test that cannot pass anymore
* fix pickle error in spawn plugin
* x
* avoid
* x
* fix cyclic import in docs build
* add support for sharded
* update typing
* add sharded and sharded_spawn to distributed types
* make unwrap model default
* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel
* update sharded spawn to reflect changes
* update sharded to reflect changes
* Merge 1.1.5 changes
* fix merge
* fix merge
* yapf isort
* fix merge
* yapf isort
* fix indentation in test
* copy over reinit scheduler implementation from dev1.2
* fix apex tracking calls with dev_debugger
* reduce diff to dev1.2, clean up
* fix trainer config test when gpus > 0, num_processes > 0, and ddp_cpu
* sort plugin tests legacy/new
* fix error handling for amp on cpu
* fix merge
* [Feat] Resolve manual_backward (#5837 )
* resolve manual_backward
* resolve flake8
* update
* resolve for ddp_spawn
* resolve flake8
* resolve flake8
* resolve flake8
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* fix tests/accelerator tests on cpu
* [BugFix] Resolve manual optimization (#5852 )
* resolve manual_optimization
* update
* update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* Move copying of trainer parameters to happen earlier within the loop and add a safeguard to get the ref model (#5856 )
* resolve a bug
* Accelerator refactor sharded rpc (#5854 )
* rpc branch
* merge
* update handling of rpc
* make devices etc. Optional in RPC
* set devices etc. later if necessary
* remove devices from sequential
* make devices optional in rpc
* fix import
* uncomment everything
* fix cluster selection
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* resolve bug
* fix assert in rpc test
* resolve a test
* fix docs compilation
* accelerator refactor - fix for sharded parity test (#5866 )
* fix memory issue with ddp_spawn
* x
* x
* Remove DDP2 as this does not apply
* Add missing pre optimizer hook to ensure lambda closure is called
* fix apex docstring
* [accelerator][BugFix] Resolve some test for 1 gpu (#5863 )
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* update
* resolve flake8
* update
* update
* update
* update
* update
* all_gather
* update
* make plugins work, add misconfig for RPC
* update
* update
* remove breaking test
* resolve some tests
* resolve flake8
* revert to ddp_spawn
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
* yapf isort
* resolve flake8
* fix apex doctests
* fix apex doctests 2
* resolve docs
* update drone
* clean env
* update
* update
* update
* update
* merge
* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881 )
* Fix RPC related tests, clean out old API, update for new accelerator API
* Move tests out of legacy folder, update paths and names
* Update test_remove_1-4.py
* Expose properties for tpu cores/gpus/num_gpus
* Add root GPU property
* Move properties to properties.py
* move tests that were previously in drone
* Fix root GPU property (#5908 )
* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator
* Add missing tests back
* fix best model path transfer when no checkpoint callback is available
* Fix setup hook order [wip] (#5858 )
* Call trainer setup hook before accelerator setup
* Add test case
* add new test
* typo
* fix callback order in test
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* rename ddp sequential -> rpc sequential for special test
* revert
* fix stupid merge problem
* abstract the cluster plugins
* default plugin
* integrate default environment
* fix property
* adapt tests
* adjust test
* fix world size access
* base cluster env
* revert rebase errors
* revert rebase errors
* missing import
* revert unrelated change
* remove unused cluster local rank
* remove unrelated changes
* fix unrelated changes
* fix pep8
* remove unused var
* reset permissions
* yapf
* test default environment
* test torchelastic environment
* world size as int
* tests for slurm environment
* changelog
* test comments
* remove unintended change
* keep master port fixed after it is generated
* test random master port
* yapf
* add missing default environment
* move helper function
* rename default environment
* rename
* rename
* yapf
* Update pytorch_lightning/plugins/environments/lightning_environment.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Update CHANGELOG.md
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
* spawn -> create
Co-authored-by: justusschock <justus.schock@posteo.de>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-05 01:47:29 +00:00
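The default environment introduced here is `LightningEnvironment`; a hedged sketch of its contract, written with the later `main_*`/`creates_processes_externally` naming from the rename commits near the top of this log (the release that introduced it still used `master_*` methods):

```python
from pytorch_lightning.plugins.environments import LightningEnvironment

env = LightningEnvironment()

# Lightning spawns its own workers rather than relying on an external launcher.
assert not env.creates_processes_externally

# Per "keep master port fixed after it is generated" above: the port is chosen
# once (a free port unless MASTER_PORT is set) and then stays stable.
assert env.main_port == env.main_port
```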
Kaushik B
4157b35062
Add fairscale & deepspeed to skipif 4/n ( #6281 )
...
* add fairscale & windows to skipif
* add deepspeed to runif
* fairscale
* deepspeed
* flake8
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-02 19:45:13 +00:00
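These skipif refactors funnel into the `RunIf` marker from the test suite; a hedged usage sketch (the helper lives in the repository's tests, e.g. `tests/helpers/runif.py`, not the installed package, and the exact keyword set varies by version):

```python
from tests.helpers.runif import RunIf


@RunIf(min_gpus=2, deepspeed=True)
def test_deepspeed_multigpu():
    ...


@RunIf(skip_windows=True)
def test_ddp_spawn_on_cpu():
    ...
```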
Jirka Borovec
ac583781db
Refactor: Runif for TPU and Horovod 5/n ( #6301 )
...
* TPU
* horovod
* extra
* fix
* Apply suggestions from code review
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
* doc
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2021-03-02 16:21:20 +00:00
Sean Naren
80019874e5
[fix] Ensure we check deepspeed/sharded in multinode DDP ( #6297 )
...
* Ensure we check deepspeed/sharded in multinode
* Add CHANGELOG.md
* Add CHANGELOG.md
* Drop mock, use actual multi-gpu node
2021-03-02 13:36:18 +00:00
Jirka Borovec
0f9134e043
Refactor: skipif for Windows 2/n ( #6268 )
...
* win
* isort
* flake8
2021-03-02 09:36:01 +00:00
Jirka Borovec
eb815000f6
Refactor: skipif for multi-GPU 1/n ( #6266 )
...
* ngpus
* gpu
* isort
* pt
* flake8
2021-03-02 09:03:32 +01:00