lightning

Commit Graph

Author	SHA1	Message	Date
Jirka Borovec	759e89df21	Future 1/n: package in src/ folder (#13293 ) * move: pytorch_lightning >> src/ * update setup & install * update CI * ci * update CI for examples * Self review * mypy Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * ci * make * docs * typo * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci: gpu * . * hpu * typing * docs * tpu Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2022-06-14 20:54:55 -04:00
Carlos Mocholí	0cf9d73d28	Drop PyTorch 1.8 support (#13155 ) * Drop PyTorch 1.8 support * Missed update * Skip profiler test until supported * Upgrade ipu dockerfile pytorch version * Update XLA version	2022-06-14 20:46:44 -04:00
Jirka Borovec	78ff201c7e	Update CI setup (#13291 ) * drop mamba * use legacy GPU machines	2022-06-14 17:11:54 +00:00
Akarsha Rao	bfa8b7be2d	Create hpu-ci-runner Dockerfile (#13239 ) * Create hpu-ci-runner Dockerfile * Add ENTRYPOINT script 'start.sh' to hpu-ci-runner * rename dirs * ci * add docker * Fix build failure * Fix build failure * Fix title of nightly ci runner build * Fix comments * Fix comments Co-authored-by: Jirka <jirka.borovec@seznam.cz> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>	2022-06-08 16:02:16 -04:00
Akihiro Nitta	3c5a8a833e	Decouple pulling legacy checkpoints from existing GHA workflows and docker files (#13185 ) * Add pull-legacy-checkpoints action * Replace pulls with the new action and script * Simplify	2022-06-02 15:39:14 +02:00
Jirka Borovec	de4ab1c027	update NGC docker (#13136 ) * update docker * Apply suggestions from code review Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-06-02 12:54:13 +00:00
Jirka Borovec	fab2ff35ad	CI: Azure - multiple configs (#12984 ) * CI: Azure - multiple configs * names * benchmark * Apply suggestions from code review Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-05-14 01:59:03 +00:00
Jirka Borovec	fec9a09672	add freeze for development and full range for install (#12994 ) * freeze versions * unfreeze * dependabot * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * fix all req * ... * use base * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix refs * Apply suggestions from code review Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> * Apply suggestions from code review * dockers Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>	2022-05-12 09:14:18 -04:00
Eric Wiener	3f78c4ca7a	Track CPU stats with DeviceStatsMonitor (#11795 ) Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Kaushik B <kaushikbokka@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-05-10 10:57:38 +00:00
Jirka Borovec	783ec43a85	parse strategies as own extras (#12975 ) * parse strategies as own extras * prune devel * Update Makefile Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * revert parse_requirements Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-05-09 09:25:53 -04:00
Jirka Borovec	7ce948edb6	Unpin CUDA docker image for GPU CI (#12373 ) * unpin CUDA docker image for GPU CI * Apply suggestions from code review Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2022-05-06 02:56:57 +00:00
Jirka Borovec	bb51e2a55b	Merge pull request #12723 from PyTorchLightning/req/strategies Separate strategies' requirements	2022-05-04 10:06:02 -04:00
Akihiro Nitta	ecd135e939	Update nvidia gpg key to fix nightly docker builds (#12930 ) * Update gpg key * Use curl instead of wget * Install key manually	2022-05-02 09:00:44 +02:00
Akihiro Nitta	98b206e836	Use cmake installed with apt (#12907 )	2022-04-28 07:44:52 +00:00
Akihiro Nitta	ace6a5827b	Update building docker images (#12837 ) Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai>	2022-04-21 22:10:42 +00:00
Jirka Borovec	16b9580958	build more dockers & slack fails (#12675 ) * build dockers * add slack * Apply suggestions from code review Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>	2022-04-13 17:24:08 +02:00
Jirka Borovec	f9b69ce5b0	CI: check docker requires (#12677 ) * check docker requires * ci update * bagua * conda * cuda	2022-04-12 00:29:54 +09:00
Kaushik B	bd035af78a	Fix TPU CI (#12419 )	2022-03-23 11:35:38 +05:30
Jirka Borovec	fe940e195d	CI: update prune_pkgs (#12382 )	2022-03-21 12:50:50 +00:00
four4fish	1eff3b53c1	Update fairscale version (#11567 ) Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Jirka <jirka.borovec@seznam.cz> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2022-03-21 11:38:55 +00:00
Jirka Borovec	efa870eebc	Docker: fix NCCL building Horovod (#12318 ) * Horovod w. MPI * nccl_built * fix	2022-03-18 14:23:19 +00:00
Jirka Borovec	7ee690758c	CI: fix running PT 1.11 (#12304 ) * fix fire * horovod * assistant * cmake * u20 * cuda * -j2 * fix mypy Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2022-03-12 09:00:20 +00:00
Jirka Borovec	bc8172856f	aggregate multiple helper scripts to single CLI (#11147 ) * nightly release * min version * fire	2022-03-11 11:13:43 +00:00
Jirka Borovec	1144673cd9	CI: sanity check for req. pkgs (#11819 ) * CI: sanity check for req. pkgs * scripts * rename * gcsfs ? * rich ! * install extra * move * set -e Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2022-03-11 09:20:47 +00:00
Jirka Borovec	3b4061f39a	CI: enable testing for PT 1.11 (#11792 ) * enable PT 1.11 * horovod * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Aki Nitta <nitta@akihironitta.com>	2022-03-10 18:38:47 +00:00
Jirka Borovec	8577ef7bba	Skip horovod 0.24.0 only (#12248 ) * try skip horovod 0.24.0 only * HOROVOD_BUILD_CUDA_CC_LIST * fix test Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-03-10 16:01:08 +00:00
wangraying	a0655611de	Add bagua installation in dockerfile (#11283 ) Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2022-02-24 15:17:31 +01:00
Jirka Borovec	7bc87015ea	Unblock GPU CI (#11934 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2022-02-16 21:15:44 +01:00
Aki Nitta	0a1b8b880d	Fix horovod installation `base-cuda` Dockerfile (#11811 ) * pip install --user * add checks * rm unrelated comment * consistent format * Fail if horovod not found Co-authored-by: Jirka <jirka.borovec@seznam.cz> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-02-10 16:48:33 +09:00
Aki Nitta	86b177ebe5	Fix `apex` installation path in Dockerfile (#11596 ) * empty commit * Specify apex installation target directory * pip install --user	2022-01-27 20:14:16 -05:00
Kaushik B	650c710efa	Rename training plugin test files & names to strategy (#11303 )	2022-01-04 14:32:45 +01:00
Carlos Mocholí	3692eba807	Drop Python 3.6 support (#11117 )	2021-12-21 17:06:15 +00:00
Kaushik B	2a5d05b562	Fix tpu spawn plugin test (#11131 )	2021-12-18 02:53:37 +00:00
Sean Naren	c66cd12445	Remove partitioning of model in ZeRO 3 (#10655 )	2021-12-17 12:36:53 +00:00
Jirka Borovec	e8659bd40e	update NGC (#10770 )	2021-11-29 14:14:37 +00:00
Carlos Mocholí	d2aaf6b4cc	Upgrade CI after the 1.10 release (#10075 )	2021-11-10 17:59:10 +01:00
Carlos Mocholí	939a861853	Update Python testing (#10269 )	2021-11-04 18:26:24 +01:00
Carlos Mocholí	70570f9eaa	Minimize the number of docker jobs (#10202 ) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-10-29 07:48:05 +01:00
Carlos Mocholí	3a4e9970d6	Pin fairscale version (#10200 )	2021-10-27 23:24:17 +00:00
Carlos Mocholí	a0e45dc071	Some minor CI cleanup (#10088 )	2021-10-26 13:58:20 +02:00
Kaushik B	af4a8f1950	Refactor tests for TPU Accelerator (#9718 ) Co-authored-by: tchaton <thomas@grid.ai>	2021-10-14 19:45:15 +00:00
Danielle Pintz	940b910d27	[2/4] Add DeviceStatsMonitor callback (#9712 ) Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> Co-authored-by: thomas chaton <thomas@grid.ai> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Kaushik B <kaushikbokka@gmail.com> Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>	2021-10-13 18:29:36 +00:00
edwardpwtsoi	7c6efbc8a8	Resolved wrong mv usage for extracted directory (#9678 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-10-05 12:56:33 +00:00
Jirka Borovec	0e6ee9c39d	CI: add mdformat (#8673 ) * add mdformat * exclude chlog * fix *** Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-08-03 18:19:09 +00:00
Jirka Borovec	66cc505339	update NGC (#8652 ) * update NGC Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-08-02 16:05:36 +00:00
Jirka Borovec	abbcfa1ab7	fix CI for PT 1.10 (#8526 ) * fix CI for PT 1.10 * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-07-23 19:24:31 +02:00
thomas chaton	8d0df6fad2	[Feat] Improve TPU CI (#6078 ) * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * i * update * update ci * i * i * i * i	2021-07-19 19:43:21 +05:30
Jirka Borovec	74a09a23f1	CI: support PT 1.10 (#8133 ) * prepare PT 1.10 * dockers * fixes * readme	2021-07-14 18:04:33 +03:00
Carlos Mocholí	6ce77a102b	Set minimum PyTorch version to 1.6 (#8288 ) Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2021-07-13 17:12:49 +00:00
Jirka Borovec	ed6d4baea2	ngc (#8242 )	2021-07-02 13:12:45 +01:00
Kaushik B	2f3c65e57b	XLA Profiler integration (#8014 )	2021-06-29 00:58:05 +05:30
Sean Naren	f7459f5328	DeepSpeed Infinity Update (#7234 ) * Update configs to match latest API * Ensure we move the entire model to device before configure optimizer is called * Add missing param * Expose parameters * Update references, drop local rank as it's now infered from the environment variable * Fix ref * Force install deepspeed 0.3.16 * Add guard for init * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Revert type checking * Install master for CI for testing purposes * Update CI * Fix tests * Add check * Update versions * Set precision * Fix * See if i can force upgrade * Attempt to fix * Drop * Add changelog Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-06-14 16:38:28 +00:00
Jirka Borovec	7b531ac7ac	Fix NVIDIA docker versions (#7834 )	2021-06-06 23:56:27 +02:00
Jirka Borovec	9a001fea22	update NGC docker (#7787 )	2021-06-01 12:11:29 +02:00
Tomy Hsieh	037a71b156	Update README.md (#7717 )	2021-05-26 12:58:11 +02:00
Kaushik B	2c10ecc232	MAINTAINER has been deprecated (#7683 )	2021-05-25 00:01:31 +05:30
Jirka Borovec	6e56f56aa1	docker use $(nproc) (#7606 ) * docker use $(nproc) * Update typo Co-authored-by: Roger Shieh <sh.rog@protonmail.ch> Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>	2021-05-19 21:48:14 +02:00
Jirka Borovec	298f9e5c2d	Prune deprecated utils modules (#7503 ) * argparse_utils * model_utils * warning_utils * xla_device_utils * chlog * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2021-05-13 07:24:42 +00:00
Jirka Borovec	db54b30776	Update README to 1.3 (#7489 )	2021-05-12 13:36:52 +02:00
Louis Taylor	2b7e65b747	Add base IPU dockerfiles (#7252 )	2021-05-07 12:07:29 +00:00
Jirka Borovec	1a27c12b26	update ngc for 1.3 (#7414 )	2021-05-07 13:13:54 +02:00
Jirka Borovec	626ef08694	enable Dockers for PT 1.9 (#7363 ) * enable PT 1.9 * fix versions * args * fix	2021-05-05 14:26:22 +02:00
Carlos Mocholí	c6a171b776	Fix requirements/adjust_versions.py (#7149 ) Co-authored-by: jirka <jirka.borovec@seznam.cz>	2021-05-04 01:06:28 +02:00
Adrian Wälchli	7636d422fa	Update DeepSpeed version requirement in Dockerfile (#7326 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-05-03 20:21:19 +02:00
Jirka Borovec	a153c15c90	Docker/nvidia (#7109 ) * version check * ...	2021-04-27 20:29:49 +01:00
Sean Naren	8439aead66	Update FairScale on CI (#7017 ) * Try updating CI to latest fairscale * Update availability of imports.py * Remove some of the fairscale custom ci stuff * Update grad scaler within the new process as reference is incorrect for spawn * Remove fairscale from mocks * Install fairscale 0.3.4 into the base container, remove from extra.txt * Update docs/source/conf.py * Fix import issues * Mock fairscale for docs * Fix DeepSpeed and FairScale to specific versions * Swap back to greater than * extras * Revert "extras" This reverts commit `7353479f` * ci Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: jirka <jirka.borovec@seznam.cz>	2021-04-23 12:37:00 +01:00
Jirka Borovec	1e4bc69a16	Ban `tensorboard==2.5.0` and `deepspeed==0.3.15` (#7159 ) * ban TB 2.5 * note * push * Ban tb==2.5.0 and deepspeed==0.3.15 * Fix pip command * pull * up * up Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-22 11:08:21 -04:00
Sean Naren	5d8610955a	Fix `apex` version in Docker due to broken upstream (#7146 ) * Set Apex commit before introduction of new MLP extensions * Refactor install command Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-21 23:58:55 +01:00
Jirka Borovec	da1ac3a530	update docker base on PT 1.7 (#6931 ) * update docker base on PT 1.7 * fix path	2021-04-13 10:06:06 +01:00
Sean Naren	b46cc557ef	[Feat] DeepSpeed single file saving (#6900 ) * Add single checkpoint capability * Fix checkpointing in test, few cleanups * Add comment * Change restore logic * Move vars around, add better explanation, make todo align with DeepSpeed team * Fix checkpointing * Remove deepspeed from extra, install in Dockerfile * push * pull * Split to two tests to see if it fixes Deepspeed error * Add comment	2021-04-12 22:44:09 +00:00
thomas chaton	1302766f83	DeepSpeed ZeRO Update (#6546 ) * Add context to call hook to handle all modules defined within the hook * Expose some additional parameters * Added docs, exposed parameters * Make sure we only configure if necessary * Setup activation checkpointing regardless, saves the user having to do it manually * Add some tests that fail currently * update * update * update * add tests * change docstring * resolve accumulate_grad_batches * resolve flake8 * Update DeepSpeed to use latest version, add some comments * add metrics * update * Small formatting fixes, clean up some code * Few cleanups * No need for default state * Fix tests, add some boilerplate that should move eventually * Add hook removal * Add a context manager to handle hook * Small naming cleanup * wip * move save_checkpoint responsability to accelerator * resolve flake8 * add BC * Change recommended scale to 16 * resolve flake8 * update test * update install * update * update test * update * update * update test * resolve flake8 * update * update * update on comments * Push * pull * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * update * Apply suggestions from code review * Swap to using world size defined by plugin * update * update todo * Remove deepspeed from extra, keep it in the base cuda docker install * Push * pull * update * update * update * update * Minor changes * duplicate * format * format2 Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-30 13:39:02 -04:00
Jirka Borovec	dcf6e4e310	remake nvidia docker (#6686 ) * use latest * remake * examples	2021-03-29 09:39:06 +01:00
Jirka Borovec	5780796931	NGC container PoC (#6187 ) * add NVIDIA flows * push * pull * ... * extras * ci prune * fix * tag * . * list	2021-03-20 02:55:46 +05:30
Jirka Borovec	85c8074bee	require: adjust versions (#6363 ) * adjust versions * release * manifest * pep8 * CI * fix * build	2021-03-06 14:34:54 +01:00
Sean Naren	8440595b26	[CI] Move DeepSpeed into CUDA image, remove DeepSpeed install from azure (#6043 ) * Move to CUDA image * Remove deepspeed install as deepspeed now in the cuda image * Remove path setting, as ninja should be in the container now	2021-02-17 18:51:31 -05:00
Sean Naren	5157ba5509	Add openmpi to our base cuda container for MPI support (#6026 ) * Add openmpi to our base container for DeepSpeed MPI support * conda Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-02-17 12:15:49 +00:00
Jirka Borovec	b5d7d08da5	fix nightly releases & readme (#5922 ) * fix nightly releases * readme * cuda * doxker * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * revert Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-02-16 13:46:28 -05:00
Adrian Wälchli	a3d4e7c86a	move accelerator legacy tests (#5948 ) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>	2021-02-13 19:42:18 -05:00
Justus Schock	da6dbc8d1d	PoC: Accelerator refactor (#5743 ) * restoring the result from subprocess * fix queue.get() order for results * add missing "block_backward_sync" context manager * add missing "block_backward_sync" context manager * fix sync_batchnorm * fix supported gpu-ids for tuple * fix clip gradients and inf recursion * accelerator selection: added cluster_environment plugin * fix torchelastic test * fix reduce early stopping decision for DDP * fix tests: callbacks, conversion to lightning optimizer * fix lightning optimizer does not pickle * fix setting benchmark and deterministic option * fix slurm amp test * fix prepare_data test and determine node_rank * fix retrieving last path when testing * remove obsolete plugin argument * fix test: test_trainer_config * fix torchscript tests * fix trainer.model access * move properties * fix test_transfer_batch_hook * fix auto_select_gpus * fix omegaconf test * fix test that needs to simulate slurm ddp * add horovod plugin * fix test with named arguments * clean up whitespace * fix datamodules test * remove old accelerators * fix naming * move old plugins * move to plugins * create precision subpackage * create training_type subpackage * fix all new import errors * fix wrong arguments order passed to test * fix LR finder * Added sharded training type and amp plugin * Move clip grad to precision plugin * Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically * Fix import issue, attempting to fix tests * Fix initial test * Reflect hook logic from master, should wrap model after move to device * Optional state consolidation, since master has optimizers not wrapped * change attribute for instance test * reset optimizers optimizers are not used in main process, so state would be wrong. * legacy * imports in accel * legacy2 * trainer imports * fix import errors after rebase * move hook to new setup location * provide unwrapping logic * fix trainer callback system * added ddp2 implementation * fix imports .legacy * move plugins * restore legacy * drop test.py from root * add tpu accelerator and plugins * fixes * fix lightning optimizer merge * reset bugreportmodel * unwrapping * step routing forward * model access * unwrap * opt * integrate distrib_type * sync changes * sync * fixes * add forgotten generators * add missing logic * update * import * missed imports * import fixes * isort * mv f * changelog * format * move helper to parallel plugin * d * add world size * clean up * duplicate * activate ddp_sharded and tpu * set nvidia flags * remove unused colab var * use_tpu <-> on_tpu attrs * make some ddp_cpu and clusterplugin tests pass * Ref/accelerator connector (#5742) * final cleanup Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * connector cleanup Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * trainer cleanup Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * accelerator cleanup + missing logic in accelerator connector Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * add missing changes to callbacks Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * reflect accelerator changes to lightning module Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * clean cluster envs Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * cleanup plugins Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * add broadcasting Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * yapf * remove plugin connector Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * plugins * manual optimization * update optimizer routing * add rank to torchelastic * fix memory mixed precision * setstate on trainer for pickling in ddp spawn * add predict method * add back commented accelerator code * adapt test for sync_batch_norm to new plugin * fix deprecated tests * fix ddp cpu choice when no num_processes are given * yapf format * skip a memory test that cannot pass anymore * fix pickle error in spawn plugin * x * avoid * x * fix cyclic import in docs build * add support for sharded * update typing * add sharded and sharded_spawn to distributed types * make unwrap model default * refactor LightningShardedDataParallel similar to LightningDistributedDataParallel * update sharded spawn to reflect changes * update sharded to reflect changes * Merge 1.1.5 changes * fix merge * fix merge * yapf isort * fix merge * yapf isort * fix indentation in test * copy over reinit scheduler implementation from dev1.2 * fix apex tracking calls with dev_debugger * reduce diff to dev1.2, clean up * fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu * sort plugin tests legacy/new * fix error handling for amp on cpu * fix merge fix merge fix merge * [Feat] Resolve manual_backward (#5837) * resolve manual_backward * resolve flake8 * update * resolve for ddp_spawn * resolve flake8 * resolve flake8 * resolve flake8 Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> * fix tests/accelerator tests on cpu * [BugFix] Resolve manual optimization (#5852) * resolve manual_optimization * update * update Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> * Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856) * resovle a bug * Accelerator refactor sharded rpc (#5854) * rpc branch * merge * update handling of rpc * make devices etc. Optional in RPC * set devices etc. later if necessary * remove devices from sequential * make devices optional in rpc * fix import * uncomment everything * fix cluster selection Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> * resolve bug * fix assert in rpc test * resolve a test * fix docs compilation * accelerator refactor - fix for sharded parity test (#5866) * fix memory issue with ddp_spawn * x x x x x x x x x * x * Remove DDP2 as this does not apply * Add missing pre optimizer hook to ensure lambda closure is called * fix apex docstring * [accelerator][BugFix] Resolve some test for 1 gpu (#5863) * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * update * resolve flake8 * update * update * update * update * update * all_gather * update * make plugins work, add misconfig for RPC * update * update * remove breaking test * resolve some tests * resolve flake8 * revert to ddp_spawn Co-authored-by: root <root@ip-172-31-88-60.ec2.internal> Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de> * yapf isort * resolve flake8 * fix apex doctests * fix apex doctests 2 * resolve docs * update drone * clean env * update * update * update * update * merge * Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881) * Fix RPC related tests, clean out old API, update for new accelerator API * Move tests out of legacy folder, update paths and names * Update test_remove_1-4.py * Expose properties for tpu cores/gpus/num_gpus * Add root GPU property * Move properties to properties.py * move tests that were previously in drone * Fix root GPU property (#5908) * Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator * Add missing tests back * fix best model path transfer when no checkpoint callback available * Fix setup hook order [wip] (#5858) * Call trainer setup hook before accelerator setup * Add test case * add new test * typo * fix callback order in test Co-authored-by: tchaton <thomas@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * rename ddp sequential -> rpc sequential for special test * revert * fix stupid merge problem * Use property in connector for sampler (#5913) * merge the import conflicts * fix spawning of processes in slurm * [wip] Fix some bugs for TPU [skip ci] (#5878) * fixed for single tpu * fixed spawn * fixed spawn * update * update * wip * resolve bugs * resolve bug * update on comment * removed decorator * resolve comments * set to 4 * update * update * need cleaning * update * update * update * resolve flake8 * resolve bugs * exclude broadcast * resolve bugs * change test * update * update * skip if meet fails * properly raise trace * update * add catch * wrap test * resolve typo * update * typo Co-authored-by: Lezwon Castelino <lezwon@gmail.com> Co-authored-by: Your Name <you@example.com> * resolve some tests * update * fix imports * update * resolve flake8 * update azure pipeline * skip a sharded test on cpu that requires a gpu * resolve tpus * resolve bug * resolve flake8 * update * updat utils * revert permission change on files * suggestions from carlos Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * remove unrelated formatting changes * remove incomplete comment * Update pytorch_lightning/accelerators/__init__.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * remove unrelated formatting change * add types * warn 1.7 ddp manual backward only if ddp kwarg unset * yapf + isort * pep8 unused imports * fix cyclic import in docs * Apply suggestions from code review * typer in accelerator.py * typo * Apply suggestions from code review * formatting * update on comments * update typo * Update pytorch_lightning/trainer/properties.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * update * suggestion from code review * suggestion from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: root <root@ip-172-31-88-60.ec2.internal> Co-authored-by: Lezwon Castelino <lezwon@gmail.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>	2021-02-12 15:48:56 -05:00
Jirka Borovec	c2c82dad62	CI: Azure (#5882 ) * add base Azure pipeline * skip	2021-02-10 04:43:26 -05:00
Jirka Borovec	1ac9164f91	create new Conda images (#5877 ) * create new Conda images * . * .	2021-02-09 15:30:48 +00:00
Jirka Borovec	937f11c05b	try fix: Docker with Conda & PT 1.8 (#5842 ) * ci * ver * list * pt * nk * ch * 4.9	2021-02-09 08:22:35 +00:00
tchaton	77be6f6e24	resolve conflits resolve doc boring commit docs torchvision tpu Update dockers/tpu-tests/tpu_test_cases.jsonnet Update dockers/tpu-tests/tpu_test_cases.jsonnet	2021-02-05 21:43:10 +01:00
Jirka Borovec	a39b382fe1	hotfix for GHA tpu (#5762 ) * -y * t * . * t	2021-02-05 21:43:10 +01:00
Sumanth Ratna	8732475701	Remove unnecessary intermediate layers in base-conda Dockerfile (#5697 ) * [docker][base-conda] Combine ENV+COPY instructions * [docker][base-cuda] Combine ENV+COPY instructions * [docker][base-xla] Combine ENV+COPY instructions * [docker][base-cuda] Fix COPY instruction * [docker][base-xla] Fix quote in ENV * [docker][base-xla] Fix $PATH in ENV * [docker][base-conda] Fix COPY instruction * chlog Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-02-05 21:40:40 +01:00
Jirka Borovec	07f24d2438	add nvidia docker image (#5668 ) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2021-01-29 11:01:03 -05:00
Jirka Borovec	7e2e874d95	Refactor: legacy accelerators and plugins (#5645 ) * tests: legacy * legacy: accel * legacy: plug * fix imports * mypy * flake8	2021-01-26 20:04:36 -05:00
Jirka Borovec	9dd04028d5	tests for legacy checkpoints (#5223 ) * wip * generate * clean * tests * copy * download * download * download * download * download * download * download * download * download * download * download * flake8 * extend * aws * extension * pull * pull * pull * pull * pull * pull * pull * try * try * try * got it * Apply suggestions from code review (cherry picked from commit `72525f0a83`)	2021-01-26 14:27:56 +01:00
Jeff Yang	e1a4c2e448	docker: run ci only docker related files are changed (#5203 ) * only run ci on docker related files * docker related files changed! * install pytorch along with cudatoolkit * build docker only on SUN * conda exit status has been fixed * reverts back to old conda version * add more docker related files * conda env update --name * create env and install pytorch again * create env and install pytorch again * ${PYTORCH_CHANNEL} * dont update pytorch with conda env update * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update dockers/base-conda/Dockerfile * Apply suggestions from code review * remove checks in cron job * Apply suggestions from code review * readd # * readd # Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Roger Shieh <sh.rog@protonmail.ch> (cherry picked from commit `cc624358c8`)	2021-01-26 14:27:56 +01:00
Jirka Borovec	9be04c1c0b	try to update failing dockers (#5611 )	2021-01-25 17:10:56 -05:00
Jirka Borovec	7e4d6cbe48	set minimal req. PT 1.4 (#5418 ) * set minimal req. PT 1.4 * chlog	2021-01-12 19:15:35 -05:00
Jirka Borovec	5119013c81	drop install FairScale for TPU (#5113 ) * drop install FairScale for TPU * typo Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>	2021-01-05 09:58:37 +01:00
Lezwon Castelino	12cb9942a1	Tpu save (#4309 ) * convert xla tensor to cpu before save * move_to_cpu * updated CHANGELOG.md * added on_save to accelerators * if accelerator is not None * refactors * change filename to run test * run test_tpu_backend * added xla_device_utils to tests * added xla_device_utils to test * removed tests * Revert "added xla_device_utils to test" This reverts commit 0c9316bb * fixed pep * increase timeout and print traceback * lazy check tpu exists * increased timeout removed barrier for tpu during test reduced epochs * fixed torch_xla imports * fix tests * define xla utils * fix test * aval * chlog * docs * aval * Apply suggestions from code review Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>	2020-12-02 13:05:11 +00:00
Jirka Borovec	2fe1eff85d	drop fairscale for PT <= 1.4 (#4910 ) * drop fairscale for PT <= 1.4 * fix * Add extra check to remove fairscale from minimal testing if using minimal torch version 1.3 * Update ci_test-full.yml * Update gym to .3 to see if this fixes examples CI * Update omegaconf to minimum for hydra v1.0 * Revert "Update gym to .3 to see if this fixes examples CI" This reverts commit `4221d4b9` * Revert "Update omegaconf to minimum for hydra v1.0" This reverts commit `4f579217` Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: SeanNaren <sean@grid.ai>	2020-11-30 23:19:30 +00:00
Jirka Borovec	597dfa174c	build dockers XLA 1.7 (#4891 ) * build XLA 1.7 * night XLA 1.7 * rename * use 1.7 * tpu ver	2020-11-29 15:14:19 -04:00
Jirka Borovec	bddc6cd77a	pytest default color (#4703 ) * pytest default color * time Co-authored-by: chaton <thomas@grid.ai>	2020-11-18 10:53:44 +00:00
Jirka Borovec	7940ea5aaf	CI: TPU drop install horovod (#4622 ) Co-authored-by: chaton <thomas@grid.ai>	2020-11-13 11:33:52 +01:00
Jirka Borovec	bd6c413829	Conda: PT 1.8 (#3833 ) * PT 1.8 * unfreeze PT * drop nightly from full * add PT 1.8 to workflow * readme table * cuda * skip cuda * test 1.8 * unfreeze torch vision Co-authored-by: ydcjeff <ydcjeff@outlook.com> Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>	2020-11-12 15:03:43 +01:00
Jeff Yang	23719e3c05	[dockers] install nvidia-dali-cudaXXX (#4532 ) * [dockers] install nvidia-dali-cuda100 * Apply suggestions from code review * build DALI * build DALI * build DALI * dali from source * dali from source * use binaries * qq Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>	2020-11-09 21:18:24 +06:30
Jeff Yang	1d594c5d0c	[docker] Lock cuda version (#4453 ) * lock cuda version * back to normal	2020-10-31 20:17:07 +06:30
Jeff Yang	0f584faa6b	PyTorch 1.7 Stable support (#3821 ) * prepare for 1.7 support [ci skip] * tpu [ci skip] * test run 1.7 * all 1.7, needs to fix tests * couple with torchvision * windows try * remove windows * 1.7 is here * on purpose fail [ci skip] * return [ci skip] * 1.7 docker * back to normal [ci skip] * change to some_val [ci skip] * add seed [ci skip] * 4 places [ci skip] * fail on purpose [ci skip] * verbose=True [ci skip] * use filename to track * use filename to track * monitor epoch + changelog * Update tests/checkpointing/test_model_checkpoint.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>	2020-10-30 15:42:14 +00:00
Jirka Borovec	ce8abd6255	Drone: use nightly build cuda docker images (#3658 ) * upgrade PT version * update docker * docker * try 1.5 * badge * fix typo: dor -> for (#3918) * prune * prune * env * echo * try * notes * env * env * env * notes * docker * prune * maintainer * CI * update * just 1.5 * CI * CI * CI * CI * CI * CI * CI * CI * CI * CI * CI * docker * CI * CI * CI * CI * CI * CI * CI * CI * CI * push * try * prune * CI * CI * CI * CI Co-authored-by: Klyukin Valeriy <mr.clyukin@gmail.com> Co-authored-by: Jeff Yang <ydcjeff@outlook.com>	2020-10-26 10:47:09 +00:00
Jeff Yang	d83c4e4d69	Cache docker builds (#3659 ) * parent `faa357648f` author ydcjeff <ydcjeff@outlook.com> 1601049378 +0630 committer ydcjeff <ydcjeff@outlook.com> 1601469495 +0630 cache docker builds lock horovod at 0.19.5 done [ci skip] [CI SKIP] use --cache-from [ci skip] typo and horovod [ci skip] exclude pt 1.3 py3.8 [ci skip] conda no cache [ci skip] fix * revert * align with master [ci skip] * retry * remove empty continuation lines * add comment * fix build-args	2020-10-25 18:46:10 +06:30
chaton	829d90b257	activated color in all pytest runs (#4254 ) * activated color in all pytest runs * Update .drone.yml Co-authored-by: Jeff Yang <ydcjeff@outlook.com> Co-authored-by: Jeff Yang <ydcjeff@outlook.com>	2020-10-20 16:38:17 +02:00
Jirka Borovec	d3567c33a6	move base req. to root (#4219 ) * move base req. to root * check-manifest * check-manifest * manifest * req	2020-10-18 20:40:18 +02:00
Jeff Yang	90929fa433	Fix apt repo issue for docker (#3823 ) * fix docker repo issue * docker * docker * docker * no cudnn * no cudnn * try 16.04 Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>	2020-10-05 23:18:14 -04:00
Jirka Borovec	1160270882	fix path in CI for release & python version in all dockers & duplicated badges (#3765 ) * typo * path * check * trigger * fix conda * pip ver * fix cuda * fix XLA * fix xla * ci * docker * BIULD * unBIULD * update * py 3.8 * apex * apex	2020-10-02 05:26:21 -04:00
Jirka Borovec	ab508dae0c	run TPU tests with multiple versions (#3024 ) * rename * multi build * multi build * copy * copy * copy * copy * copy * copy * clean * note * docker * formatting Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-09-30 08:36:02 -04:00
Jirka Borovec	a0968e4bdf	fix PT version in CUDA docker images (#3739 ) * upgrade PT version * update docker * docker * try 1.5 * fix docker versions * old * badge	2020-09-30 08:33:22 -04:00
Jirka Borovec	a94728c99b	spec Horovod version (#3661 ) * spec Horovod version * MAKEFLAGS="-j2" * tests * CI * docker * CI * docker	2020-09-26 19:30:25 +02:00
Jirka Borovec	0784cf3ab4	dockers nightly (#3615 ) * dockers nightly * typo * Apply suggestions from code review Co-authored-by: Jeff Yang <ydcjeff@outlook.com> Co-authored-by: Jeff Yang <ydcjeff@outlook.com>	2020-09-25 15:58:01 +02:00
Jeff Yang	a2120130ed	Lightning docker image based on base-cuda (#3637 ) * use lightning CI docker * exclude py3.8 and torch1.3 * torch 1.7 * mergify * Apply suggestions from code review Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-09-24 23:14:15 +02:00
Jirka Borovec	37a59be21b	build more docker configs (#3533 ) * update build cases * list * matrix * matrix * builds * docker * -j1 * -q * -q * sep * docker * docker * mergify * -j1 * -j1 * horovod * copy	2020-09-23 01:41:35 +02:00
Jeff Yang	8be79a9a96	stable, dev PyTorch in Dockerfile and conda gh actions (#3074 ) * dockerfile and actions file * dockerfile and actions file * added pytorch conda cpu nightly * added pytorch conda cpu nightly * recopy base reqs * gh action `include` torch nightly * add pytorch nightly & conda gh badge * rebase * fix horovod * proposal refactor * Update .github/workflows/ci_pt-conda.yml Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update .github/workflows/ci_pt-conda.yml Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * update * update * fix cmd * filled && * fix * add -y * torchvision >0.7 allowed * explicitly install torchvision * use HOROVOD_GPU_OPERATIONS env variable * CI * skip 1.7 * table Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-09-17 20:30:39 +02:00
Jirka Borovec	cbc4f6f8a4	add CI for building dockers (#3383 ) * rename * fix badges * add docker build * mergify * update * env * ci * times * CI * name * comment	2020-09-10 18:38:29 -04:00
Jirka Borovec	9f2b29a7cd	build XLA with py3.6 (#2863 ) * build py3.6 * info * conda * update * version * version * builds * builds * builds * builds * builds	2020-08-15 15:39:44 -04:00
Jirka Borovec	a6e7aa7796	allow using apex with any PT version (#2865 ) * wip * setup * type * name * wip * docs * imports * fix if * fix if * use_amp * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * fix tests * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * fix tests * todos Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-08-08 11:07:32 +02:00
Jirka Borovec	448be60701	update GPU to PT 1.5 (#2779 ) * update gpu PT 1.6 * fix docker * use PT 1.5 * Update tests/install_AMP.sh Co-authored-by: Nathan Raw <nxr9266@g.rit.edu> Co-authored-by: Nathan Raw <nxr9266@g.rit.edu>	2020-08-02 08:14:53 -04:00
Jirka Borovec	bc7a08fbe0	test dockers & add AMP in pt-1.6 (#1584 ) * exist images * names * images * args * pt 1.6 dev * circleci * update * refactor * build * fix * MKL	2020-07-31 08:23:13 -04:00
zcain117	d0b8e850a4	integrate with CircleCI (#2486 ) * add circleCI * wip * CircleCI setup that worked on my private repo. Use a working pytorch-lightning commit * Fix the orb imports * Update circleci header comment * Try to pull the GITHUB_REF from the CI_PULL_REQUEST * Use null instead of space for 'sed' * Add TODO for codecov * Remove echo of GKE_CLUSTER since it will be redacted by CircleCI. * Try running codecov upload. * Try using codecov orb * Use pip install codecov * Use codecov orb again since it should be approved * dockers/tpu-tests/Dockerfile * action * suggestions * drop suggestion * suggestion Co-authored-by: Jirka <jirka@pytorchlightning.ai>	2020-07-23 12:13:10 -04:00
Jirka Borovec	fb85d493d0	use XLA base image for TPU testing (#2536 ) * drop py3.6 * use base image * typo * skip extra * drop cache	2020-07-07 07:05:17 -04:00
Jirka Borovec	977df6ed31	Docker: building XLA base image (#2494 ) * refactor * add TPU base * wip * builds * typo * extras * simple * unzip * rename	2020-07-06 14:21:36 -04:00

222 Commits