lightning

Commit Graph

Author	SHA1	Message	Date
ver217	2fef6d9403	Add ColossalAI strategy (#14224 ) Co-authored-by: HELSON <c2h214748@gmail.com> Co-authored-by: rohitgr7 <rohitgr1998@gmail.com> Co-authored-by: otaj <ota@lightning.ai> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-10-11 13:59:09 +02:00
Rui Wang	40868f7f43	Add bagua support for CUDA 11.6 images (#14529 ) * Add support for bagua-cuda116 * Remove bagua-cuda115 from installation Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>	2022-09-09 20:07:25 +00:00
otaj	1ae14ca754	[CI] fix horovod tests (#14382 )	2022-08-25 17:30:06 +00:00
otaj	bb634310e7	[CI] Bump CUDA in Docker images to 11.6.1 (#14348 ) * bump cuda in docker images to 11.6.1 * PUSH TO HUB. REVERT THIS! * conda forge for 11.6 * cuda 11.5 * revert conda changes * 11.6 back again * 11.6 back again, all of them * maybe all passes now * maybe all passes now * final push * Revert "PUSH TO HUB. REVERT THIS!" This reverts commit `602bfce224`. * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2022-08-23 12:10:52 -04:00
Carlos Mocholí	1299e4f984	Run GPU tests with PyTorch 1.12 (#13716 ) Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2022-07-28 19:37:57 +05:30
Carlos Mocholí	ad87d2cad0	Future 5/n: Move requirements (#13306 ) Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2022-06-21 17:11:33 +02:00
Carlos Mocholí	0cf9d73d28	Drop PyTorch 1.8 support (#13155 ) * Drop PyTorch 1.8 support * Missed update * Skip profiler test until supported * Upgrade ipu dockerfile pytorch version * Update XLA version	2022-06-14 20:46:44 -04:00
Jirka Borovec	fab2ff35ad	CI: Azure - multiple configs (#12984 ) * CI: Azure - multiple configs * names * benchmark * Apply suggestions from code review Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-05-14 01:59:03 +00:00
Jirka Borovec	fec9a09672	add freeze for development and full range for install (#12994 ) * freeze versions * unfreeze * dependabot * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * fix all req * ... * use base * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix refs * Apply suggestions from code review Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> * Apply suggestions from code review * dockers Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>	2022-05-12 09:14:18 -04:00
Jirka Borovec	783ec43a85	parse strategies as own extras (#12975 ) * parse strategies as own extras * prune devel * Update Makefile Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * revert parse_requirements Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-05-09 09:25:53 -04:00
Jirka Borovec	7ce948edb6	Unpin CUDA docker image for GPU CI (#12373 ) * unpin CUDA docker image for GPU CI * Apply suggestions from code review Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2022-05-06 02:56:57 +00:00
Jirka Borovec	bb51e2a55b	Merge pull request #12723 from PyTorchLightning/req/strategies Separate strategies' requirements	2022-05-04 10:06:02 -04:00
Akihiro Nitta	ecd135e939	Update nvidia gpg key to fix nightly docker builds (#12930 ) * Update gpg key * Use curl instead of wget * Install key manually	2022-05-02 09:00:44 +02:00
Akihiro Nitta	ace6a5827b	Update building docker images (#12837 ) Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai>	2022-04-21 22:10:42 +00:00
Jirka Borovec	f9b69ce5b0	CI: check docker requires (#12677 ) * check docker requires * ci update * bagua * conda * cuda	2022-04-12 00:29:54 +09:00
Jirka Borovec	fe940e195d	CI: update prune_pkgs (#12382 )	2022-03-21 12:50:50 +00:00
four4fish	1eff3b53c1	Update fairscale version (#11567 ) Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Jirka <jirka.borovec@seznam.cz> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2022-03-21 11:38:55 +00:00
Jirka Borovec	efa870eebc	Docker: fix NCCL building Horovod (#12318 ) * Horovod w. MPI * nccl_built * fix	2022-03-18 14:23:19 +00:00
Jirka Borovec	7ee690758c	CI: fix running PT 1.11 (#12304 ) * fix fire * horovod * assistant * cmake * u20 * cuda * -j2 * fix mypy Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2022-03-12 09:00:20 +00:00
Jirka Borovec	1144673cd9	CI: sanity check for req. pkgs (#11819 ) * CI: sanity check for req. pkgs * scripts * rename * gcsfs ? * rich ! * install extra * move * set -e Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2022-03-11 09:20:47 +00:00
Jirka Borovec	8577ef7bba	Skip horovod 0.24.0 only (#12248 ) * try skip horovod 0.24.0 only * HOROVOD_BUILD_CUDA_CC_LIST * fix test Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-03-10 16:01:08 +00:00
wangraying	a0655611de	Add bagua installation in dockerfile (#11283 ) Co-authored-by: Aki Nitta <nitta@akihironitta.com> Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2022-02-24 15:17:31 +01:00
Jirka Borovec	7bc87015ea	Unblock GPU CI (#11934 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2022-02-16 21:15:44 +01:00
Aki Nitta	0a1b8b880d	Fix horovod installation `base-cuda` Dockerfile (#11811 ) * pip install --user * add checks * rm unrelated comment * consistent format * Fail if horovod not found Co-authored-by: Jirka <jirka.borovec@seznam.cz> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2022-02-10 16:48:33 +09:00
Aki Nitta	86b177ebe5	Fix `apex` installation path in Dockerfile (#11596 ) * empty commit * Specify apex installation target directory * pip install --user	2022-01-27 20:14:16 -05:00
Sean Naren	c66cd12445	Remove partitioning of model in ZeRO 3 (#10655 )	2021-12-17 12:36:53 +00:00
Carlos Mocholí	d2aaf6b4cc	Upgrade CI after the 1.10 release (#10075 )	2021-11-10 17:59:10 +01:00
Carlos Mocholí	939a861853	Update Python testing (#10269 )	2021-11-04 18:26:24 +01:00
Carlos Mocholí	70570f9eaa	Minimize the number of docker jobs (#10202 ) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-10-29 07:48:05 +01:00
Carlos Mocholí	3a4e9970d6	Pin fairscale version (#10200 )	2021-10-27 23:24:17 +00:00
Carlos Mocholí	a0e45dc071	Some minor CI cleanup (#10088 )	2021-10-26 13:58:20 +02:00
Jirka Borovec	74a09a23f1	CI: support PT 1.10 (#8133 ) * prepare PT 1.10 * dockers * fixes * readme	2021-07-14 18:04:33 +03:00
Carlos Mocholí	6ce77a102b	Set minimum PyTorch version to 1.6 (#8288 ) Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2021-07-13 17:12:49 +00:00
Sean Naren	f7459f5328	DeepSpeed Infinity Update (#7234 ) * Update configs to match latest API * Ensure we move the entire model to device before configure optimizer is called * Add missing param * Expose parameters * Update references, drop local rank as it's now infered from the environment variable * Fix ref * Force install deepspeed 0.3.16 * Add guard for init * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Revert type checking * Install master for CI for testing purposes * Update CI * Fix tests * Add check * Update versions * Set precision * Fix * See if i can force upgrade * Attempt to fix * Drop * Add changelog Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-06-14 16:38:28 +00:00
Jirka Borovec	6e56f56aa1	docker use $(nproc) (#7606 ) * docker use $(nproc) * Update typo Co-authored-by: Roger Shieh <sh.rog@protonmail.ch> Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>	2021-05-19 21:48:14 +02:00
Jirka Borovec	626ef08694	enable Dockers for PT 1.9 (#7363 ) * enable PT 1.9 * fix versions * args * fix	2021-05-05 14:26:22 +02:00
Adrian Wälchli	7636d422fa	Update DeepSpeed version requirement in Dockerfile (#7326 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-05-03 20:21:19 +02:00
Sean Naren	8439aead66	Update FairScale on CI (#7017 ) * Try updating CI to latest fairscale * Update availability of imports.py * Remove some of the fairscale custom ci stuff * Update grad scaler within the new process as reference is incorrect for spawn * Remove fairscale from mocks * Install fairscale 0.3.4 into the base container, remove from extra.txt * Update docs/source/conf.py * Fix import issues * Mock fairscale for docs * Fix DeepSpeed and FairScale to specific versions * Swap back to greater than * extras * Revert "extras" This reverts commit `7353479f` * ci Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: jirka <jirka.borovec@seznam.cz>	2021-04-23 12:37:00 +01:00
Jirka Borovec	1e4bc69a16	Ban `tensorboard==2.5.0` and `deepspeed==0.3.15` (#7159 ) * ban TB 2.5 * note * push * Ban tb==2.5.0 and deepspeed==0.3.15 * Fix pip command * pull * up * up Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-22 11:08:21 -04:00
Sean Naren	5d8610955a	Fix `apex` version in Docker due to broken upstream (#7146 ) * Set Apex commit before introduction of new MLP extensions * Refactor install command Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-21 23:58:55 +01:00
Sean Naren	b46cc557ef	[Feat] DeepSpeed single file saving (#6900 ) * Add single checkpoint capability * Fix checkpointing in test, few cleanups * Add comment * Change restore logic * Move vars around, add better explanation, make todo align with DeepSpeed team * Fix checkpointing * Remove deepspeed from extra, install in Dockerfile * push * pull * Split to two tests to see if it fixes Deepspeed error * Add comment	2021-04-12 22:44:09 +00:00
thomas chaton	1302766f83	DeepSpeed ZeRO Update (#6546 ) * Add context to call hook to handle all modules defined within the hook * Expose some additional parameters * Added docs, exposed parameters * Make sure we only configure if necessary * Setup activation checkpointing regardless, saves the user having to do it manually * Add some tests that fail currently * update * update * update * add tests * change docstring * resolve accumulate_grad_batches * resolve flake8 * Update DeepSpeed to use latest version, add some comments * add metrics * update * Small formatting fixes, clean up some code * Few cleanups * No need for default state * Fix tests, add some boilerplate that should move eventually * Add hook removal * Add a context manager to handle hook * Small naming cleanup * wip * move save_checkpoint responsability to accelerator * resolve flake8 * add BC * Change recommended scale to 16 * resolve flake8 * update test * update install * update * update test * update * update * update test * resolve flake8 * update * update * update on comments * Push * pull * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * update * Apply suggestions from code review * Swap to using world size defined by plugin * update * update todo * Remove deepspeed from extra, keep it in the base cuda docker install * Push * pull * update * update * update * update * Minor changes * duplicate * format * format2 Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-30 13:39:02 -04:00
Jirka Borovec	85c8074bee	require: adjust versions (#6363 ) * adjust versions * release * manifest * pep8 * CI * fix * build	2021-03-06 14:34:54 +01:00
Sean Naren	8440595b26	[CI] Move DeepSpeed into CUDA image, remove DeepSpeed install from azure (#6043 ) * Move to CUDA image * Remove deepspeed install as deepspeed now in the cuda image * Remove path setting, as ninja should be in the container now	2021-02-17 18:51:31 -05:00
Sean Naren	5157ba5509	Add openmpi to our base cuda container for MPI support (#6026 ) * Add openmpi to our base container for DeepSpeed MPI support * conda Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-02-17 12:15:49 +00:00
Jirka Borovec	c2c82dad62	CI: Azure (#5882 ) * add base Azure pipeline * skip	2021-02-10 04:43:26 -05:00
Sumanth Ratna	8732475701	Remove unnecessary intermediate layers in base-conda Dockerfile (#5697 ) * [docker][base-conda] Combine ENV+COPY instructions * [docker][base-cuda] Combine ENV+COPY instructions * [docker][base-xla] Combine ENV+COPY instructions * [docker][base-cuda] Fix COPY instruction * [docker][base-xla] Fix quote in ENV * [docker][base-xla] Fix $PATH in ENV * [docker][base-conda] Fix COPY instruction * chlog Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-02-05 21:40:40 +01:00
Jirka Borovec	9dd04028d5	tests for legacy checkpoints (#5223 ) * wip * generate * clean * tests * copy * download * download * download * download * download * download * download * download * download * download * download * flake8 * extend * aws * extension * pull * pull * pull * pull * pull * pull * pull * try * try * try * got it * Apply suggestions from code review (cherry picked from commit `72525f0a83`)	2021-01-26 14:27:56 +01:00
Jirka Borovec	9be04c1c0b	try to update failing dockers (#5611 )	2021-01-25 17:10:56 -05:00
Jirka Borovec	7e4d6cbe48	set minimal req. PT 1.4 (#5418 ) * set minimal req. PT 1.4 * chlog	2021-01-12 19:15:35 -05:00

63 Commits