lightning

Commit Graph

Author	SHA1	Message	Date
Jirka Borovec	74a09a23f1	CI: support PT 1.10 (#8133) * prepare PT 1.10 * dockers * fixes * readme	2021-07-14 18:04:33 +03:00
Carlos Mocholí	6ce77a102b	Set minimum PyTorch version to 1.6 (#8288) Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2021-07-13 17:12:49 +00:00
Sean Naren	f7459f5328	DeepSpeed Infinity Update (#7234) * Update configs to match latest API * Ensure we move the entire model to device before configure optimizer is called * Add missing param * Expose parameters * Update references, drop local rank as it's now infered from the environment variable * Fix ref * Force install deepspeed 0.3.16 * Add guard for init * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Revert type checking * Install master for CI for testing purposes * Update CI * Fix tests * Add check * Update versions * Set precision * Fix * See if i can force upgrade * Attempt to fix * Drop * Add changelog Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-06-14 16:38:28 +00:00
Jirka Borovec	6e56f56aa1	docker use $(nproc) (#7606) * docker use $(nproc) * Update typo Co-authored-by: Roger Shieh <sh.rog@protonmail.ch> Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>	2021-05-19 21:48:14 +02:00
Jirka Borovec	626ef08694	enable Dockers for PT 1.9 (#7363) * enable PT 1.9 * fix versions * args * fix	2021-05-05 14:26:22 +02:00
Adrian Wälchli	7636d422fa	Update DeepSpeed version requirement in Dockerfile (#7326) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-05-03 20:21:19 +02:00
Sean Naren	8439aead66	Update FairScale on CI (#7017) * Try updating CI to latest fairscale * Update availability of imports.py * Remove some of the fairscale custom ci stuff * Update grad scaler within the new process as reference is incorrect for spawn * Remove fairscale from mocks * Install fairscale 0.3.4 into the base container, remove from extra.txt * Update docs/source/conf.py * Fix import issues * Mock fairscale for docs * Fix DeepSpeed and FairScale to specific versions * Swap back to greater than * extras * Revert "extras" This reverts commit `7353479f` * ci Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: jirka <jirka.borovec@seznam.cz>	2021-04-23 12:37:00 +01:00
Jirka Borovec	1e4bc69a16	Ban `tensorboard==2.5.0` and `deepspeed==0.3.15` (#7159) * ban TB 2.5 * note * push * Ban tb==2.5.0 and deepspeed==0.3.15 * Fix pip command * pull * up * up Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-22 11:08:21 -04:00
Sean Naren	5d8610955a	Fix `apex` version in Docker due to broken upstream (#7146) * Set Apex commit before introduction of new MLP extensions * Refactor install command Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-21 23:58:55 +01:00
Sean Naren	b46cc557ef	[Feat] DeepSpeed single file saving (#6900) * Add single checkpoint capability * Fix checkpointing in test, few cleanups * Add comment * Change restore logic * Move vars around, add better explanation, make todo align with DeepSpeed team * Fix checkpointing * Remove deepspeed from extra, install in Dockerfile * push * pull * Split to two tests to see if it fixes Deepspeed error * Add comment	2021-04-12 22:44:09 +00:00
thomas chaton	1302766f83	DeepSpeed ZeRO Update (#6546) * Add context to call hook to handle all modules defined within the hook * Expose some additional parameters * Added docs, exposed parameters * Make sure we only configure if necessary * Setup activation checkpointing regardless, saves the user having to do it manually * Add some tests that fail currently * update * update * update * add tests * change docstring * resolve accumulate_grad_batches * resolve flake8 * Update DeepSpeed to use latest version, add some comments * add metrics * update * Small formatting fixes, clean up some code * Few cleanups * No need for default state * Fix tests, add some boilerplate that should move eventually * Add hook removal * Add a context manager to handle hook * Small naming cleanup * wip * move save_checkpoint responsability to accelerator * resolve flake8 * add BC * Change recommended scale to 16 * resolve flake8 * update test * update install * update * update test * update * update * update test * resolve flake8 * update * update * update on comments * Push * pull * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * update * Apply suggestions from code review * Swap to using world size defined by plugin * update * update todo * Remove deepspeed from extra, keep it in the base cuda docker install * Push * pull * update * update * update * update * Minor changes * duplicate * format * format2 Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-30 13:39:02 -04:00
Jirka Borovec	85c8074bee	require: adjust versions (#6363) * adjust versions * release * manifest * pep8 * CI * fix * build	2021-03-06 14:34:54 +01:00
Sean Naren	8440595b26	[CI] Move DeepSpeed into CUDA image, remove DeepSpeed install from azure (#6043) * Move to CUDA image * Remove deepspeed install as deepspeed now in the cuda image * Remove path setting, as ninja should be in the container now	2021-02-17 18:51:31 -05:00
Sean Naren	5157ba5509	Add openmpi to our base cuda container for MPI support (#6026) * Add openmpi to our base container for DeepSpeed MPI support * conda Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-02-17 12:15:49 +00:00
Jirka Borovec	c2c82dad62	CI: Azure (#5882) * add base Azure pipeline * skip	2021-02-10 04:43:26 -05:00
Sumanth Ratna	8732475701	Remove unnecessary intermediate layers in base-conda Dockerfile (#5697) * [docker][base-conda] Combine ENV+COPY instructions * [docker][base-cuda] Combine ENV+COPY instructions * [docker][base-xla] Combine ENV+COPY instructions * [docker][base-cuda] Fix COPY instruction * [docker][base-xla] Fix quote in ENV * [docker][base-xla] Fix $PATH in ENV * [docker][base-conda] Fix COPY instruction * chlog Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-02-05 21:40:40 +01:00
Jirka Borovec	9dd04028d5	tests for legacy checkpoints (#5223) * wip * generate * clean * tests * copy * download * download * download * download * download * download * download * download * download * download * download * flake8 * extend * aws * extension * pull * pull * pull * pull * pull * pull * pull * try * try * try * got it * Apply suggestions from code review (cherry picked from commit `72525f0a83`)	2021-01-26 14:27:56 +01:00
Jirka Borovec	9be04c1c0b	try to update failing dockers (#5611)	2021-01-25 17:10:56 -05:00
Jirka Borovec	7e4d6cbe48	set minimal req. PT 1.4 (#5418) * set minimal req. PT 1.4 * chlog	2021-01-12 19:15:35 -05:00
Jirka Borovec	2fe1eff85d	drop fairscale for PT <= 1.4 (#4910) * drop fairscale for PT <= 1.4 * fix * Add extra check to remove fairscale from minimal testing if using minimal torch version 1.3 * Update ci_test-full.yml * Update gym to .3 to see if this fixes examples CI * Update omegaconf to minimum for hydra v1.0 * Revert "Update gym to .3 to see if this fixes examples CI" This reverts commit `4221d4b9` * Revert "Update omegaconf to minimum for hydra v1.0" This reverts commit `4f579217` Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: SeanNaren <sean@grid.ai>	2020-11-30 23:19:30 +00:00
Jirka Borovec	bd6c413829	Conda: PT 1.8 (#3833) * PT 1.8 * unfreeze PT * drop nightly from full * add PT 1.8 to workflow * readme table * cuda * skip cuda * test 1.8 * unfreeze torch vision Co-authored-by: ydcjeff <ydcjeff@outlook.com> Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>	2020-11-12 15:03:43 +01:00
Jeff Yang	23719e3c05	[dockers] install nvidia-dali-cudaXXX (#4532) * [dockers] install nvidia-dali-cuda100 * Apply suggestions from code review * build DALI * build DALI * build DALI * dali from source * dali from source * use binaries * qq Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>	2020-11-09 21:18:24 +06:30
Jirka Borovec	ce8abd6255	Drone: use nightly build cuda docker images (#3658) * upgrade PT version * update docker * docker * try 1.5 * badge * fix typo: dor -> for (#3918) * prune * prune * env * echo * try * notes * env * env * env * notes * docker * prune * maintainer * CI * update * just 1.5 * CI * CI * CI * CI * CI * CI * CI * CI * CI * CI * CI * docker * CI * CI * CI * CI * CI * CI * CI * CI * CI * push * try * prune * CI * CI * CI * CI Co-authored-by: Klyukin Valeriy <mr.clyukin@gmail.com> Co-authored-by: Jeff Yang <ydcjeff@outlook.com>	2020-10-26 10:47:09 +00:00
Jeff Yang	d83c4e4d69	Cache docker builds (#3659) * parent `faa357648f` author ydcjeff <ydcjeff@outlook.com> 1601049378 +0630 committer ydcjeff <ydcjeff@outlook.com> 1601469495 +0630 cache docker builds lock horovod at 0.19.5 done [ci skip] [CI SKIP] use --cache-from [ci skip] typo and horovod [ci skip] exclude pt 1.3 py3.8 [ci skip] conda no cache [ci skip] fix * revert * align with master [ci skip] * retry * remove empty continuation lines * add comment * fix build-args	2020-10-25 18:46:10 +06:30
Jeff Yang	90929fa433	Fix apt repo issue for docker (#3823) * fix docker repo issue * docker * docker * docker * no cudnn * no cudnn * try 16.04 Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>	2020-10-05 23:18:14 -04:00
Jirka Borovec	1160270882	fix path in CI for release & python version in all dockers & duplicated badges (#3765) * typo * path * check * trigger * fix conda * pip ver * fix cuda * fix XLA * fix xla * ci * docker * BIULD * unBIULD * update * py 3.8 * apex * apex	2020-10-02 05:26:21 -04:00
Jirka Borovec	a0968e4bdf	fix PT version in CUDA docker images (#3739) * upgrade PT version * update docker * docker * try 1.5 * fix docker versions * old * badge	2020-09-30 08:33:22 -04:00
Jirka Borovec	a94728c99b	spec Horovod version (#3661) * spec Horovod version * MAKEFLAGS="-j2" * tests * CI * docker * CI * docker	2020-09-26 19:30:25 +02:00
Jirka Borovec	0784cf3ab4	dockers nightly (#3615) * dockers nightly * typo * Apply suggestions from code review Co-authored-by: Jeff Yang <ydcjeff@outlook.com> Co-authored-by: Jeff Yang <ydcjeff@outlook.com>	2020-09-25 15:58:01 +02:00
Jirka Borovec	37a59be21b	build more docker configs (#3533) * update build cases * list * matrix * matrix * builds * docker * -j1 * -q * -q * sep * docker * docker * mergify * -j1 * -j1 * horovod * copy	2020-09-23 01:41:35 +02:00
Jeff Yang	8be79a9a96	stable, dev PyTorch in Dockerfile and conda gh actions (#3074) * dockerfile and actions file * dockerfile and actions file * added pytorch conda cpu nightly * added pytorch conda cpu nightly * recopy base reqs * gh action `include` torch nightly * add pytorch nightly & conda gh badge * rebase * fix horovod * proposal refactor * Update .github/workflows/ci_pt-conda.yml Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update .github/workflows/ci_pt-conda.yml Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * update * update * fix cmd * filled && * fix * add -y * torchvision >0.7 allowed * explicitly install torchvision * use HOROVOD_GPU_OPERATIONS env variable * CI * skip 1.7 * table Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-09-17 20:30:39 +02:00
Jirka Borovec	9f2b29a7cd	build XLA with py3.6 (#2863) * build py3.6 * info * conda * update * version * version * builds * builds * builds * builds * builds	2020-08-15 15:39:44 -04:00

32 Commits