Commit Graph

51 Commits

Author SHA1 Message Date
Akihiro Nitta ecd135e939
Update nvidia gpg key to fix nightly docker builds (#12930)
* Update gpg key
* Use curl instead of wget
* Install key manually
2022-05-02 09:00:44 +02:00
Akihiro Nitta ace6a5827b
Update building docker images (#12837)
Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai>
2022-04-21 22:10:42 +00:00
Jirka Borovec f9b69ce5b0
CI: check docker requires (#12677)
* check docker requires
* ci update
* bagua
* conda
* cuda
2022-04-12 00:29:54 +09:00
Jirka Borovec fe940e195d
CI: update prune_pkgs (#12382)
2022-03-21 12:50:50 +00:00
four4fish 1eff3b53c1
Update fairscale version (#11567)
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-21 11:38:55 +00:00
Jirka Borovec efa870eebc
Docker: fix NCCL building Horovod (#12318)
* Horovod w. MPI
* nccl_built
* fix
2022-03-18 14:23:19 +00:00
Jirka Borovec 7ee690758c
CI: fix running PT 1.11 (#12304)
* fix fire
* horovod
* assistant
* cmake
* u20
* cuda
* -j2
* fix mypy

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-12 09:00:20 +00:00
Jirka Borovec 1144673cd9
CI: sanity check for req. pkgs (#11819)
* CI: sanity check for req. pkgs
* scripts
* rename
* gcsfs ?
* rich !
* install extra
* move
* set -e

Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-03-11 09:20:47 +00:00
Jirka Borovec 8577ef7bba
Skip horovod 0.24.0 only (#12248)
* try skip horovod 0.24.0 only
* HOROVOD_BUILD_CUDA_CC_LIST
* fix test

Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-03-10 16:01:08 +00:00
wangraying a0655611de
Add bagua installation in dockerfile (#11283)
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-24 15:17:31 +01:00
Jirka Borovec 7bc87015ea
Unblock GPU CI (#11934)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-16 21:15:44 +01:00
Aki Nitta 0a1b8b880d
Fix horovod installation `base-cuda` Dockerfile (#11811)
* pip install --user

* add checks

* rm unrelated comment

* consistent format

* Fail if horovod not found

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-10 16:48:33 +09:00
Aki Nitta 86b177ebe5
Fix `apex` installation path in Dockerfile (#11596)
* empty commit

* Specify apex installation target directory

* pip install --user
2022-01-27 20:14:16 -05:00
Sean Naren c66cd12445
Remove partitioning of model in ZeRO 3 (#10655)
2021-12-17 12:36:53 +00:00
Carlos Mocholí d2aaf6b4cc
Upgrade CI after the 1.10 release (#10075)
2021-11-10 17:59:10 +01:00
Carlos Mocholí 939a861853
Update Python testing (#10269)
2021-11-04 18:26:24 +01:00
Carlos Mocholí 70570f9eaa
Minimize the number of docker jobs (#10202)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-10-29 07:48:05 +01:00
Carlos Mocholí 3a4e9970d6
Pin fairscale version (#10200)
2021-10-27 23:24:17 +00:00
Carlos Mocholí a0e45dc071
Some minor CI cleanup (#10088)
2021-10-26 13:58:20 +02:00
Jirka Borovec 74a09a23f1
CI: support PT 1.10 (#8133)
* prepare PT 1.10

* dockers

* fixes

* readme
2021-07-14 18:04:33 +03:00
Carlos Mocholí 6ce77a102b
Set minimum PyTorch version to 1.6 (#8288)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-07-13 17:12:49 +00:00
Sean Naren f7459f5328
DeepSpeed Infinity Update (#7234)
* Update configs to match latest API

* Ensure we move the entire model to device before configure optimizer is called

* Add missing param

* Expose parameters

* Update references, drop local rank as it's now inferred from the environment variable

* Fix ref

* Force install deepspeed 0.3.16

* Add guard for init

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Revert type checking

* Install master for CI for testing purposes

* Update CI

* Fix tests

* Add check

* Update versions

* Set precision

* Fix

* See if i can force upgrade

* Attempt to fix

* Drop

* Add changelog

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-06-14 16:38:28 +00:00
Jirka Borovec 6e56f56aa1
docker use $(nproc) (#7606)
* docker use $(nproc)

* Update typo

Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-05-19 21:48:14 +02:00
Jirka Borovec 626ef08694
enable Dockers for PT 1.9 (#7363)
* enable PT 1.9

* fix versions

* args

* fix
2021-05-05 14:26:22 +02:00
Adrian Wälchli 7636d422fa
Update DeepSpeed version requirement in Dockerfile (#7326)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-03 20:21:19 +02:00
Sean Naren 8439aead66
Update FairScale on CI (#7017)
* Try updating CI to latest fairscale

* Update availability of imports.py

* Remove some of the fairscale custom ci stuff

* Update grad scaler within the new process as reference is incorrect for spawn

* Remove fairscale from mocks

* Install fairscale 0.3.4 into the base container, remove from extra.txt

* Update docs/source/conf.py

* Fix import issues

* Mock fairscale for docs

* Fix DeepSpeed and FairScale to specific versions

* Swap back to greater than

* extras

* Revert "extras"

This reverts commit 7353479f

* ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-23 12:37:00 +01:00
Jirka Borovec 1e4bc69a16
Ban `tensorboard==2.5.0` and `deepspeed==0.3.15` (#7159)
* ban TB 2.5

* note

* push

* Ban tb==2.5.0 and deepspeed==0.3.15

* Fix pip command

* pull

* up

* up

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-22 11:08:21 -04:00
Sean Naren 5d8610955a
Fix `apex` version in Docker due to broken upstream (#7146)
* Set Apex commit before introduction of new MLP extensions

* Refactor install command

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-21 23:58:55 +01:00
Sean Naren b46cc557ef
[Feat] DeepSpeed single file saving (#6900)
* Add single checkpoint capability

* Fix checkpointing in test, few cleanups

* Add comment

* Change restore logic

* Move vars around, add better explanation, make todo align with DeepSpeed team

* Fix checkpointing

* Remove deepspeed from extra, install in Dockerfile

* push

* pull

* Split to two tests to see if it fixes Deepspeed error

* Add comment
2021-04-12 22:44:09 +00:00
thomas chaton 1302766f83
DeepSpeed ZeRO Update (#6546)
* Add context to call hook to handle all modules defined within the hook

* Expose some additional parameters

* Added docs, exposed parameters

* Make sure we only configure if necessary

* Setup activation checkpointing regardless, saves the user having to do it manually

* Add some tests that fail currently

* update

* update

* update

* add tests

* change docstring

* resolve accumulate_grad_batches

* resolve flake8

* Update DeepSpeed to use latest version, add some comments

* add metrics

* update

* Small formatting fixes, clean up some code

* Few cleanups

* No need for default state

* Fix tests, add some boilerplate that should move eventually

* Add hook removal

* Add a context manager to handle hook

* Small naming cleanup

* wip

* move save_checkpoint responsibility to accelerator

* resolve flake8

* add BC

* Change recommended scale to 16

* resolve flake8

* update test

* update install

* update

* update test

* update

* update

* update test

* resolve flake8

* update

* update

* update on comments

* Push

* pull

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* Apply suggestions from code review

* Swap to using world size defined by plugin

* update

* update todo

* Remove deepspeed from extra, keep it in the base cuda docker install

* Push

* pull

* update

* update

* update

* update

* Minor changes

* duplicate

* format

* format2

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 13:39:02 -04:00
Jirka Borovec 85c8074bee
require: adjust versions (#6363)
* adjust versions

* release

* manifest

* pep8

* CI

* fix

* build
2021-03-06 14:34:54 +01:00
Sean Naren 8440595b26
[CI] Move DeepSpeed into CUDA image, remove DeepSpeed install from azure (#6043)
* Move to CUDA image

* Remove deepspeed install as deepspeed now in the cuda image

* Remove path setting, as ninja should be in the container now
2021-02-17 18:51:31 -05:00
Sean Naren 5157ba5509
Add openmpi to our base cuda container for MPI support (#6026)
* Add openmpi to our base container for DeepSpeed MPI support

* conda

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-17 12:15:49 +00:00
Jirka Borovec c2c82dad62
CI: Azure (#5882)
* add base Azure pipeline

* skip
2021-02-10 04:43:26 -05:00
Sumanth Ratna 8732475701
Remove unnecessary intermediate layers in base-conda Dockerfile (#5697)
* [docker][base-conda] Combine ENV+COPY instructions

* [docker][base-cuda] Combine ENV+COPY instructions

* [docker][base-xla] Combine ENV+COPY instructions

* [docker][base-cuda] Fix COPY instruction

* [docker][base-xla] Fix quote in ENV

* [docker][base-xla] Fix $PATH in ENV

* [docker][base-conda] Fix COPY instruction

* chlog

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-05 21:40:40 +01:00
Jirka Borovec 9dd04028d5
tests for legacy checkpoints (#5223)
* wip

* generate

* clean

* tests

* copy

* download

* download

* download

* download

* download

* download

* download

* download

* download

* download

* download

* flake8

* extend

* aws

* extension

* pull

* pull

* pull

* pull

* pull

* pull

* pull

* try

* try

* try

* got it

* Apply suggestions from code review

(cherry picked from commit 72525f0a83)
2021-01-26 14:27:56 +01:00
Jirka Borovec 9be04c1c0b
try to update failing dockers (#5611)
2021-01-25 17:10:56 -05:00
Jirka Borovec 7e4d6cbe48
set minimal req. PT 1.4 (#5418)
* set minimal req. PT 1.4

* chlog
2021-01-12 19:15:35 -05:00
Jirka Borovec 2fe1eff85d
drop fairscale for PT <= 1.4 (#4910)
* drop fairscale for PT <= 1.4

* fix

* Add extra check to remove fairscale from minimal testing if using minimal torch version 1.3

* Update ci_test-full.yml

* Update gym to .3 to see if this fixes examples CI

* Update omegaconf to minimum for hydra v1.0

* Revert "Update gym to .3 to see if this fixes examples CI"

This reverts commit 4221d4b9

* Revert "Update omegaconf to minimum for hydra v1.0"

This reverts commit 4f579217

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2020-11-30 23:19:30 +00:00
Jirka Borovec bd6c413829
Conda: PT 1.8 (#3833)
* PT 1.8

* unfreeze PT

* drop nightly from full

* add PT 1.8 to workflow

* readme table

* cuda

* skip cuda

* test 1.8

* unfreeze torch vision

Co-authored-by: ydcjeff <ydcjeff@outlook.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-11-12 15:03:43 +01:00
Jeff Yang 23719e3c05
[dockers] install nvidia-dali-cudaXXX (#4532)
* [dockers] install nvidia-dali-cuda100

* Apply suggestions from code review

* build DALI

* build DALI

* build DALI

* dali from source

* dali from source

* use binaries

* qq

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-09 21:18:24 +06:30
Jirka Borovec ce8abd6255
Drone: use nightly build cuda docker images (#3658)
* upgrade PT version

* update docker

* docker

* try 1.5

* badge

* fix typo: dor -> for (#3918)

* prune

* prune

* env

* echo

* try

* notes

* env

* env

* env

* notes

* docker

* prune

* maintainer

* CI

* update

* just 1.5

* CI

* CI

* CI

* CI

* CI

* CI

* CI

* CI

* CI

* CI

* CI

* docker

* CI

* CI

* CI

* CI

* CI

* CI

* CI

* CI

* CI

* push

* try

* prune

* CI

* CI

* CI

* CI

Co-authored-by: Klyukin Valeriy <mr.clyukin@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-10-26 10:47:09 +00:00
Jeff Yang d83c4e4d69
Cache docker builds (#3659)
* parent faa357648f
author ydcjeff <ydcjeff@outlook.com> 1601049378 +0630
committer ydcjeff <ydcjeff@outlook.com> 1601469495 +0630

cache docker builds

lock horovod at 0.19.5

done [ci skip] [CI SKIP]

use --cache-from [ci skip]

typo and horovod [ci skip]

exclude pt 1.3 py3.8 [ci skip]

conda no cache [ci skip]

fix

* revert

* align with master [ci skip]

* retry

* remove empty continuation lines

* add comment

* fix build-args
2020-10-25 18:46:10 +06:30
Jeff Yang 90929fa433
Fix apt repo issue for docker (#3823)
* fix docker repo issue

* docker

* docker

* docker

* no cudnn

* no cudnn

* try 16.04

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-10-05 23:18:14 -04:00
Jirka Borovec 1160270882
fix path in CI for release & python version in all dockers & duplicated badges (#3765)
* typo

* path

* check

* trigger

* fix conda

* pip ver

* fix cuda

* fix XLA

* fix xla

* ci

* docker

* BIULD

* unBIULD

* update

* py 3.8

* apex

* apex
2020-10-02 05:26:21 -04:00
Jirka Borovec a0968e4bdf
fix PT version in CUDA docker images (#3739)
* upgrade PT version

* update docker

* docker

* try 1.5

* fix docker versions

* old

* badge
2020-09-30 08:33:22 -04:00
Jirka Borovec a94728c99b
spec Horovod version (#3661)
* spec Horovod version

* MAKEFLAGS="-j2"

* tests

* CI

* docker

* CI

* docker
2020-09-26 19:30:25 +02:00
Jirka Borovec 0784cf3ab4
dockers nightly (#3615)
* dockers nightly

* typo

* Apply suggestions from code review

Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-09-25 15:58:01 +02:00
Jirka Borovec 37a59be21b
build more docker configs (#3533)
* update build cases

* list

* matrix

* matrix

* builds

* docker

* -j1

* -q

* -q

* sep

* docker

* docker

* mergify

* -j1

* -j1

* horovod

* copy
2020-09-23 01:41:35 +02:00
Jeff Yang 8be79a9a96
stable, dev PyTorch in Dockerfile and conda gh actions (#3074)
* dockerfile and actions file

* dockerfile and actions file

* added pytorch conda cpu nightly

* added pytorch conda cpu nightly

* recopy base reqs

* gh action `include` torch nightly

* add pytorch nightly & conda gh badge

* rebase

* fix horovod

* proposal refactor

* Update .github/workflows/ci_pt-conda.yml

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update .github/workflows/ci_pt-conda.yml

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

* update

* fix cmd

* filled &&

* fix

* add -y

* torchvision >0.7 allowed

* explicitly install torchvision

* use HOROVOD_GPU_OPERATIONS env variable

* CI

* skip 1.7

* table

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-09-17 20:30:39 +02:00