105 Commits

Author SHA1 Message Date
Kaushik B bd035af78a
Fix TPU CI (#12419) 2022-03-23 11:35:38 +05:30
Jirka Borovec fe940e195d
CI: update prune_pkgs (#12382) 2022-03-21 12:50:50 +00:00
four4fish 1eff3b53c1
Update fairscale version (#11567)
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-21 11:38:55 +00:00
Jirka Borovec efa870eebc
Docker: fix NCCL building Horovod (#12318)
* Horovod w. MPI
* nccl_built
* fix
2022-03-18 14:23:19 +00:00
Jirka Borovec 7ee690758c
CI: fix running PT 1.11 (#12304)
* fix fire
* horovod
* assistant
* cmake
* u20
* cuda
* -j2
* fix mypy

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-12 09:00:20 +00:00
Jirka Borovec bc8172856f
aggregate multiple helper scripts to single CLI (#11147)
* nightly release
* min version
* fire
2022-03-11 11:13:43 +00:00
Jirka Borovec 1144673cd9
CI: sanity check for req. pkgs (#11819)
* CI: sanity check for req. pkgs
* scripts
* rename
* gcsfs ?
* rich !
* install extra
* move
* set -e

Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-03-11 09:20:47 +00:00
Jirka Borovec 3b4061f39a
CI: enable testing for PT 1.11 (#11792)
* enable PT 1.11
* horovod
* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
2022-03-10 18:38:47 +00:00
Jirka Borovec 8577ef7bba
Skip horovod 0.24.0 only (#12248)
* try skip horovod 0.24.0 only
* HOROVOD_BUILD_CUDA_CC_LIST
* fix test

Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-03-10 16:01:08 +00:00
wangraying a0655611de
Add bagua installation in dockerfile (#11283)
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-24 15:17:31 +01:00
Jirka Borovec 7bc87015ea
Unblock GPU CI (#11934)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-16 21:15:44 +01:00
Aki Nitta 0a1b8b880d
Fix horovod installation `base-cuda` Dockerfile (#11811)
* pip install --user

* add checks

* rm unrelated comment

* consistent format

* Fail if horovod not found

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-10 16:48:33 +09:00
Aki Nitta 86b177ebe5
Fix `apex` installation path in Dockerfile (#11596)
* empty commit

* Specify apex installation target directory

* pip install --user
2022-01-27 20:14:16 -05:00
Kaushik B 650c710efa
Rename training plugin test files & names to strategy (#11303) 2022-01-04 14:32:45 +01:00
Carlos Mocholí 3692eba807
Drop Python 3.6 support (#11117) 2021-12-21 17:06:15 +00:00
Kaushik B 2a5d05b562
Fix tpu spawn plugin test (#11131) 2021-12-18 02:53:37 +00:00
Sean Naren c66cd12445
Remove partitioning of model in ZeRO 3 (#10655) 2021-12-17 12:36:53 +00:00
Jirka Borovec e8659bd40e
update NGC (#10770) 2021-11-29 14:14:37 +00:00
Carlos Mocholí d2aaf6b4cc
Upgrade CI after the 1.10 release (#10075) 2021-11-10 17:59:10 +01:00
Carlos Mocholí 939a861853
Update Python testing (#10269) 2021-11-04 18:26:24 +01:00
Carlos Mocholí 70570f9eaa
Minimize the number of docker jobs (#10202)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-10-29 07:48:05 +01:00
Carlos Mocholí 3a4e9970d6
Pin fairscale version (#10200) 2021-10-27 23:24:17 +00:00
Carlos Mocholí a0e45dc071
Some minor CI cleanup (#10088) 2021-10-26 13:58:20 +02:00
Kaushik B af4a8f1950
Refactor tests for TPU Accelerator (#9718)
Co-authored-by: tchaton <thomas@grid.ai>
2021-10-14 19:45:15 +00:00
Danielle Pintz 940b910d27
[2/4] Add DeviceStatsMonitor callback (#9712)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-10-13 18:29:36 +00:00
edwardpwtsoi 7c6efbc8a8
Resolved wrong mv usage for extracted directory (#9678)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-10-05 12:56:33 +00:00
Jirka Borovec 0e6ee9c39d
CI: add mdformat (#8673)
* add mdformat
* exclude chlog
* fix ***

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-03 18:19:09 +00:00
Jirka Borovec 66cc505339
update NGC (#8652)
* update NGC

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-02 16:05:36 +00:00
Jirka Borovec abbcfa1ab7
fix CI for PT 1.10 (#8526)
* fix CI for PT 1.10
* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-23 19:24:31 +02:00
thomas chaton 8d0df6fad2
[Feat] Improve TPU CI (#6078)
* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* update

* update ci

* i

* i

* i

* i
2021-07-19 19:43:21 +05:30
Jirka Borovec 74a09a23f1
CI: support PT 1.10 (#8133)
* prepare PT 1.10

* dockers

* fixes

* readme
2021-07-14 18:04:33 +03:00
Carlos Mocholí 6ce77a102b
Set minimum PyTorch version to 1.6 (#8288)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-07-13 17:12:49 +00:00
Jirka Borovec ed6d4baea2
ngc (#8242) 2021-07-02 13:12:45 +01:00
Kaushik B 2f3c65e57b
XLA Profiler integration (#8014) 2021-06-29 00:58:05 +05:30
Sean Naren f7459f5328
DeepSpeed Infinity Update (#7234)
* Update configs to match latest API

* Ensure we move the entire model to device before configure optimizer is called

* Add missing param

* Expose parameters

* Update references, drop local rank as it's now inferred from the environment variable

* Fix ref

* Force install deepspeed 0.3.16

* Add guard for init

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Revert type checking

* Install master for CI for testing purposes

* Update CI

* Fix tests

* Add check

* Update versions

* Set precision

* Fix

* See if i can force upgrade

* Attempt to fix

* Drop

* Add changelog

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-06-14 16:38:28 +00:00
Jirka Borovec 7b531ac7ac
Fix NVIDIA docker versions (#7834) 2021-06-06 23:56:27 +02:00
Jirka Borovec 9a001fea22
update NGC docker (#7787) 2021-06-01 12:11:29 +02:00
Tomy Hsieh 037a71b156
Update README.md (#7717) 2021-05-26 12:58:11 +02:00
Kaushik B 2c10ecc232
MAINTAINER has been deprecated (#7683) 2021-05-25 00:01:31 +05:30
Jirka Borovec 6e56f56aa1
docker use $(nproc) (#7606)
* docker use $(nproc)

* Update typo

Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-05-19 21:48:14 +02:00
Jirka Borovec 298f9e5c2d
Prune deprecated utils modules (#7503)
* argparse_utils

* model_utils

* warning_utils

* xla_device_utils

* chlog

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-13 07:24:42 +00:00
Jirka Borovec db54b30776
Update README to 1.3 (#7489) 2021-05-12 13:36:52 +02:00
Louis Taylor 2b7e65b747
Add base IPU dockerfiles (#7252) 2021-05-07 12:07:29 +00:00
Jirka Borovec 1a27c12b26
update ngc for 1.3 (#7414) 2021-05-07 13:13:54 +02:00
Jirka Borovec 626ef08694
enable Dockers for PT 1.9 (#7363)
* enable PT 1.9

* fix versions

* args

* fix
2021-05-05 14:26:22 +02:00
Carlos Mocholí c6a171b776
Fix requirements/adjust_versions.py (#7149)
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-04 01:06:28 +02:00
Adrian Wälchli 7636d422fa
Update DeepSpeed version requirement in Dockerfile (#7326)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-03 20:21:19 +02:00
Jirka Borovec a153c15c90
Docker/nvidia (#7109)
* version check

* ...
2021-04-27 20:29:49 +01:00
Sean Naren 8439aead66
Update FairScale on CI (#7017)
* Try updating CI to latest fairscale

* Update availability of imports.py

* Remove some of the fairscale custom ci stuff

* Update grad scaler within the new process as reference is incorrect for spawn

* Remove fairscale from mocks

* Install fairscale 0.3.4 into the base container, remove from extra.txt

* Update docs/source/conf.py

* Fix import issues

* Mock fairscale for docs

* Fix DeepSpeed and FairScale to specific versions

* Swap back to greater than

* extras

* Revert "extras"

This reverts commit 7353479f

* ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-23 12:37:00 +01:00
Jirka Borovec 1e4bc69a16
Ban `tensorboard==2.5.0` and `deepspeed==0.3.15` (#7159)
* ban TB 2.5

* note

* push

* Ban tb==2.5.0 and deepspeed==0.3.15

* Fix pip command

* pull

* up

* up

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-22 11:08:21 -04:00