Commit Graph

94 Commits

Author SHA1 Message Date
Aki Nitta 0a1b8b880d
Fix horovod installation `base-cuda` Dockerfile (#11811)
* pip install --user

* add checks

* rm unrelated comment

* consistent format

* Fail if horovod not found

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-10 16:48:33 +09:00
Aki Nitta 86b177ebe5
Fix `apex` installation path in Dockerfile (#11596)
* empty commit

* Specify apex installation target directory

* pip install --user
2022-01-27 20:14:16 -05:00
Kaushik B 650c710efa
Rename training plugin test files & names to strategy (#11303) 2022-01-04 14:32:45 +01:00
Carlos Mocholí 3692eba807
Drop Python 3.6 support (#11117) 2021-12-21 17:06:15 +00:00
Kaushik B 2a5d05b562
Fix tpu spawn plugin test (#11131) 2021-12-18 02:53:37 +00:00
Sean Naren c66cd12445
Remove partitioning of model in ZeRO 3 (#10655) 2021-12-17 12:36:53 +00:00
Jirka Borovec e8659bd40e
update NGC (#10770) 2021-11-29 14:14:37 +00:00
Carlos Mocholí d2aaf6b4cc
Upgrade CI after the 1.10 release (#10075) 2021-11-10 17:59:10 +01:00
Carlos Mocholí 939a861853
Update Python testing (#10269) 2021-11-04 18:26:24 +01:00
Carlos Mocholí 70570f9eaa
Minimize the number of docker jobs (#10202)
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-10-29 07:48:05 +01:00
Carlos Mocholí 3a4e9970d6
Pin fairscale version (#10200) 2021-10-27 23:24:17 +00:00
Carlos Mocholí a0e45dc071
Some minor CI cleanup (#10088) 2021-10-26 13:58:20 +02:00
Kaushik B af4a8f1950
Refactor tests for TPU Accelerator (#9718)
Co-authored-by: tchaton <thomas@grid.ai>
2021-10-14 19:45:15 +00:00
Danielle Pintz 940b910d27
[2/4] Add DeviceStatsMonitor callback (#9712)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-10-13 18:29:36 +00:00
edwardpwtsoi 7c6efbc8a8
Resolved wrong mv usage for extracted directory (#9678)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-10-05 12:56:33 +00:00
Jirka Borovec 0e6ee9c39d
CI: add mdformat (#8673)
* add mdformat
* exclude chlog
* fix ***

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-03 18:19:09 +00:00
Jirka Borovec 66cc505339
update NGC (#8652)
* update NGC

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-02 16:05:36 +00:00
Jirka Borovec abbcfa1ab7
fix CI for PT 1.10 (#8526)
* fix CI for PT 1.10
* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-23 19:24:31 +02:00
thomas chaton 8d0df6fad2
[Feat] Improve TPU CI (#6078)
* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* i

* update

* update ci

* i

* i

* i

* i
2021-07-19 19:43:21 +05:30
Jirka Borovec 74a09a23f1
CI: support PT 1.10 (#8133)
* prepare PT 1.10

* dockers

* fixes

* readme
2021-07-14 18:04:33 +03:00
Carlos Mocholí 6ce77a102b
Set minimum PyTorch version to 1.6 (#8288)
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-07-13 17:12:49 +00:00
Jirka Borovec ed6d4baea2
ngc (#8242) 2021-07-02 13:12:45 +01:00
Kaushik B 2f3c65e57b
XLA Profiler integration (#8014) 2021-06-29 00:58:05 +05:30
Sean Naren f7459f5328
DeepSpeed Infinity Update (#7234)
* Update configs to match latest API

* Ensure we move the entire model to device before configure optimizer is called

* Add missing param

* Expose parameters

* Update references, drop local rank as it's now inferred from the environment variable

* Fix ref

* Force install deepspeed 0.3.16

* Add guard for init

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Revert type checking

* Install master for CI for testing purposes

* Update CI

* Fix tests

* Add check

* Update versions

* Set precision

* Fix

* See if i can force upgrade

* Attempt to fix

* Drop

* Add changelog

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-06-14 16:38:28 +00:00
Jirka Borovec 7b531ac7ac
Fix NVIDIA docker versions (#7834) 2021-06-06 23:56:27 +02:00
Jirka Borovec 9a001fea22
update NGC docker (#7787) 2021-06-01 12:11:29 +02:00
Tomy Hsieh 037a71b156
Update README.md (#7717) 2021-05-26 12:58:11 +02:00
Kaushik B 2c10ecc232
MAINTAINER has been deprecated (#7683) 2021-05-25 00:01:31 +05:30
Jirka Borovec 6e56f56aa1
docker use $(nproc) (#7606)
* docker use $(nproc)

* Update typo

Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-05-19 21:48:14 +02:00
Jirka Borovec 298f9e5c2d
Prune deprecated utils modules (#7503)
* argparse_utils

* model_utils

* warning_utils

* xla_device_utils

* chlog

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-13 07:24:42 +00:00
Jirka Borovec db54b30776
Update README to 1.3 (#7489) 2021-05-12 13:36:52 +02:00
Louis Taylor 2b7e65b747
Add base IPU dockerfiles (#7252) 2021-05-07 12:07:29 +00:00
Jirka Borovec 1a27c12b26
update ngc for 1.3 (#7414) 2021-05-07 13:13:54 +02:00
Jirka Borovec 626ef08694
enable Dockers for PT 1.9 (#7363)
* enable PT 1.9

* fix versions

* args

* fix
2021-05-05 14:26:22 +02:00
Carlos Mocholí c6a171b776
Fix requirements/adjust_versions.py (#7149)
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-04 01:06:28 +02:00
Adrian Wälchli 7636d422fa
Update DeepSpeed version requirement in Dockerfile (#7326)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-03 20:21:19 +02:00
Jirka Borovec a153c15c90
Docker/nvidia (#7109)
* version check

* ...
2021-04-27 20:29:49 +01:00
Sean Naren 8439aead66
Update FairScale on CI (#7017)
* Try updating CI to latest fairscale

* Update availability of imports.py

* Remove some of the fairscale custom ci stuff

* Update grad scaler within the new process as reference is incorrect for spawn

* Remove fairscale from mocks

* Install fairscale 0.3.4 into the base container, remove from extra.txt

* Update docs/source/conf.py

* Fix import issues

* Mock fairscale for docs

* Fix DeepSpeed and FairScale to specific versions

* Swap back to greater than

* extras

* Revert "extras"

This reverts commit 7353479f

* ci

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-23 12:37:00 +01:00
Jirka Borovec 1e4bc69a16
Ban `tensorboard==2.5.0` and `deepspeed==0.3.15` (#7159)
* ban TB 2.5

* note

* push

* Ban tb==2.5.0 and deepspeed==0.3.15

* Fix pip command

* pull

* up

* up

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-22 11:08:21 -04:00
Sean Naren 5d8610955a
Fix `apex` version in Docker due to broken upstream (#7146)
* Set Apex commit before introduction of new MLP extensions

* Refactor install command

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-21 23:58:55 +01:00
Jirka Borovec da1ac3a530
update docker base on PT 1.7 (#6931)
* update docker base on PT 1.7

* fix path
2021-04-13 10:06:06 +01:00
Sean Naren b46cc557ef
[Feat] DeepSpeed single file saving (#6900)
* Add single checkpoint capability

* Fix checkpointing in test, few cleanups

* Add comment

* Change restore logic

* Move vars around, add better explanation, make todo align with DeepSpeed team

* Fix checkpointing

* Remove deepspeed from extra, install in Dockerfile

* push

* pull

* Split to two tests to see if it fixes Deepspeed error

* Add comment
2021-04-12 22:44:09 +00:00
thomas chaton 1302766f83
DeepSpeed ZeRO Update (#6546)
* Add context to call hook to handle all modules defined within the hook

* Expose some additional parameters

* Added docs, exposed parameters

* Make sure we only configure if necessary

* Setup activation checkpointing regardless, saves the user having to do it manually

* Add some tests that fail currently

* update

* update

* update

* add tests

* change docstring

* resolve accumulate_grad_batches

* resolve flake8

* Update DeepSpeed to use latest version, add some comments

* add metrics

* update

* Small formatting fixes, clean up some code

* Few cleanups

* No need for default state

* Fix tests, add some boilerplate that should move eventually

* Add hook removal

* Add a context manager to handle hook

* Small naming cleanup

* wip

* move save_checkpoint responsibility to accelerator

* resolve flake8

* add BC

* Change recommended scale to 16

* resolve flake8

* update test

* update install

* update

* update test

* update

* update

* update test

* resolve flake8

* update

* update

* update on comments

* Push

* pull

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update

* Apply suggestions from code review

* Swap to using world size defined by plugin

* update

* update todo

* Remove deepspeed from extra, keep it in the base cuda docker install

* Push

* pull

* update

* update

* update

* update

* Minor changes

* duplicate

* format

* format2

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 13:39:02 -04:00
Jirka Borovec dcf6e4e310
remake nvidia docker (#6686)
* use latest

* remake

* examples
2021-03-29 09:39:06 +01:00
Jirka Borovec 5780796931
NGC container PoC (#6187)
* add NVIDIA flows

* push

* pull

* ...

* extras

* ci prune

* fix

* tag

* .

* list
2021-03-20 02:55:46 +05:30
Jirka Borovec 85c8074bee
require: adjust versions (#6363)
* adjust versions

* release

* manifest

* pep8

* CI

* fix

* build
2021-03-06 14:34:54 +01:00
Sean Naren 8440595b26
[CI] Move DeepSpeed into CUDA image, remove DeepSpeed install from azure (#6043)
* Move to CUDA image

* Remove deepspeed install as deepspeed now in the cuda image

* Remove path setting, as ninja should be in the container now
2021-02-17 18:51:31 -05:00
Sean Naren 5157ba5509
Add openmpi to our base cuda container for MPI support (#6026)
* Add openmpi to our base container for DeepSpeed MPI support

* conda

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-17 12:15:49 +00:00
Jirka Borovec b5d7d08da5
fix nightly releases & readme (#5922)
* fix nightly releases

* readme

* cuda

* docker

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* revert

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-02-16 13:46:28 -05:00
Adrian Wälchli a3d4e7c86a
move accelerator legacy tests (#5948)
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-13 19:42:18 -05:00