Kaushik B
bd035af78a
Fix TPU CI ( #12419 )
2022-03-23 11:35:38 +05:30
Jirka Borovec
fe940e195d
CI: update prune_pkgs ( #12382 )
2022-03-21 12:50:50 +00:00
four4fish
1eff3b53c1
Update fairscale version ( #11567 )
...
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-21 11:38:55 +00:00
Jirka Borovec
efa870eebc
Docker: fix NCCL building Horovod ( #12318 )
...
* Horovod w. MPI
* nccl_built
* fix
2022-03-18 14:23:19 +00:00
Jirka Borovec
7ee690758c
CI: fix running PT 1.11 ( #12304 )
...
* fix fire
* horovod
* assistant
* cmake
* u20
* cuda
* -j2
* fix mypy
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-12 09:00:20 +00:00
Jirka Borovec
bc8172856f
aggregate multiple helper scripts to single CLI ( #11147 )
...
* nightly release
* min version
* fire
2022-03-11 11:13:43 +00:00
Jirka Borovec
1144673cd9
CI: sanity check for req. pkgs ( #11819 )
...
* CI: sanity check for req. pkgs
* scripts
* rename
* gcsfs ?
* rich !
* install extra
* move
* set -e
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-03-11 09:20:47 +00:00
Jirka Borovec
3b4061f39a
CI: enable testing for PT 1.11 ( #11792 )
...
* enable PT 1.11
* horovod
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
2022-03-10 18:38:47 +00:00
Jirka Borovec
8577ef7bba
Skip horovod 0.24.0 only ( #12248 )
...
* try skip horovod 0.24.0 only
* HOROVOD_BUILD_CUDA_CC_LIST
* fix test
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-03-10 16:01:08 +00:00
wangraying
a0655611de
Add bagua installation in dockerfile ( #11283 )
...
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-24 15:17:31 +01:00
Jirka Borovec
7bc87015ea
Unblock GPU CI ( #11934 )
...
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-16 21:15:44 +01:00
Aki Nitta
0a1b8b880d
Fix horovod installation `base-cuda` Dockerfile ( #11811 )
...
* pip install --user
* add checks
* rm unrelated comment
* consistent format
* Fail if horovod not found
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-10 16:48:33 +09:00
Aki Nitta
86b177ebe5
Fix `apex` installation path in Dockerfile ( #11596 )
...
* empty commit
* Specify apex installation target directory
* pip install --user
2022-01-27 20:14:16 -05:00
Kaushik B
650c710efa
Rename training plugin test files & names to strategy ( #11303 )
2022-01-04 14:32:45 +01:00
Carlos Mocholí
3692eba807
Drop Python 3.6 support ( #11117 )
2021-12-21 17:06:15 +00:00
Kaushik B
2a5d05b562
Fix tpu spawn plugin test ( #11131 )
2021-12-18 02:53:37 +00:00
Sean Naren
c66cd12445
Remove partitioning of model in ZeRO 3 ( #10655 )
2021-12-17 12:36:53 +00:00
Jirka Borovec
e8659bd40e
update NGC ( #10770 )
2021-11-29 14:14:37 +00:00
Carlos Mocholí
d2aaf6b4cc
Upgrade CI after the 1.10 release ( #10075 )
2021-11-10 17:59:10 +01:00
Carlos Mocholí
939a861853
Update Python testing ( #10269 )
2021-11-04 18:26:24 +01:00
Carlos Mocholí
70570f9eaa
Minimize the number of docker jobs ( #10202 )
...
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-10-29 07:48:05 +01:00
Carlos Mocholí
3a4e9970d6
Pin fairscale version ( #10200 )
2021-10-27 23:24:17 +00:00
Carlos Mocholí
a0e45dc071
Some minor CI cleanup ( #10088 )
2021-10-26 13:58:20 +02:00
Kaushik B
af4a8f1950
Refactor tests for TPU Accelerator ( #9718 )
...
Co-authored-by: tchaton <thomas@grid.ai>
2021-10-14 19:45:15 +00:00
Danielle Pintz
940b910d27
[2/4] Add DeviceStatsMonitor callback ( #9712 )
...
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-10-13 18:29:36 +00:00
edwardpwtsoi
7c6efbc8a8
Resolved wrong mv usage for extracted directory ( #9678 )
...
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-10-05 12:56:33 +00:00
Jirka Borovec
0e6ee9c39d
CI: add mdformat ( #8673 )
...
* add mdformat
* exclude chlog
* fix ***
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-03 18:19:09 +00:00
Jirka Borovec
66cc505339
update NGC ( #8652 )
...
* update NGC
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-02 16:05:36 +00:00
Jirka Borovec
abbcfa1ab7
fix CI for PT 1.10 ( #8526 )
...
* fix CI for PT 1.10
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-23 19:24:31 +02:00
thomas chaton
8d0df6fad2
[Feat] Improve TPU CI ( #6078 )
...
* i
* update
* update ci
* i
2021-07-19 19:43:21 +05:30
Jirka Borovec
74a09a23f1
CI: support PT 1.10 ( #8133 )
...
* prepare PT 1.10
* dockers
* fixes
* readme
2021-07-14 18:04:33 +03:00
Carlos Mocholí
6ce77a102b
Set minimum PyTorch version to 1.6 ( #8288 )
...
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-07-13 17:12:49 +00:00
Jirka Borovec
ed6d4baea2
ngc ( #8242 )
2021-07-02 13:12:45 +01:00
Kaushik B
2f3c65e57b
XLA Profiler integration ( #8014 )
2021-06-29 00:58:05 +05:30
Sean Naren
f7459f5328
DeepSpeed Infinity Update ( #7234 )
...
* Update configs to match latest API
* Ensure we move the entire model to device before configure optimizer is called
* Add missing param
* Expose parameters
* Update references, drop local rank as it's now inferred from the environment variable
* Fix ref
* Force install deepspeed 0.3.16
* Add guard for init
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Revert type checking
* Install master for CI for testing purposes
* Update CI
* Fix tests
* Add check
* Update versions
* Set precision
* Fix
* See if i can force upgrade
* Attempt to fix
* Drop
* Add changelog
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-06-14 16:38:28 +00:00
Jirka Borovec
7b531ac7ac
Fix NVIDIA docker versions ( #7834 )
2021-06-06 23:56:27 +02:00
Jirka Borovec
9a001fea22
update NGC docker ( #7787 )
2021-06-01 12:11:29 +02:00
Tomy Hsieh
037a71b156
Update README.md ( #7717 )
2021-05-26 12:58:11 +02:00
Kaushik B
2c10ecc232
MAINTAINER has been deprecated ( #7683 )
2021-05-25 00:01:31 +05:30
Jirka Borovec
6e56f56aa1
docker use $(nproc) ( #7606 )
...
* docker use $(nproc)
* Update typo
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-05-19 21:48:14 +02:00
Jirka Borovec
298f9e5c2d
Prune deprecated utils modules ( #7503 )
...
* argparse_utils
* model_utils
* warning_utils
* xla_device_utils
* chlog
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-13 07:24:42 +00:00
Jirka Borovec
db54b30776
Update README to 1.3 ( #7489 )
2021-05-12 13:36:52 +02:00
Louis Taylor
2b7e65b747
Add base IPU dockerfiles ( #7252 )
2021-05-07 12:07:29 +00:00
Jirka Borovec
1a27c12b26
update ngc for 1.3 ( #7414 )
2021-05-07 13:13:54 +02:00
Jirka Borovec
626ef08694
enable Dockers for PT 1.9 ( #7363 )
...
* enable PT 1.9
* fix versions
* args
* fix
2021-05-05 14:26:22 +02:00
Carlos Mocholí
c6a171b776
Fix requirements/adjust_versions.py ( #7149 )
...
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-04 01:06:28 +02:00
Adrian Wälchli
7636d422fa
Update DeepSpeed version requirement in Dockerfile ( #7326 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-03 20:21:19 +02:00
Jirka Borovec
a153c15c90
Docker/nvidia ( #7109 )
...
* version check
* ...
2021-04-27 20:29:49 +01:00
Sean Naren
8439aead66
Update FairScale on CI ( #7017 )
...
* Try updating CI to latest fairscale
* Update availability of imports.py
* Remove some of the fairscale custom ci stuff
* Update grad scaler within the new process as reference is incorrect for spawn
* Remove fairscale from mocks
* Install fairscale 0.3.4 into the base container, remove from extra.txt
* Update docs/source/conf.py
* Fix import issues
* Mock fairscale for docs
* Fix DeepSpeed and FairScale to specific versions
* Swap back to greater than
* extras
* Revert "extras"
This reverts commit 7353479f
* ci
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-23 12:37:00 +01:00
Jirka Borovec
1e4bc69a16
Ban `tensorboard==2.5.0` and `deepspeed==0.3.15` ( #7159 )
...
* ban TB 2.5
* note
* push
* Ban tb==2.5.0 and deepspeed==0.3.15
* Fix pip command
* pull
* up
* up
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-22 11:08:21 -04:00