Jirka Borovec
759e89df21
Future 1/n: package in src/ folder ( #13293 )
...
* move: pytorch_lightning >> src/
* update setup & install
* update CI
* ci
* update CI for examples
* Self review
* mypy
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* ci
* make
* docs
* typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* ci: gpu
* .
* hpu
* typing
* docs
* tpu
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-06-14 20:54:55 -04:00
Carlos Mocholí
0cf9d73d28
Drop PyTorch 1.8 support ( #13155 )
...
* Drop PyTorch 1.8 support
* Missed update
* Skip profiler test until supported
* Upgrade ipu dockerfile pytorch version
* Update XLA version
2022-06-14 20:46:44 -04:00
Jirka Borovec
78ff201c7e
Update CI setup ( #13291 )
...
* drop mamba
* use legacy GPU machines
2022-06-14 17:11:54 +00:00
Akarsha Rao
bfa8b7be2d
Create hpu-ci-runner Dockerfile ( #13239 )
...
* Create hpu-ci-runner Dockerfile
* Add ENTRYPOINT script 'start.sh' to hpu-ci-runner
* rename dirs
* ci
* add docker
* Fix build failure
* Fix build failure
* Fix title of nightly ci runner build
* Fix comments
* Fix comments
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-06-08 16:02:16 -04:00
Akihiro Nitta
3c5a8a833e
Decouple pulling legacy checkpoints from existing GHA workflows and docker files ( #13185 )
...
* Add pull-legacy-checkpoints action
* Replace pulls with the new action and script
* Simplify
2022-06-02 15:39:14 +02:00
Jirka Borovec
de4ab1c027
update NGC docker ( #13136 )
...
* update docker
* Apply suggestions from code review
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-06-02 12:54:13 +00:00
Jirka Borovec
fab2ff35ad
CI: Azure - multiple configs ( #12984 )
...
* CI: Azure - multiple configs
* names
* benchmark
* Apply suggestions from code review
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-05-14 01:59:03 +00:00
Jirka Borovec
fec9a09672
add freeze for development and full range for install ( #12994 )
...
* freeze versions
* unfreeze
* dependabot
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* fix all req
* ...
* use base
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix refs
* Apply suggestions from code review
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
* Apply suggestions from code review
* dockers
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-05-12 09:14:18 -04:00
Eric Wiener
3f78c4ca7a
Track CPU stats with DeviceStatsMonitor ( #11795 )
...
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-05-10 10:57:38 +00:00
Jirka Borovec
783ec43a85
parse strategies as own extras ( #12975 )
...
* parse strategies as own extras
* prune devel
* Update Makefile
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* revert parse_requirements
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-05-09 09:25:53 -04:00
Jirka Borovec
7ce948edb6
Unpin CUDA docker image for GPU CI ( #12373 )
...
* unpin CUDA docker image for GPU CI
* Apply suggestions from code review
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-05-06 02:56:57 +00:00
Jirka Borovec
bb51e2a55b
Merge pull request #12723 from PyTorchLightning/req/strategies
...
Separate strategies' requirements
2022-05-04 10:06:02 -04:00
Akihiro Nitta
ecd135e939
Update nvidia gpg key to fix nightly docker builds ( #12930 )
...
* Update gpg key
* Use curl instead of wget
* Install key manually
2022-05-02 09:00:44 +02:00
Akihiro Nitta
98b206e836
Use cmake installed with apt ( #12907 )
2022-04-28 07:44:52 +00:00
Akihiro Nitta
ace6a5827b
Update building docker images ( #12837 )
...
Co-authored-by: Akihiro Nitta <akihiro@pytorchlightning.ai>
2022-04-21 22:10:42 +00:00
Jirka Borovec
16b9580958
build more dockers & slack fails ( #12675 )
...
* build dockers
* add slack
* Apply suggestions from code review
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
2022-04-13 17:24:08 +02:00
Jirka Borovec
f9b69ce5b0
CI: check docker requires ( #12677 )
...
* check docker requires
* ci update
* bagua
* conda
* cuda
2022-04-12 00:29:54 +09:00
Kaushik B
bd035af78a
Fix TPU CI ( #12419 )
2022-03-23 11:35:38 +05:30
Jirka Borovec
fe940e195d
CI: update prune_pkgs ( #12382 )
2022-03-21 12:50:50 +00:00
four4fish
1eff3b53c1
Update fairscale version ( #11567 )
...
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-21 11:38:55 +00:00
Jirka Borovec
efa870eebc
Docker: fix NCCL building Horovod ( #12318 )
...
* Horovod w. MPI
* nccl_built
* fix
2022-03-18 14:23:19 +00:00
Jirka Borovec
7ee690758c
CI: fix running PT 1.11 ( #12304 )
...
* fix fire
* horovod
* assistant
* cmake
* u20
* cuda
* -j2
* fix mypy
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2022-03-12 09:00:20 +00:00
Jirka Borovec
bc8172856f
aggregate multiple helper scripts to single CLI ( #11147 )
...
* nightly release
* min version
* fire
2022-03-11 11:13:43 +00:00
Jirka Borovec
1144673cd9
CI: sanity check for req. pkgs ( #11819 )
...
* CI: sanity check for req. pkgs
* scripts
* rename
* gcsfs ?
* rich !
* install extra
* move
* set -e
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2022-03-11 09:20:47 +00:00
Jirka Borovec
3b4061f39a
CI: enable testing for PT 1.11 ( #11792 )
...
* enable PT 1.11
* horovod
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
2022-03-10 18:38:47 +00:00
Jirka Borovec
8577ef7bba
Skip horovod 0.24.0 only ( #12248 )
...
* try skip horovod 0.24.0 only
* HOROVOD_BUILD_CUDA_CC_LIST
* fix test
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-03-10 16:01:08 +00:00
wangraying
a0655611de
Add bagua installation in dockerfile ( #11283 )
...
Co-authored-by: Aki Nitta <nitta@akihironitta.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2022-02-24 15:17:31 +01:00
Jirka Borovec
7bc87015ea
Unblock GPU CI ( #11934 )
...
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2022-02-16 21:15:44 +01:00
Aki Nitta
0a1b8b880d
Fix horovod installation `base-cuda` Dockerfile ( #11811 )
...
* pip install --user
* add checks
* rm unrelated comment
* consistent format
* Fail if horovod not found
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2022-02-10 16:48:33 +09:00
Aki Nitta
86b177ebe5
Fix `apex` installation path in Dockerfile ( #11596 )
...
* empty commit
* Specify apex installation target directory
* pip install --user
2022-01-27 20:14:16 -05:00
Kaushik B
650c710efa
Rename training plugin test files & names to strategy ( #11303 )
2022-01-04 14:32:45 +01:00
Carlos Mocholí
3692eba807
Drop Python 3.6 support ( #11117 )
2021-12-21 17:06:15 +00:00
Kaushik B
2a5d05b562
Fix tpu spawn plugin test ( #11131 )
2021-12-18 02:53:37 +00:00
Sean Naren
c66cd12445
Remove partitioning of model in ZeRO 3 ( #10655 )
2021-12-17 12:36:53 +00:00
Jirka Borovec
e8659bd40e
update NGC ( #10770 )
2021-11-29 14:14:37 +00:00
Carlos Mocholí
d2aaf6b4cc
Upgrade CI after the 1.10 release ( #10075 )
2021-11-10 17:59:10 +01:00
Carlos Mocholí
939a861853
Update Python testing ( #10269 )
2021-11-04 18:26:24 +01:00
Carlos Mocholí
70570f9eaa
Minimize the number of docker jobs ( #10202 )
...
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-10-29 07:48:05 +01:00
Carlos Mocholí
3a4e9970d6
Pin fairscale version ( #10200 )
2021-10-27 23:24:17 +00:00
Carlos Mocholí
a0e45dc071
Some minor CI cleanup ( #10088 )
2021-10-26 13:58:20 +02:00
Kaushik B
af4a8f1950
Refactor tests for TPU Accelerator ( #9718 )
...
Co-authored-by: tchaton <thomas@grid.ai>
2021-10-14 19:45:15 +00:00
Danielle Pintz
940b910d27
[2/4] Add DeviceStatsMonitor callback ( #9712 )
...
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Kaushik B <kaushikbokka@gmail.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-10-13 18:29:36 +00:00
edwardpwtsoi
7c6efbc8a8
Resolved wrong mv usage for extracted directory ( #9678 )
...
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-10-05 12:56:33 +00:00
Jirka Borovec
0e6ee9c39d
CI: add mdformat ( #8673 )
...
* add mdformat
* exclude chlog
* fix ***
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-03 18:19:09 +00:00
Jirka Borovec
66cc505339
update NGC ( #8652 )
...
* update NGC
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-08-02 16:05:36 +00:00
Jirka Borovec
abbcfa1ab7
fix CI for PT 1.10 ( #8526 )
...
* fix CI for PT 1.10
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-07-23 19:24:31 +02:00
thomas chaton
8d0df6fad2
[Feat] Improve TPU CI ( #6078 )
...
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* i
* update
* update ci
* i
* i
* i
* i
2021-07-19 19:43:21 +05:30
Jirka Borovec
74a09a23f1
CI: support PT 1.10 ( #8133 )
...
* prepare PT 1.10
* dockers
* fixes
* readme
2021-07-14 18:04:33 +03:00
Carlos Mocholí
6ce77a102b
Set minimum PyTorch version to 1.6 ( #8288 )
...
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2021-07-13 17:12:49 +00:00
Jirka Borovec
ed6d4baea2
ngc ( #8242 )
2021-07-02 13:12:45 +01:00
Kaushik B
2f3c65e57b
XLA Profiler integration ( #8014 )
2021-06-29 00:58:05 +05:30
Sean Naren
f7459f5328
DeepSpeed Infinity Update ( #7234 )
...
* Update configs to match latest API
* Ensure we move the entire model to device before configure optimizer is called
* Add missing param
* Expose parameters
* Update references, drop local rank as it's now inferred from the environment variable
* Fix ref
* Force install deepspeed 0.3.16
* Add guard for init
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Revert type checking
* Install master for CI for testing purposes
* Update CI
* Fix tests
* Add check
* Update versions
* Set precision
* Fix
* See if i can force upgrade
* Attempt to fix
* Drop
* Add changelog
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-06-14 16:38:28 +00:00
Jirka Borovec
7b531ac7ac
Fix NVIDIA docker versions ( #7834 )
2021-06-06 23:56:27 +02:00
Jirka Borovec
9a001fea22
update NGC docker ( #7787 )
2021-06-01 12:11:29 +02:00
Tomy Hsieh
037a71b156
Update README.md ( #7717 )
2021-05-26 12:58:11 +02:00
Kaushik B
2c10ecc232
MAINTAINER has been deprecated ( #7683 )
2021-05-25 00:01:31 +05:30
Jirka Borovec
6e56f56aa1
docker use $(nproc) ( #7606 )
...
* docker use $(nproc)
* Update typo
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-05-19 21:48:14 +02:00
Jirka Borovec
298f9e5c2d
Prune deprecated utils modules ( #7503 )
...
* argparse_utils
* model_utils
* warning_utils
* xla_device_utils
* chlog
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-13 07:24:42 +00:00
Jirka Borovec
db54b30776
Update README to 1.3 ( #7489 )
2021-05-12 13:36:52 +02:00
Louis Taylor
2b7e65b747
Add base IPU dockerfiles ( #7252 )
2021-05-07 12:07:29 +00:00
Jirka Borovec
1a27c12b26
update ngc for 1.3 ( #7414 )
2021-05-07 13:13:54 +02:00
Jirka Borovec
626ef08694
enable Dockers for PT 1.9 ( #7363 )
...
* enable PT 1.9
* fix versions
* args
* fix
2021-05-05 14:26:22 +02:00
Carlos Mocholí
c6a171b776
Fix requirements/adjust_versions.py ( #7149 )
...
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-04 01:06:28 +02:00
Adrian Wälchli
7636d422fa
Update DeepSpeed version requirement in Dockerfile ( #7326 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-03 20:21:19 +02:00
Jirka Borovec
a153c15c90
Docker/nvidia ( #7109 )
...
* version check
* ...
2021-04-27 20:29:49 +01:00
Sean Naren
8439aead66
Update FairScale on CI ( #7017 )
...
* Try updating CI to latest fairscale
* Update availability of imports.py
* Remove some of the fairscale custom ci stuff
* Update grad scaler within the new process as reference is incorrect for spawn
* Remove fairscale from mocks
* Install fairscale 0.3.4 into the base container, remove from extra.txt
* Update docs/source/conf.py
* Fix import issues
* Mock fairscale for docs
* Fix DeepSpeed and FairScale to specific versions
* Swap back to greater than
* extras
* Revert "extras"
This reverts commit 7353479f
* ci
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-23 12:37:00 +01:00
Jirka Borovec
1e4bc69a16
Ban `tensorboard==2.5.0` and `deepspeed==0.3.15` ( #7159 )
...
* ban TB 2.5
* note
* push
* Ban tb==2.5.0 and deepspeed==0.3.15
* Fix pip command
* pull
* up
* up
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-22 11:08:21 -04:00
Sean Naren
5d8610955a
Fix `apex` version in Docker due to broken upstream ( #7146 )
...
* Set Apex commit before introduction of new MLP extensions
* Refactor install command
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-21 23:58:55 +01:00
Jirka Borovec
da1ac3a530
update docker base on PT 1.7 ( #6931 )
...
* update docker base on PT 1.7
* fix path
2021-04-13 10:06:06 +01:00
Sean Naren
b46cc557ef
[Feat] DeepSpeed single file saving ( #6900 )
...
* Add single checkpoint capability
* Fix checkpointing in test, few cleanups
* Add comment
* Change restore logic
* Move vars around, add better explanation, make todo align with DeepSpeed team
* Fix checkpointing
* Remove deepspeed from extra, install in Dockerfile
* push
* pull
* Split to two tests to see if it fixes Deepspeed error
* Add comment
2021-04-12 22:44:09 +00:00
thomas chaton
1302766f83
DeepSpeed ZeRO Update ( #6546 )
...
* Add context to call hook to handle all modules defined within the hook
* Expose some additional parameters
* Added docs, exposed parameters
* Make sure we only configure if necessary
* Setup activation checkpointing regardless, saves the user having to do it manually
* Add some tests that fail currently
* update
* update
* update
* add tests
* change docstring
* resolve accumulate_grad_batches
* resolve flake8
* Update DeepSpeed to use latest version, add some comments
* add metrics
* update
* Small formatting fixes, clean up some code
* Few cleanups
* No need for default state
* Fix tests, add some boilerplate that should move eventually
* Add hook removal
* Add a context manager to handle hook
* Small naming cleanup
* wip
* move save_checkpoint responsibility to accelerator
* resolve flake8
* add BC
* Change recommended scale to 16
* resolve flake8
* update test
* update install
* update
* update test
* update
* update
* update test
* resolve flake8
* update
* update
* update on comments
* Push
* pull
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update
* Apply suggestions from code review
* Swap to using world size defined by plugin
* update
* update todo
* Remove deepspeed from extra, keep it in the base cuda docker install
* Push
* pull
* update
* update
* update
* update
* Minor changes
* duplicate
* format
* format2
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 13:39:02 -04:00
Jirka Borovec
dcf6e4e310
remake nvidia docker ( #6686 )
...
* use latest
* remake
* examples
2021-03-29 09:39:06 +01:00
Jirka Borovec
5780796931
NGC container PoC ( #6187 )
...
* add NVIDIA flows
* push
* pull
* ...
* extras
* ci prune
* fix
* tag
* .
* list
2021-03-20 02:55:46 +05:30
Jirka Borovec
85c8074bee
require: adjust versions ( #6363 )
...
* adjust versions
* release
* manifest
* pep8
* CI
* fix
* build
2021-03-06 14:34:54 +01:00
Sean Naren
8440595b26
[CI] Move DeepSpeed into CUDA image, remove DeepSpeed install from azure ( #6043 )
...
* Move to CUDA image
* Remove deepspeed install as deepspeed now in the cuda image
* Remove path setting, as ninja should be in the container now
2021-02-17 18:51:31 -05:00
Sean Naren
5157ba5509
Add openmpi to our base cuda container for MPI support ( #6026 )
...
* Add openmpi to our base container for DeepSpeed MPI support
* conda
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-17 12:15:49 +00:00
Jirka Borovec
b5d7d08da5
fix nightly releases & readme ( #5922 )
...
* fix nightly releases
* readme
* cuda
* docker
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* revert
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-02-16 13:46:28 -05:00
Adrian Wälchli
a3d4e7c86a
move accelerator legacy tests ( #5948 )
...
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-13 19:42:18 -05:00
Justus Schock
da6dbc8d1d
PoC: Accelerator refactor ( #5743 )
...
* restoring the result from subprocess
* fix queue.get() order for results
* add missing "block_backward_sync" context manager
* add missing "block_backward_sync" context manager
* fix sync_batchnorm
* fix supported gpu-ids for tuple
* fix clip gradients and inf recursion
* accelerator selection: added cluster_environment plugin
* fix torchelastic test
* fix reduce early stopping decision for DDP
* fix tests: callbacks, conversion to lightning optimizer
* fix lightning optimizer does not pickle
* fix setting benchmark and deterministic option
* fix slurm amp test
* fix prepare_data test and determine node_rank
* fix retrieving last path when testing
* remove obsolete plugin argument
* fix test: test_trainer_config
* fix torchscript tests
* fix trainer.model access
* move properties
* fix test_transfer_batch_hook
* fix auto_select_gpus
* fix omegaconf test
* fix test that needs to simulate slurm ddp
* add horovod plugin
* fix test with named arguments
* clean up whitespace
* fix datamodules test
* remove old accelerators
* fix naming
* move old plugins
* move to plugins
* create precision subpackage
* create training_type subpackage
* fix all new import errors
* fix wrong arguments order passed to test
* fix LR finder
* Added sharded training type and amp plugin
* Move clip grad to precision plugin
* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically
* Fix import issue, attempting to fix tests
* Fix initial test
* Reflect hook logic from master, should wrap model after move to device
* Optional state consolidation, since master has optimizers not wrapped
* change attribute for instance test
* reset optimizers
optimizers are not used in main process, so state would be wrong.
* legacy
* imports in accel
* legacy2
* trainer imports
* fix import errors after rebase
* move hook to new setup location
* provide unwrapping logic
* fix trainer callback system
* added ddp2 implementation
* fix imports .legacy
* move plugins
* restore legacy
* drop test.py from root
* add tpu accelerator and plugins
* fixes
* fix lightning optimizer merge
* reset bugreportmodel
* unwrapping
* step routing forward
* model access
* unwrap
* opt
* integrate distrib_type
* sync changes
* sync
* fixes
* add forgotten generators
* add missing logic
* update
* import
* missed imports
* import fixes
* isort
* mv f
* changelog
* format
* move helper to parallel plugin
* d
* add world size
* clean up
* duplicate
* activate ddp_sharded and tpu
* set nvidia flags
* remove unused colab var
* use_tpu <-> on_tpu attrs
* make some ddp_cpu and clusterplugin tests pass
* Ref/accelerator connector (#5742 )
* final cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* connector cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* trainer cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* accelerator cleanup + missing logic in accelerator connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add missing changes to callbacks
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* reflect accelerator changes to lightning module
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* clean cluster envs
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* cleanup plugins
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add broadcasting
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* yapf
* remove plugin connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* plugins
* manual optimization
* update optimizer routing
* add rank to torchelastic
* fix memory mixed precision
* setstate on trainer for pickling in ddp spawn
* add predict method
* add back commented accelerator code
* adapt test for sync_batch_norm to new plugin
* fix deprecated tests
* fix ddp cpu choice when no num_processes are given
* yapf format
* skip a memory test that cannot pass anymore
* fix pickle error in spawn plugin
* x
* avoid
* x
* fix cyclic import in docs build
* add support for sharded
* update typing
* add sharded and sharded_spawn to distributed types
* make unwrap model default
* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel
* update sharded spawn to reflect changes
* update sharded to reflect changes
* Merge 1.1.5 changes
* fix merge
* fix merge
* yapf isort
* fix merge
* yapf isort
* fix indentation in test
* copy over reinit scheduler implementation from dev1.2
* fix apex tracking calls with dev_debugger
* reduce diff to dev1.2, clean up
* fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu
* sort plugin tests legacy/new
* fix error handling for amp on cpu
* fix merge
fix merge
fix merge
* [Feat] Resolve manual_backward (#5837 )
* resolve manual_backward
* resolve flake8
* update
* resolve for ddp_spawn
* resolve flake8
* resolve flake8
* resolve flake8
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* fix tests/accelerator tests on cpu
* [BugFix] Resolve manual optimization (#5852 )
* resolve manual_optimization
* update
* update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856 )
* resolve a bug
* Accelerator refactor sharded rpc (#5854 )
* rpc branch
* merge
* update handling of rpc
* make devices etc. Optional in RPC
* set devices etc. later if necessary
* remove devices from sequential
* make devices optional in rpc
* fix import
* uncomment everything
* fix cluster selection
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* resolve bug
* fix assert in rpc test
* resolve a test
* fix docs compilation
* accelerator refactor - fix for sharded parity test (#5866 )
* fix memory issue with ddp_spawn
* x
x
x
x
x
x
x
x
x
* x
* Remove DDP2 as this does not apply
* Add missing pre optimizer hook to ensure lambda closure is called
* fix apex docstring
* [accelerator][BugFix] Resolve some test for 1 gpu (#5863 )
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* update
* resolve flake8
* update
* update
* update
* update
* update
* all_gather
* update
* make plugins work, add misconfig for RPC
* update
* update
* remove breaking test
* resolve some tests
* resolve flake8
* revert to ddp_spawn
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
* yapf isort
* resolve flake8
* fix apex doctests
* fix apex doctests 2
* resolve docs
* update drone
* clean env
* update
* update
* update
* update
* merge
* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881 )
* Fix RPC related tests, clean out old API, update for new accelerator API
* Move tests out of legacy folder, update paths and names
* Update test_remove_1-4.py
* Expose properties for tpu cores/gpus/num_gpus
* Add root GPU property
* Move properties to properties.py
* move tests that were previously in drone
* Fix root GPU property (#5908 )
* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator
* Add missing tests back
* fix best model path transfer when no checkpoint callback available
* Fix setup hook order [wip] (#5858 )
* Call trainer setup hook before accelerator setup
* Add test case
* add new test
* typo
* fix callback order in test
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* rename ddp sequential -> rpc sequential for special test
* revert
* fix stupid merge problem
* Use property in connector for sampler (#5913 )
* merge the import conflicts
* fix spawning of processes in slurm
* [wip] Fix some bugs for TPU [skip ci] (#5878 )
* fixed for single tpu
* fixed spawn
* fixed spawn
* update
* update
* wip
* resolve bugs
* resolve bug
* update on comment
* removed decorator
* resolve comments
* set to 4
* update
* update
* need cleaning
* update
* update
* update
* resolve flake8
* resolve bugs
* exclude broadcast
* resolve bugs
* change test
* update
* update
* skip if meet fails
* properly raise trace
* update
* add catch
* wrap test
* resolve typo
* update
* typo
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
* resolve some tests
* update
* fix imports
* update
* resolve flake8
* update azure pipeline
* skip a sharded test on cpu that requires a gpu
* resolve tpus
* resolve bug
* resolve flake8
* update
* update utils
* revert permission change on files
* suggestions from carlos
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting changes
* remove incomplete comment
* Update pytorch_lightning/accelerators/__init__.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting change
* add types
* warn 1.7 ddp manual backward only if ddp kwarg unset
* yapf + isort
* pep8 unused imports
* fix cyclic import in docs
* Apply suggestions from code review
* typer in accelerator.py
* typo
* Apply suggestions from code review
* formatting
* update on comments
* update typo
* Update pytorch_lightning/trainer/properties.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update
* suggestion from code review
* suggestion from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-12 15:48:56 -05:00
Jirka Borovec
c2c82dad62
CI: Azure ( #5882 )
...
* add base Azure pipeline
* skip
2021-02-10 04:43:26 -05:00
Jirka Borovec
1ac9164f91
create new Conda images ( #5877 )
...
* create new Conda images
* .
* .
2021-02-09 15:30:48 +00:00
Jirka Borovec
937f11c05b
try fix: Docker with Conda & PT 1.8 ( #5842 )
...
* ci
* ver
* list
* pt
* nk
* ch
* 4.9
2021-02-09 08:22:35 +00:00
tchaton
77be6f6e24
resolve conflicts
...
resolve doc
boring commit
docs
torchvision
tpu
Update dockers/tpu-tests/tpu_test_cases.jsonnet
Update dockers/tpu-tests/tpu_test_cases.jsonnet
2021-02-05 21:43:10 +01:00
Jirka Borovec
a39b382fe1
hotfix for GHA tpu ( #5762 )
...
* -y
* t
* .
* t
2021-02-05 21:43:10 +01:00
Sumanth Ratna
8732475701
Remove unnecessary intermediate layers in base-conda Dockerfile ( #5697 )
...
* [docker][base-conda] Combine ENV+COPY instructions
* [docker][base-cuda] Combine ENV+COPY instructions
* [docker][base-xla] Combine ENV+COPY instructions
* [docker][base-cuda] Fix COPY instruction
* [docker][base-xla] Fix quote in ENV
* [docker][base-xla] Fix $PATH in ENV
* [docker][base-conda] Fix COPY instruction
* chlog
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-05 21:40:40 +01:00
Jirka Borovec
07f24d2438
add nvidia docker image ( #5668 )
...
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-01-29 11:01:03 -05:00
Jirka Borovec
7e2e874d95
Refactor: legacy accelerators and plugins ( #5645 )
...
* tests: legacy
* legacy: accel
* legacy: plug
* fix imports
* mypy
* flake8
2021-01-26 20:04:36 -05:00
Jirka Borovec
9dd04028d5
tests for legacy checkpoints ( #5223 )
...
* wip
* generate
* clean
* tests
* copy
* download
* download
* download
* download
* download
* download
* download
* download
* download
* download
* download
* flake8
* extend
* aws
* extension
* pull
* pull
* pull
* pull
* pull
* pull
* pull
* try
* try
* try
* got it
* Apply suggestions from code review
(cherry picked from commit 72525f0a83)
2021-01-26 14:27:56 +01:00
Jeff Yang
e1a4c2e448
docker: run ci only docker related files are changed ( #5203 )
...
* only run ci on docker related files
* docker related files changed!
* install pytorch along with cudatoolkit
* build docker only on SUN
* conda exit status has been fixed
* reverts back to old conda version
* add more docker related files
* conda env update --name
* create env and install pytorch again
* create env and install pytorch again
* ${PYTORCH_CHANNEL}
* dont update pytorch with conda env update
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update dockers/base-conda/Dockerfile
* Apply suggestions from code review
* remove checks in cron job
* Apply suggestions from code review
* readd #
* readd #
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
(cherry picked from commit cc624358c8)
2021-01-26 14:27:56 +01:00
Jirka Borovec
9be04c1c0b
try to update failing dockers ( #5611 )
2021-01-25 17:10:56 -05:00
Jirka Borovec
7e4d6cbe48
set minimal req. PT 1.4 ( #5418 )
...
* set minimal req. PT 1.4
* chlog
2021-01-12 19:15:35 -05:00
Jirka Borovec
5119013c81
drop install FairScale for TPU ( #5113 )
...
* drop install FairScale for TPU
* typo
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-01-05 09:58:37 +01:00
Lezwon Castelino
12cb9942a1
Tpu save ( #4309 )
...
* convert xla tensor to cpu before save
* move_to_cpu
* updated CHANGELOG.md
* added on_save to accelerators
* if accelerator is not None
* refactors
* change filename to run test
* run test_tpu_backend
* added xla_device_utils to tests
* added xla_device_utils to test
* removed tests
* Revert "added xla_device_utils to test"
This reverts commit 0c9316bb
* fixed pep
* increase timeout and print traceback
* lazy check tpu exists
* increased timeout
removed barrier for tpu during test
reduced epochs
* fixed torch_xla imports
* fix tests
* define xla utils
* fix test
* aval
* chlog
* docs
* aval
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-02 13:05:11 +00:00
Jirka Borovec
2fe1eff85d
drop fairscale for PT <= 1.4 ( #4910 )
...
* drop fairscale for PT <= 1.4
* fix
* Add extra check to remove fairscale from minimal testing if using minimal torch version 1.3
* Update ci_test-full.yml
* Update gym to .3 to see if this fixes examples CI
* Update omegaconf to minimum for hydra v1.0
* Revert "Update gym to .3 to see if this fixes examples CI"
This reverts commit 4221d4b9
* Revert "Update omegaconf to minimum for hydra v1.0"
This reverts commit 4f579217
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2020-11-30 23:19:30 +00:00
Jirka Borovec
597dfa174c
build dockers XLA 1.7 ( #4891 )
...
* build XLA 1.7
* night XLA 1.7
* rename
* use 1.7
* tpu ver
2020-11-29 15:14:19 -04:00
Jirka Borovec
bddc6cd77a
pytest default color ( #4703 )
...
* pytest default color
* time
Co-authored-by: chaton <thomas@grid.ai>
2020-11-18 10:53:44 +00:00
Jirka Borovec
7940ea5aaf
CI: TPU drop install horovod ( #4622 )
...
Co-authored-by: chaton <thomas@grid.ai>
2020-11-13 11:33:52 +01:00
Jirka Borovec
bd6c413829
Conda: PT 1.8 ( #3833 )
...
* PT 1.8
* unfreeze PT
* drop nightly from full
* add PT 1.8 to workflow
* readme table
* cuda
* skip cuda
* test 1.8
* unfreeze torch vision
Co-authored-by: ydcjeff <ydcjeff@outlook.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-11-12 15:03:43 +01:00
Jeff Yang
23719e3c05
[dockers] install nvidia-dali-cudaXXX ( #4532 )
...
* [dockers] install nvidia-dali-cuda100
* Apply suggestions from code review
* build DALI
* build DALI
* build DALI
* dali from source
* dali from source
* use binaries
* qq
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-09 21:18:24 +06:30
Jeff Yang
1d594c5d0c
[docker] Lock cuda version ( #4453 )
...
* lock cuda version
* back to normal
2020-10-31 20:17:07 +06:30
Jeff Yang
0f584faa6b
PyTorch 1.7 Stable support ( #3821 )
...
* prepare for 1.7 support [ci skip]
* tpu [ci skip]
* test run 1.7
* all 1.7, needs to fix tests
* couple with torchvision
* windows try
* remove windows
* 1.7 is here
* on purpose fail [ci skip]
* return [ci skip]
* 1.7 docker
* back to normal [ci skip]
* change to some_val [ci skip]
* add seed [ci skip]
* 4 places [ci skip]
* fail on purpose [ci skip]
* verbose=True [ci skip]
* use filename to track
* use filename to track
* monitor epoch + changelog
* Update tests/checkpointing/test_model_checkpoint.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-10-30 15:42:14 +00:00
Jirka Borovec
ce8abd6255
Drone: use nightly build cuda docker images ( #3658 )
...
* upgrade PT version
* update docker
* docker
* try 1.5
* badge
* fix typo: dor -> for (#3918)
* prune
* prune
* env
* echo
* try
* notes
* env
* env
* env
* notes
* docker
* prune
* maintainer
* CI
* update
* just 1.5
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* docker
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* push
* try
* prune
* CI
* CI
* CI
* CI
Co-authored-by: Klyukin Valeriy <mr.clyukin@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-10-26 10:47:09 +00:00
Jeff Yang
d83c4e4d69
Cache docker builds ( #3659 )
...
* parent faa357648f
author ydcjeff <ydcjeff@outlook.com> 1601049378 +0630
committer ydcjeff <ydcjeff@outlook.com> 1601469495 +0630
cache docker builds
lock horovod at 0.19.5
done [ci skip] [CI SKIP]
use --cache-from [ci skip]
typo and horovod [ci skip]
exclude pt 1.3 py3.8 [ci skip]
conda no cache [ci skip]
fix
* revert
* align with master [ci skip]
* retry
* remove empty continuation lines
* add comment
* fix build-args
2020-10-25 18:46:10 +06:30
chaton
829d90b257
activated color in all pytest runs ( #4254 )
...
* activated color in all pytest runs
* Update .drone.yml
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-10-20 16:38:17 +02:00
Jirka Borovec
d3567c33a6
move base req. to root ( #4219 )
...
* move base req. to root
* check-manifest
* check-manifest
* manifest
* req
2020-10-18 20:40:18 +02:00
Jeff Yang
90929fa433
Fix apt repo issue for docker ( #3823 )
...
* fix docker repo issue
* docker
* docker
* docker
* no cudnn
* no cudnn
* try 16.04
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-10-05 23:18:14 -04:00
Jirka Borovec
1160270882
fix path in CI for release & python version in all dockers & duplicated badges ( #3765 )
...
* typo
* path
* check
* trigger
* fix conda
* pip ver
* fix cuda
* fix XLA
* fix xla
* ci
* docker
* BIULD
* unBIULD
* update
* py 3.8
* apex
* apex
2020-10-02 05:26:21 -04:00
Jirka Borovec
ab508dae0c
run TPU tests with multiple versions ( #3024 )
...
* rename
* multi build
* multi build
* copy
* copy
* copy
* copy
* copy
* copy
* clean
* note
* docker
* formatting
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-09-30 08:36:02 -04:00
Jirka Borovec
a0968e4bdf
fix PT version in CUDA docker images ( #3739 )
...
* upgrade PT version
* update docker
* docker
* try 1.5
* fix docker versions
* old
* badge
2020-09-30 08:33:22 -04:00
Jirka Borovec
a94728c99b
spec Horovod version ( #3661 )
...
* spec Horovod version
* MAKEFLAGS="-j2"
* tests
* CI
* docker
* CI
* docker
2020-09-26 19:30:25 +02:00
Jirka Borovec
0784cf3ab4
dockers nightly ( #3615 )
...
* dockers nightly
* typo
* Apply suggestions from code review
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-09-25 15:58:01 +02:00
Jeff Yang
a2120130ed
Lightning docker image based on base-cuda ( #3637 )
...
* use lightning CI docker
* exclude py3.8 and torch1.3
* torch 1.7
* mergify
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-09-24 23:14:15 +02:00
Jirka Borovec
37a59be21b
build more docker configs ( #3533 )
...
* update build cases
* list
* matrix
* matrix
* builds
* docker
* -j1
* -q
* -q
* sep
* docker
* docker
* mergify
* -j1
* -j1
* horovod
* copy
2020-09-23 01:41:35 +02:00
Jeff Yang
8be79a9a96
stable, dev PyTorch in Dockerfile and conda gh actions ( #3074 )
...
* dockerfile and actions file
* dockerfile and actions file
* added pytorch conda cpu nightly
* added pytorch conda cpu nightly
* recopy base reqs
* gh action `include` torch nightly
* add pytorch nightly & conda gh badge
* rebase
* fix horovod
* proposal refactor
* Update .github/workflows/ci_pt-conda.yml
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update .github/workflows/ci_pt-conda.yml
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update
* update
* fix cmd
* filled &&
* fix
* add -y
* torchvision >0.7 allowed
* explicitly install torchvision
* use HOROVOD_GPU_OPERATIONS env variable
* CI
* skip 1.7
* table
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-09-17 20:30:39 +02:00
Jirka Borovec
cbc4f6f8a4
add CI for building dockers ( #3383 )
...
* rename
* fix badges
* add docker build
* mergify
* update
* env
* ci
* times
* CI
* name
* comment
2020-09-10 18:38:29 -04:00
Jirka Borovec
9f2b29a7cd
build XLA with py3.6 ( #2863 )
...
* build py3.6
* info
* conda
* update
* version
* version
* builds
* builds
* builds
* builds
* builds
2020-08-15 15:39:44 -04:00
Jirka Borovec
a6e7aa7796
allow using apex with any PT version ( #2865 )
...
* wip
* setup
* type
* name
* wip
* docs
* imports
* fix if
* fix if
* use_amp
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* fix tests
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* fix tests
* todos
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-08 11:07:32 +02:00
Jirka Borovec
448be60701
update GPU to PT 1.5 ( #2779 )
...
* update gpu PT 1.6
* fix docker
* use PT 1.5
* Update tests/install_AMP.sh
Co-authored-by: Nathan Raw <nxr9266@g.rit.edu>
2020-08-02 08:14:53 -04:00
Jirka Borovec
bc7a08fbe0
test dockers & add AMP in pt-1.6 ( #1584 )
...
* exist images
* names
* images
* args
* pt 1.6 dev
* circleci
* update
* refactor
* build
* fix
* MKL
2020-07-31 08:23:13 -04:00
zcain117
d0b8e850a4
integrate with CircleCI ( #2486 )
...
* add circleCI
* wip
* CircleCI setup that worked on my private repo. Use a working pytorch-lightning commit
* Fix the orb imports
* Update circleci header comment
* Try to pull the GITHUB_REF from the CI_PULL_REQUEST
* Use null instead of space for 'sed'
* Add TODO for codecov
* Remove echo of GKE_CLUSTER since it will be redacted by CircleCI.
* Try running codecov upload.
* Try using codecov orb
* Use pip install codecov
* Use codecov orb again since it should be approved
* dockers/tpu-tests/Dockerfile
* action
* suggestions
* drop suggestion
* suggestion
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-07-23 12:13:10 -04:00
Jirka Borovec
fb85d493d0
use XLA base image for TPU testing ( #2536 )
...
* drop py3.6
* use base image
* typo
* skip extra
* drop cache
2020-07-07 07:05:17 -04:00
Jirka Borovec
977df6ed31
Docker: building XLA base image ( #2494 )
...
* refactor
* add TPU base
* wip
* builds
* typo
* extras
* simple
* unzip
* rename
2020-07-06 14:21:36 -04:00