Kaushik B
2f3c65e57b
XLA Profiler integration ( #8014 )
2021-06-29 00:58:05 +05:30
Sean Naren
f7459f5328
DeepSpeed Infinity Update ( #7234 )
...
* Update configs to match latest API
* Ensure we move the entire model to device before configure optimizer is called
* Add missing param
* Expose parameters
* Update references, drop local rank as it's now inferred from the environment variable
* Fix ref
* Force install deepspeed 0.3.16
* Add guard for init
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Revert type checking
* Install master for CI for testing purposes
* Update CI
* Fix tests
* Add check
* Update versions
* Set precision
* Fix
* See if I can force upgrade
* Attempt to fix
* Drop
* Add changelog
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-06-14 16:38:28 +00:00
Jirka Borovec
7b531ac7ac
Fix NVIDIA docker versions ( #7834 )
2021-06-06 23:56:27 +02:00
Jirka Borovec
9a001fea22
update NGC docker ( #7787 )
2021-06-01 12:11:29 +02:00
Tomy Hsieh
037a71b156
Update README.md ( #7717 )
2021-05-26 12:58:11 +02:00
Kaushik B
2c10ecc232
MAINTAINER has been deprecated ( #7683 )
2021-05-25 00:01:31 +05:30
Jirka Borovec
6e56f56aa1
docker use $(nproc) ( #7606 )
...
* docker use $(nproc)
* Update typo
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-05-19 21:48:14 +02:00
Jirka Borovec
298f9e5c2d
Prune deprecated utils modules ( #7503 )
...
* argparse_utils
* model_utils
* warning_utils
* xla_device_utils
* chlog
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2021-05-13 07:24:42 +00:00
Jirka Borovec
db54b30776
Update README to 1.3 ( #7489 )
2021-05-12 13:36:52 +02:00
Louis Taylor
2b7e65b747
Add base IPU dockerfiles ( #7252 )
2021-05-07 12:07:29 +00:00
Jirka Borovec
1a27c12b26
update ngc for 1.3 ( #7414 )
2021-05-07 13:13:54 +02:00
Jirka Borovec
626ef08694
enable Dockers for PT 1.9 ( #7363 )
...
* enable PT 1.9
* fix versions
* args
* fix
2021-05-05 14:26:22 +02:00
Carlos Mocholí
c6a171b776
Fix requirements/adjust_versions.py ( #7149 )
...
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-05-04 01:06:28 +02:00
Adrian Wälchli
7636d422fa
Update DeepSpeed version requirement in Dockerfile ( #7326 )
...
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-05-03 20:21:19 +02:00
Jirka Borovec
a153c15c90
Docker/nvidia ( #7109 )
...
* version check
* ...
2021-04-27 20:29:49 +01:00
Sean Naren
8439aead66
Update FairScale on CI ( #7017 )
...
* Try updating CI to latest fairscale
* Update availability of imports.py
* Remove some of the fairscale custom ci stuff
* Update grad scaler within the new process as reference is incorrect for spawn
* Remove fairscale from mocks
* Install fairscale 0.3.4 into the base container, remove from extra.txt
* Update docs/source/conf.py
* Fix import issues
* Mock fairscale for docs
* Fix DeepSpeed and FairScale to specific versions
* Swap back to greater than
* extras
* Revert "extras"
This reverts commit 7353479f
* ci
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: jirka <jirka.borovec@seznam.cz>
2021-04-23 12:37:00 +01:00
Jirka Borovec
1e4bc69a16
Ban `tensorboard==2.5.0` and `deepspeed==0.3.15` ( #7159 )
...
* ban TB 2.5
* note
* push
* Ban tb==2.5.0 and deepspeed==0.3.15
* Fix pip command
* pull
* up
* up
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-22 11:08:21 -04:00
Sean Naren
5d8610955a
Fix `apex` version in Docker due to broken upstream ( #7146 )
...
* Set Apex commit before introduction of new MLP extensions
* Refactor install command
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-04-21 23:58:55 +01:00
Jirka Borovec
da1ac3a530
update docker base on PT 1.7 ( #6931 )
...
* update docker base on PT 1.7
* fix path
2021-04-13 10:06:06 +01:00
Sean Naren
b46cc557ef
[Feat] DeepSpeed single file saving ( #6900 )
...
* Add single checkpoint capability
* Fix checkpointing in test, few cleanups
* Add comment
* Change restore logic
* Move vars around, add better explanation, make todo align with DeepSpeed team
* Fix checkpointing
* Remove deepspeed from extra, install in Dockerfile
* push
* pull
* Split to two tests to see if it fixes Deepspeed error
* Add comment
2021-04-12 22:44:09 +00:00
thomas chaton
1302766f83
DeepSpeed ZeRO Update ( #6546 )
...
* Add context to call hook to handle all modules defined within the hook
* Expose some additional parameters
* Added docs, exposed parameters
* Make sure we only configure if necessary
* Setup activation checkpointing regardless, saves the user having to do it manually
* Add some tests that fail currently
* update
* update
* update
* add tests
* change docstring
* resolve accumulate_grad_batches
* resolve flake8
* Update DeepSpeed to use latest version, add some comments
* add metrics
* update
* Small formatting fixes, clean up some code
* Few cleanups
* No need for default state
* Fix tests, add some boilerplate that should move eventually
* Add hook removal
* Add a context manager to handle hook
* Small naming cleanup
* wip
* move save_checkpoint responsibility to accelerator
* resolve flake8
* add BC
* Change recommended scale to 16
* resolve flake8
* update test
* update install
* update
* update test
* update
* update
* update test
* resolve flake8
* update
* update
* update on comments
* Push
* pull
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Update pytorch_lightning/plugins/training_type/deepspeed.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update
* Apply suggestions from code review
* Swap to using world size defined by plugin
* update
* update todo
* Remove deepspeed from extra, keep it in the base cuda docker install
* Push
* pull
* update
* update
* update
* update
* Minor changes
* duplicate
* format
* format2
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-30 13:39:02 -04:00
Jirka Borovec
dcf6e4e310
remake nvidia docker ( #6686 )
...
* use latest
* remake
* examples
2021-03-29 09:39:06 +01:00
Jirka Borovec
5780796931
NGC container PoC ( #6187 )
...
* add NVIDIA flows
* push
* pull
* ...
* extras
* ci prune
* fix
* tag
* .
* list
2021-03-20 02:55:46 +05:30
Jirka Borovec
85c8074bee
require: adjust versions ( #6363 )
...
* adjust versions
* release
* manifest
* pep8
* CI
* fix
* build
2021-03-06 14:34:54 +01:00
Sean Naren
8440595b26
[CI] Move DeepSpeed into CUDA image, remove DeepSpeed install from azure ( #6043 )
...
* Move to CUDA image
* Remove deepspeed install as deepspeed now in the cuda image
* Remove path setting, as ninja should be in the container now
2021-02-17 18:51:31 -05:00
Sean Naren
5157ba5509
Add openmpi to our base cuda container for MPI support ( #6026 )
...
* Add openmpi to our base container for DeepSpeed MPI support
* conda
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-17 12:15:49 +00:00
Jirka Borovec
b5d7d08da5
fix nightly releases & readme ( #5922 )
...
* fix nightly releases
* readme
* cuda
* docker
* Apply suggestions from code review
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* revert
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-02-16 13:46:28 -05:00
Adrian Wälchli
a3d4e7c86a
move accelerator legacy tests ( #5948 )
...
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-13 19:42:18 -05:00
Justus Schock
da6dbc8d1d
PoC: Accelerator refactor ( #5743 )
...
* restoring the result from subprocess
* fix queue.get() order for results
* add missing "block_backward_sync" context manager
* fix sync_batchnorm
* fix supported gpu-ids for tuple
* fix clip gradients and inf recursion
* accelerator selection: added cluster_environment plugin
* fix torchelastic test
* fix reduce early stopping decision for DDP
* fix tests: callbacks, conversion to lightning optimizer
* fix lightning optimizer does not pickle
* fix setting benchmark and deterministic option
* fix slurm amp test
* fix prepare_data test and determine node_rank
* fix retrieving last path when testing
* remove obsolete plugin argument
* fix test: test_trainer_config
* fix torchscript tests
* fix trainer.model access
* move properties
* fix test_transfer_batch_hook
* fix auto_select_gpus
* fix omegaconf test
* fix test that needs to simulate slurm ddp
* add horovod plugin
* fix test with named arguments
* clean up whitespace
* fix datamodules test
* remove old accelerators
* fix naming
* move old plugins
* move to plugins
* create precision subpackage
* create training_type subpackage
* fix all new import errors
* fix wrong arguments order passed to test
* fix LR finder
* Added sharded training type and amp plugin
* Move clip grad to precision plugin
* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically
* Fix import issue, attempting to fix tests
* Fix initial test
* Reflect hook logic from master, should wrap model after move to device
* Optional state consolidation, since master has optimizers not wrapped
* change attribute for instance test
* reset optimizers
optimizers are not used in main process, so state would be wrong.
* legacy
* imports in accel
* legacy2
* trainer imports
* fix import errors after rebase
* move hook to new setup location
* provide unwrapping logic
* fix trainer callback system
* added ddp2 implementation
* fix imports .legacy
* move plugins
* restore legacy
* drop test.py from root
* add tpu accelerator and plugins
* fixes
* fix lightning optimizer merge
* reset bugreportmodel
* unwrapping
* step routing forward
* model access
* unwrap
* opt
* integrate distrib_type
* sync changes
* sync
* fixes
* add forgotten generators
* add missing logic
* update
* import
* missed imports
* import fixes
* isort
* mv f
* changelog
* format
* move helper to parallel plugin
* d
* add world size
* clean up
* duplicate
* activate ddp_sharded and tpu
* set nvidia flags
* remove unused colab var
* use_tpu <-> on_tpu attrs
* make some ddp_cpu and clusterplugin tests pass
* Ref/accelerator connector (#5742 )
* final cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* connector cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* trainer cleanup
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* accelerator cleanup + missing logic in accelerator connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add missing changes to callbacks
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* reflect accelerator changes to lightning module
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* clean cluster envs
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* cleanup plugins
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* add broadcasting
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* yapf
* remove plugin connector
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* plugins
* manual optimization
* update optimizer routing
* add rank to torchelastic
* fix memory mixed precision
* setstate on trainer for pickling in ddp spawn
* add predict method
* add back commented accelerator code
* adapt test for sync_batch_norm to new plugin
* fix deprecated tests
* fix ddp cpu choice when no num_processes are given
* yapf format
* skip a memory test that cannot pass anymore
* fix pickle error in spawn plugin
* x
* avoid
* x
* fix cyclic import in docs build
* add support for sharded
* update typing
* add sharded and sharded_spawn to distributed types
* make unwrap model default
* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel
* update sharded spawn to reflect changes
* update sharded to reflect changes
* Merge 1.1.5 changes
* fix merge
* fix merge
* yapf isort
* fix merge
* yapf isort
* fix indentation in test
* copy over reinit scheduler implementation from dev1.2
* fix apex tracking calls with dev_debugger
* reduce diff to dev1.2, clean up
* fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu
* sort plugin tests legacy/new
* fix error handling for amp on cpu
* fix merge
fix merge
fix merge
* [Feat] Resolve manual_backward (#5837 )
* resolve manual_backward
* resolve flake8
* update
* resolve for ddp_spawn
* resolve flake8
* resolve flake8
* resolve flake8
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* fix tests/accelerator tests on cpu
* [BugFix] Resolve manual optimization (#5852 )
* resolve manual_optimization
* update
* update
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856 )
* resolve a bug
* Accelerator refactor sharded rpc (#5854 )
* rpc branch
* merge
* update handling of rpc
* make devices etc. Optional in RPC
* set devices etc. later if necessary
* remove devices from sequential
* make devices optional in rpc
* fix import
* uncomment everything
* fix cluster selection
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
* resolve bug
* fix assert in rpc test
* resolve a test
* fix docs compilation
* accelerator refactor - fix for sharded parity test (#5866 )
* fix memory issue with ddp_spawn
* x
x
x
x
x
x
x
x
x
* x
* Remove DDP2 as this does not apply
* Add missing pre optimizer hook to ensure lambda closure is called
* fix apex docstring
* [accelerator][BugFix] Resolve some test for 1 gpu (#5863 )
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* update
* update
* revert init
* resolve a bug
* update
* resolve flake8
* update
* update
* update
* revert init
* update
* resolve flake8
* update
* update
* update
* update
* update
* all_gather
* update
* make plugins work, add misconfig for RPC
* update
* update
* remove breaking test
* resolve some tests
* resolve flake8
* revert to ddp_spawn
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
* yapf isort
* resolve flake8
* fix apex doctests
* fix apex doctests 2
* resolve docs
* update drone
* clean env
* update
* update
* update
* update
* merge
* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881 )
* Fix RPC related tests, clean out old API, update for new accelerator API
* Move tests out of legacy folder, update paths and names
* Update test_remove_1-4.py
* Expose properties for tpu cores/gpus/num_gpus
* Add root GPU property
* Move properties to properties.py
* move tests that were previously in drone
* Fix root GPU property (#5908 )
* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator
* Add missing tests back
* fix best model path transfer when no checkpoint callback available
* Fix setup hook order [wip] (#5858 )
* Call trainer setup hook before accelerator setup
* Add test case
* add new test
* typo
* fix callback order in test
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* rename ddp sequential -> rpc sequential for special test
* revert
* fix stupid merge problem
* Use property in connector for sampler (#5913 )
* merge the import conflicts
* fix spawning of processes in slurm
* [wip] Fix some bugs for TPU [skip ci] (#5878 )
* fixed for single tpu
* fixed spawn
* fixed spawn
* update
* update
* wip
* resolve bugs
* resolve bug
* update on comment
* removed decorator
* resolve comments
* set to 4
* update
* update
* need cleaning
* update
* update
* update
* resolve flake8
* resolve bugs
* exclude broadcast
* resolve bugs
* change test
* update
* update
* skip if meet fails
* properly raise trace
* update
* add catch
* wrap test
* resolve typo
* update
* typo
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
* resolve some tests
* update
* fix imports
* update
* resolve flake8
* update azure pipeline
* skip a sharded test on cpu that requires a gpu
* resolve tpus
* resolve bug
* resolve flake8
* update
* update utils
* revert permission change on files
* suggestions from carlos
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting changes
* remove incomplete comment
* Update pytorch_lightning/accelerators/__init__.py
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove unrelated formatting change
* add types
* warn 1.7 ddp manual backward only if ddp kwarg unset
* yapf + isort
* pep8 unused imports
* fix cyclic import in docs
* Apply suggestions from code review
* typer in accelerator.py
* typo
* Apply suggestions from code review
* formatting
* update on comments
* update typo
* Update pytorch_lightning/trainer/properties.py
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update
* suggestion from code review
* suggestion from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Lezwon Castelino <lezwon@gmail.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-02-12 15:48:56 -05:00
Jirka Borovec
c2c82dad62
CI: Azure ( #5882 )
...
* add base Azure pipeline
* skip
2021-02-10 04:43:26 -05:00
Jirka Borovec
1ac9164f91
create new Conda images ( #5877 )
...
* create new Conda images
* .
* .
2021-02-09 15:30:48 +00:00
Jirka Borovec
937f11c05b
try fix: Docker with Conda & PT 1.8 ( #5842 )
...
* ci
* ver
* list
* pt
* nk
* ch
* 4.9
2021-02-09 08:22:35 +00:00
tchaton
77be6f6e24
resolve conflicts
...
resolve doc
boring commit
docs
torchvision
tpu
Update dockers/tpu-tests/tpu_test_cases.jsonnet
Update dockers/tpu-tests/tpu_test_cases.jsonnet
2021-02-05 21:43:10 +01:00
Jirka Borovec
a39b382fe1
hotfix for GHA tpu ( #5762 )
...
* -y
* t
* .
* t
2021-02-05 21:43:10 +01:00
Sumanth Ratna
8732475701
Remove unnecessary intermediate layers in base-conda Dockerfile ( #5697 )
...
* [docker][base-conda] Combine ENV+COPY instructions
* [docker][base-cuda] Combine ENV+COPY instructions
* [docker][base-xla] Combine ENV+COPY instructions
* [docker][base-cuda] Fix COPY instruction
* [docker][base-xla] Fix quote in ENV
* [docker][base-xla] Fix $PATH in ENV
* [docker][base-conda] Fix COPY instruction
* chlog
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-02-05 21:40:40 +01:00
Jirka Borovec
07f24d2438
add nvidia docker image ( #5668 )
...
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-01-29 11:01:03 -05:00
Jirka Borovec
7e2e874d95
Refactor: legacy accelerators and plugins ( #5645 )
...
* tests: legacy
* legacy: accel
* legacy: plug
* fix imports
* mypy
* flake8
2021-01-26 20:04:36 -05:00
Jirka Borovec
9dd04028d5
tests for legacy checkpoints ( #5223 )
...
* wip
* generate
* clean
* tests
* copy
* download
* download
* download
* download
* download
* download
* download
* download
* download
* download
* download
* flake8
* extend
* aws
* extension
* pull
* pull
* pull
* pull
* pull
* pull
* pull
* try
* try
* try
* got it
* Apply suggestions from code review
(cherry picked from commit 72525f0a83)
2021-01-26 14:27:56 +01:00
Jeff Yang
e1a4c2e448
docker: run CI only when docker-related files are changed ( #5203 )
...
* only run ci on docker related files
* docker related files changed!
* install pytorch along with cudatoolkit
* build docker only on SUN
* conda exit status has been fixed
* reverts back to old conda version
* add more docker related files
* conda env update --name
* create env and install pytorch again
* create env and install pytorch again
* ${PYTORCH_CHANNEL}
* dont update pytorch with conda env update
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update dockers/base-conda/Dockerfile
* Apply suggestions from code review
* remove checks in cron job
* Apply suggestions from code review
* readd #
* readd #
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
(cherry picked from commit cc624358c8)
2021-01-26 14:27:56 +01:00
Jirka Borovec
9be04c1c0b
try to update failing dockers ( #5611 )
2021-01-25 17:10:56 -05:00
Jirka Borovec
7e4d6cbe48
set minimal req. PT 1.4 ( #5418 )
...
* set minimal req. PT 1.4
* chlog
2021-01-12 19:15:35 -05:00
Jirka Borovec
5119013c81
drop install FairScale for TPU ( #5113 )
...
* drop install FairScale for TPU
* typo
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2021-01-05 09:58:37 +01:00
Lezwon Castelino
12cb9942a1
Tpu save ( #4309 )
...
* convert xla tensor to cpu before save
* move_to_cpu
* updated CHANGELOG.md
* added on_save to accelerators
* if accelerator is not None
* refactors
* change filename to run test
* run test_tpu_backend
* added xla_device_utils to tests
* added xla_device_utils to test
* removed tests
* Revert "added xla_device_utils to test"
This reverts commit 0c9316bb
* fixed pep
* increase timeout and print traceback
* lazy check tpu exists
* increased timeout
removed barrier for tpu during test
reduced epochs
* fixed torch_xla imports
* fix tests
* define xla utils
* fix test
* aval
* chlog
* docs
* aval
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-12-02 13:05:11 +00:00
Jirka Borovec
2fe1eff85d
drop fairscale for PT <= 1.4 ( #4910 )
...
* drop fairscale for PT <= 1.4
* fix
* Add extra check to remove fairscale from minimal testing if using minimal torch version 1.3
* Update ci_test-full.yml
* Update gym to .3 to see if this fixes examples CI
* Update omegaconf to minimum for hydra v1.0
* Revert "Update gym to .3 to see if this fixes examples CI"
This reverts commit 4221d4b9
* Revert "Update omegaconf to minimum for hydra v1.0"
This reverts commit 4f579217
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: SeanNaren <sean@grid.ai>
2020-11-30 23:19:30 +00:00
Jirka Borovec
597dfa174c
build dockers XLA 1.7 ( #4891 )
...
* build XLA 1.7
* night XLA 1.7
* rename
* use 1.7
* tpu ver
2020-11-29 15:14:19 -04:00
Jirka Borovec
bddc6cd77a
pytest default color ( #4703 )
...
* pytest default color
* time
Co-authored-by: chaton <thomas@grid.ai>
2020-11-18 10:53:44 +00:00
Jirka Borovec
7940ea5aaf
CI: TPU drop install horovod ( #4622 )
...
Co-authored-by: chaton <thomas@grid.ai>
2020-11-13 11:33:52 +01:00
Jirka Borovec
bd6c413829
Conda: PT 1.8 ( #3833 )
...
* PT 1.8
* unfreeze PT
* drop nightly from full
* add PT 1.8 to workflow
* readme table
* cuda
* skip cuda
* test 1.8
* unfreeze torch vision
Co-authored-by: ydcjeff <ydcjeff@outlook.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-11-12 15:03:43 +01:00
Jeff Yang
23719e3c05
[dockers] install nvidia-dali-cudaXXX ( #4532 )
...
* [dockers] install nvidia-dali-cuda100
* Apply suggestions from code review
* build DALI
* build DALI
* build DALI
* dali from source
* dali from source
* use binaries
* qq
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-09 21:18:24 +06:30
Jeff Yang
1d594c5d0c
[docker] Lock cuda version ( #4453 )
...
* lock cuda version
* back to normal
2020-10-31 20:17:07 +06:30
Jeff Yang
0f584faa6b
PyTorch 1.7 Stable support ( #3821 )
...
* prepare for 1.7 support [ci skip]
* tpu [ci skip]
* test run 1.7
* all 1.7, needs to fix tests
* couple with torchvision
* windows try
* remove windows
* 1.7 is here
* on purpose fail [ci skip]
* return [ci skip]
* 1.7 docker
* back to normal [ci skip]
* change to some_val [ci skip]
* add seed [ci skip]
* 4 places [ci skip]
* fail on purpose [ci skip]
* verbose=True [ci skip]
* use filename to track
* use filename to track
* monitor epoch + changelog
* Update tests/checkpointing/test_model_checkpoint.py
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-10-30 15:42:14 +00:00
Jirka Borovec
ce8abd6255
Drone: use nightly build cuda docker images ( #3658 )
...
* upgrade PT version
* update docker
* docker
* try 1.5
* badge
* fix typo: dor -> for (#3918 )
* prune
* prune
* env
* echo
* try
* notes
* env
* env
* env
* notes
* docker
* prune
* maintainer
* CI
* update
* just 1.5
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* docker
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* CI
* push
* try
* prune
* CI
* CI
* CI
* CI
Co-authored-by: Klyukin Valeriy <mr.clyukin@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-10-26 10:47:09 +00:00
Jeff Yang
d83c4e4d69
Cache docker builds ( #3659 )
...
* parent faa357648f
author ydcjeff <ydcjeff@outlook.com> 1601049378 +0630
committer ydcjeff <ydcjeff@outlook.com> 1601469495 +0630
cache docker builds
lock horovod at 0.19.5
done [ci skip] [CI SKIP]
use --cache-from [ci skip]
typo and horovod [ci skip]
exclude pt 1.3 py3.8 [ci skip]
conda no cache [ci skip]
fix
* revert
* align with master [ci skip]
* retry
* remove empty continuation lines
* add comment
* fix build-args
2020-10-25 18:46:10 +06:30
chaton
829d90b257
activated color in all pytest runs ( #4254 )
...
* activated color in all pytest runs
* Update .drone.yml
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-10-20 16:38:17 +02:00
Jirka Borovec
d3567c33a6
move base req. to root ( #4219 )
...
* move base req. to root
* check-manifest
* check-manifest
* manifest
* req
2020-10-18 20:40:18 +02:00
Jeff Yang
90929fa433
Fix apt repo issue for docker ( #3823 )
...
* fix docker repo issue
* docker
* docker
* docker
* no cudnn
* no cudnn
* try 16.04
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-10-05 23:18:14 -04:00
Jirka Borovec
1160270882
fix path in CI for release & python version in all dockers & duplicated badges ( #3765 )
...
* typo
* path
* check
* trigger
* fix conda
* pip ver
* fix cuda
* fix XLA
* fix xla
* ci
* docker
* BIULD
* unBIULD
* update
* py 3.8
* apex
* apex
2020-10-02 05:26:21 -04:00
Jirka Borovec
ab508dae0c
run TPU tests with multiple versions ( #3024 )
...
* rename
* multi build
* multi build
* copy
* copy
* copy
* copy
* copy
* copy
* clean
* note
* docker
* formatting
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-09-30 08:36:02 -04:00
Jirka Borovec
a0968e4bdf
fix PT version in CUDA docker images ( #3739 )
...
* upgrade PT version
* update docker
* docker
* try 1.5
* fix docker versions
* old
* badge
2020-09-30 08:33:22 -04:00
Jirka Borovec
a94728c99b
spec Horovod version ( #3661 )
...
* spec Horovod version
* MAKEFLAGS="-j2"
* tests
* CI
* docker
* CI
* docker
2020-09-26 19:30:25 +02:00
Jirka Borovec
0784cf3ab4
dockers nightly ( #3615 )
...
* dockers nightly
* typo
* Apply suggestions from code review
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-09-25 15:58:01 +02:00
Jeff Yang
a2120130ed
Lightning docker image based on base-cuda ( #3637 )
...
* use lightning CI docker
* exclude py3.8 and torch1.3
* torch 1.7
* mergify
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-09-24 23:14:15 +02:00
Jirka Borovec
37a59be21b
build more docker configs ( #3533 )
...
* update build cases
* list
* matrix
* matrix
* builds
* docker
* -j1
* -q
* -q
* sep
* docker
* docker
* mergify
* -j1
* -j1
* horovod
* copy
2020-09-23 01:41:35 +02:00
Jeff Yang
8be79a9a96
stable, dev PyTorch in Dockerfile and conda gh actions ( #3074 )
...
* dockerfile and actions file
* dockerfile and actions file
* added pytorch conda cpu nightly
* added pytorch conda cpu nightly
* recopy base reqs
* gh action `include` torch nightly
* add pytorch nightly & conda gh badge
* rebase
* fix horovod
* proposal refactor
* Update .github/workflows/ci_pt-conda.yml
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Update .github/workflows/ci_pt-conda.yml
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* update
* update
* fix cmd
* filled &&
* fix
* add -y
* torchvision >0.7 allowed
* explicitly install torchvision
* use HOROVOD_GPU_OPERATIONS env variable
* CI
* skip 1.7
* table
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-09-17 20:30:39 +02:00
Jirka Borovec
cbc4f6f8a4
add CI for building dockers ( #3383 )
...
* rename
* fix badges
* add docker build
* mergify
* update
* env
* ci
* times
* CI
* name
* comment
2020-09-10 18:38:29 -04:00
Jirka Borovec
9f2b29a7cd
build XLA with py3.6 ( #2863 )
...
* build py3.6
* info
* conda
* update
* version
* version
* builds
* builds
* builds
* builds
* builds
2020-08-15 15:39:44 -04:00
Jirka Borovec
a6e7aa7796
allow using apex with any PT version ( #2865 )
...
* wip
* setup
* type
* name
* wip
* docs
* imports
* fix if
* fix if
* use_amp
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* fix tests
* Apply suggestions from code review
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* fix tests
* todos
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-08 11:07:32 +02:00
Jirka Borovec
448be60701
update GPU to PT 1.5 ( #2779 )
...
* update gpu PT 1.6
* fix docker
* use PT 1.5
* Update tests/install_AMP.sh
Co-authored-by: Nathan Raw <nxr9266@g.rit.edu>
2020-08-02 08:14:53 -04:00
Jirka Borovec
bc7a08fbe0
test dockers & add AMP in pt-1.6 ( #1584 )
...
* exist images
* names
* images
* args
* pt 1.6 dev
* circleci
* update
* refactor
* build
* fix
* MKL
2020-07-31 08:23:13 -04:00
zcain117
d0b8e850a4
integrate with CircleCI ( #2486 )
...
* add circleCI
* wip
* CircleCI setup that worked on my private repo. Use a working pytorch-lightning commit
* Fix the orb imports
* Update circleci header comment
* Try to pull the GITHUB_REF from the CI_PULL_REQUEST
* Use null instead of space for 'sed'
* Add TODO for codecov
* Remove echo of GKE_CLUSTER since it will be redacted by CircleCI.
* Try running codecov upload.
* Try using codecov orb
* Use pip install codecov
* Use codecov orb again since it should be approved
* dockers/tpu-tests/Dockerfile
* action
* suggestions
* drop suggestion
* suggestion
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-07-23 12:13:10 -04:00
Jirka Borovec
fb85d493d0
use XLA base image for TPU testing ( #2536 )
...
* drop py3.6
* use base image
* typo
* skip extra
* drop cache
2020-07-07 07:05:17 -04:00
Jirka Borovec
977df6ed31
Docker: building XLA base image ( #2494 )
...
* refactor
* add TPU base
* wip
* builds
* typo
* extras
* simple
* unzip
* rename
2020-07-06 14:21:36 -04:00