* Add `ShardedTensor` support in `LightningModule`
Co-authored-by: Yifu Wang <yifuwang@2012@gmail.com>
Co-authored-by: Sean Naren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
* Skip test due to 'Python bus error'
* Debug NCCL
* Remove NCCL_DEBUG statement
* Revert "Skip test due to 'Python bus error'"
This reverts commit e0a3e8785d.
* fix
* add test
* changelog
* yapf
* patch os environ
* make a special test
* destroy pg
* debug
* revert
* revert
* problematic test
* skip
* try the fixture
* test
* update sensitive test
* update changelog
* remove comment
* update wrong test
* update test name
* parameterization
* Revert "parameterization"
This reverts commit b0542f43f59c5ce66800883b5e2f0c66a97408cc.
* remove conftest
* ignore test
* teardown
* fix merge
* deep speed parameterization
* uncomment test
* update chlog
* update changelog
* split tests
* update test
* update test comments
* unroll test
* unroll test
* unroll test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* increase shm
* sudo
* unroll ipu
* Revert "sudo"
This reverts commit 6cc68c1478.
* Revert "increase shm"
This reverts commit 8c27163483.
* x
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* find guilty test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* POPTORCH_WAIT_FOR_IPU=1
* move test
* redo parameterize for ipu
* de-comment test
* move chlog
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* Update tests/accelerators/test_accelerator_connector.py
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
* no cov
* no cov
* ReduceOp
* group
* reduce_op.sum
* Update sklearns.py
* formatting
* horovod
* Apply suggestions from code review
* horovod
* horovod
* horovod
* horovod
* ci
* print
* ci
* timeout
* timeout
* time
* fix
* distributed cpu
* pipes
* time
* cpu
* spawn
* spawn
* spawn
* tp
* separate
* os
* os
* npm
* Fix load_from_checkpoint() not working with URL on Windows
* Update CHANGELOG
* Update CHANGELOG.md
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>
* Apply suggestions from code review
* fix
* fix meta tags creating empty lines
* pyright
* node
* fix httpserver address
* drop tutils.default_trainer_options
* imports
* Better fix for load_from_checkpoint() not working with absolute path on Windows (#2294)
* Fix load_from_checkpoint() not working with URL on Windows
* Update CHANGELOG
* Update CHANGELOG.md
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>
* drop duplicate
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: airium <airium@outlook.com>
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: AIRIUM <38249940+airium@users.noreply.github.com>
* allow loading checkpoints from urls
* tmpdir_server fixture
* test cases for loading checkpoints from url
* dir => root_dir
* default map_location to None
* test case for resume_from_checkpoint
* changelog
* doc update
* monkeypatch TORCH_HOME to avoid caching
* Use a threading server with random ports so that it is easier to clean up
* test fixes
* pep8 fix
* ThreadingHTTPServer support in 3.6
* pep8 fix
* fix changelog
* separate tests for urls
* typo
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
* Run AMP tests in their own process
With opt_level="O1" (the default), AMP patches many
torch functions, which breaks any tests that run afterwards.
This patch introduces a pytest extension that lets
tests be marked with @pytest.mark.spawn. Marked tests
run in their own process via torch.multiprocessing.spawn,
so the main Python interpreter stays unpatched.
Note that tests using DDP already run AMP in a separate process,
so they don't need this annotation.
* Fix AMP tests
Since AMP defaults to O1 now, DP tests no longer throw exceptions.
Since AMP patches torch functions, CPU inference no longer works.
Skip prediction step for AMP tests.
* typo
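
The isolation idea behind the "Run AMP tests in their own process" entry can be illustrated with a minimal sketch. This is not the pytest extension from this patch: it assumes the standard-library `subprocess` module as a stand-in for `torch.multiprocessing.spawn`, uses a monkey-patched `math.sqrt` as a stand-in for the torch functions that AMP patches, and the helper name `run_isolated` is hypothetical.

```python
import math
import subprocess
import sys

# Child script: monkey-patches math.sqrt, standing in (as an assumption)
# for the way AMP with opt_level="O1" patches torch functions.
CHILD_SCRIPT = """
import math
math.sqrt = lambda x: -1.0  # the "patched" function
print(math.sqrt(4.0))
"""


def run_isolated(script: str) -> str:
    """Run `script` in a fresh interpreter and return its stripped stdout.

    Hypothetical helper: any patching done by the script dies with the
    child process and never reaches the parent interpreter.
    """
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child observes the patched function ...
child_output = run_isolated(CHILD_SCRIPT)

# ... while the parent interpreter still uses the original one.
parent_value = math.sqrt(4.0)
```

Running the patched code in a separate process means later tests in the main interpreter see the original functions, which is the property the spawn marker is meant to provide.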