3447 Commits

Author SHA1 Message Date
William Falcon 1aa9d39506
Eval epoch can now log independently (#3843)
* ref: routed epoch outputs to logger

* ref: routed epoch outputs to logger

* ref: routed epoch outputs to logger

* ref: routed epoch outputs to logger
2020-10-04 13:36:35 -04:00
Jeff Yang b76fc5bae5
use docker for conda CI (#3841)
* use docker in conda CI

* update env if needed

* update with pip

* remove setting pytorch
2020-10-04 13:18:20 -04:00
Adrian Wälchli 1906867fd4
deprecation warning (#3844)
2020-10-04 13:17:09 -04:00
William Falcon 2c21f7d7e2
ref: adding compute environments (2/n) (#3842)
* ref: adding compute environments (2/n)

* ref: adding compute environments (2/n)

* ref: adding compute environments (2/n)

* ref: adding compute environments (2/n)
2020-10-04 08:48:46 -04:00
Rohit Gupta a628d181ee
Fix val_progress_bar total with num_sanity_val_steps (#3751)
* Fix val_progress_bar total with num_sanity_val_steps

* chlog

* Fix val_progress_bar total with num_sanity_val_steps

* move test

* replaced with sanity flag and suggestions
2020-10-04 08:32:18 -04:00
Lezwon Castelino 4da240ea1b
added broadcast option to tpu (#3814)
* added broadcast option to tpu

* add device

* moved tpu broadcast to tpu_backend

* removed Lightning dist

* decode bytes

* pep8 fix

* fix bug

* test for broadcast

* updated changelog
2020-10-04 07:47:33 -04:00
William Falcon 093535d433
ref: adding compute environments (1/n) (#3837)
* ref: adding compute environments (1/n)

* ref: adding compute environments (1/n)

* ref: adding compute environments (1/n)
2020-10-04 07:31:19 -04:00
Daniel Li a3503ce3fd
Explicitly point out where we should set the random seed (#3839)
* Explicitly point out where we should set the random seed

* Update docs/source/multi_gpu.rst

Co-authored-by: Jeff Yang <ydcjeff@outlook.com>

Co-authored-by: Qinru Li <q4li@eng.ucsd.edu>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-10-04 07:30:45 -04:00
William Falcon 1f8ff7c48c
ref: callback system and init ddp (1/n) (#3836)
* refactored callback system and init ddp

* refactored callback system and init ddp

* refactored callback system and init ddp

* refactored callback system and init ddp
2020-10-03 23:39:17 -04:00
ananthsub b8a6408a11
Update trainer.py (#3834)
2020-10-03 22:18:05 -04:00
William Falcon 66aef10239
verified epoch logging (#3830)
* ref: fix epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging
2020-10-03 21:17:24 -04:00
William Falcon 35d1111994
[WIP] ref: decoupled ddp, ddp spawn (finish 3733) (#3819)
* ref: finish #3733

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* Update pytorch_lightning/accelerators/ddp_backend.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* remove deprecated test

* remove deprecated test

* remove deprecated test

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-10-03 14:05:31 -04:00
William Falcon 3903cf63c6
ref: training flag tests (val_check_interval) (#3825)
* added test_val_check_interval tests

* added test_val_check_interval tests

* added test_val_check_interval tests
2020-10-03 14:05:01 -04:00
ananthsub 8dd37e7c4a
Use fsspec in load to resolve more paths/URLs from storage backends (#3692)
* special case http for torch hub load

* Update CHANGELOG.md

* Update test.txt
2020-10-03 13:29:03 -04:00
William Falcon 0fb8c54fda
remove deprecated test (#3820)
2020-10-03 13:21:10 -04:00
William Falcon d9bc95f83e
ref: bug fix with logging val epoch end + monitor (#3812)
* ref: fix metric err

* ref: fix metric err

* ref: fix metric err

* ref: merge

* ref: merge

* ref: merge

* ref: merge

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix
2020-10-03 12:33:29 -04:00
William Falcon ed1450a293
ref: clean up ddp before final fix (#3817)
* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix
2020-10-03 12:01:02 -04:00
William Falcon 0838c6bfce
ref: decoupled ddp2 (#3816)
2020-10-03 09:02:35 -04:00
Jeff Yang 62320632d4
Some docs update (#3794)
* docs update

* docs update

* suggestions

* Update docs/source/introduction_guide.rst

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-10-03 08:15:07 -04:00
William Falcon a677833f84
ref: separate slurm from ddp (#3809)
* ref: separate slurm from ddp

* ref: separate te from ddp

* ref: merge

* ref: merge

* ref: merge
2020-10-02 23:08:34 -04:00
Brendan Fahy b14c4d4c70
handle fsspec inconsistency in PyArrowHDFS (#3805)
2020-10-02 22:35:42 -04:00
William Falcon 74484edecd
ref: separate te from ddp (#3810)
* ref: separate te from ddp

* ref: separate te from ddp

* ref: separate te from ddp
2020-10-02 21:00:51 -04:00
William Falcon a28528cc8b
ref: remove weight loading hack for ddp_cpu (#3808)
2020-10-02 19:28:50 -04:00
William Falcon afa43837a4
ref: part 8 of #3733 (#3806)
2020-10-02 18:46:18 -04:00
Jeff Yang 9942f3ebdf
Fix `on_train_batch_start` hook to end epoch early (#3700)
* init

* add test

* changelog and docs

* fix test

* Apply suggestion from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-02 21:46:46 +02:00
ananthsub 3ab730e316
Swap torch.load for fsspec load in ddp spawn backend (#3787)
* Update ddp_spawn_backend.py

* Update ddp_cpu_spawn_backend.py

* log

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-10-02 21:00:01 +02:00
ananthsub 192fc018f3
Update model_checkpoint.py (#3801)
2020-10-02 14:49:46 -04:00
William Falcon 7c6ed1fa28
ref: part 7 of #3733 (#3802)
* ref: part 7 of #3733

* ref: part 7 of #3733
2020-10-02 14:23:27 -04:00
Jirka Borovec 22efce8f40
fix warning (#3800)
2020-10-02 13:51:02 -04:00
zcain117 0c12065efd
[TPU CI] Use timestamp+pythonVersion to form the docker image tag. (#3779)
* Use timestamp+pythonVersion to form the docker image tag.

* Remove temporary step to check new env var.
2020-10-02 16:22:47 +02:00
ananthsub 88ad4513c1
Use fsspec with OmegaConf saving in saving.py (#3782)
2020-10-02 15:37:37 +02:00
Nathan Raw 698f90164c
remove torch<1.3.0 warning from tb logger (#3784)
2020-10-02 15:36:55 +02:00
Jirka Borovec 62eabdd535
revert backend types (#3788)
* revert backend types

* todo

* todo
2020-10-02 06:18:44 -04:00
edenlightning ab7d9bd1a5
Add link to PL forum in GH questions template (#3708)
* Update how-to-question.md

* Update how-to-question.md

* Apply suggestions from code review

* typo

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-10-02 12:05:46 +02:00
Jirka Borovec 1160270882
fix path in CI for release & python version in all dockers & duplicated badges (#3765)
* typo

* path

* check

* trigger

* fix conda

* pip ver

* fix cuda

* fix XLA

* fix xla

* ci

* docker

* BUILD

* unBUILD

* update

* py 3.8

* apex

* apex
2020-10-02 05:26:21 -04:00
Akihiro Nitta ebc1b23fa3
Use `raise .. from ..` to explicitly chain exceptions (#3750)
* Fix exception chaining

* names

* Change exception names for consistency

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Change exception names for consistency

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-10-01 21:45:44 +02:00
William Falcon e17712e5c3
part 5 of #3733 (#3774)
* ref: part 4 of #3733

* ref: part 4 of #3733

* ref: part 4 of #3733
2020-10-01 12:34:12 -04:00
William Falcon 622c5c3982
ref: part 4 of #3733 (#3773)
* ref: part 4 of #3733

* ref: part 4 of #3733

* ref: part 4 of #3733

* ref: part 4 of #3733
2020-10-01 11:26:58 -04:00
Jeff Yang 128f9ee931
Fix for PyTorch 1.7 CI (#3768)
* changed to __jit_unused_properties__
2020-10-01 16:37:00 +02:00
Nicki Skafte fe290280be
Metric aggregation testing (#3517)
* aggregation testing

* add more tests

* mse

* more tests

* fix tests

* fix doctest

* fix codefactor

* fix import error

* fix doctest

* revert docfix

* test for model integration

* fix integration test

* added test cases

* fix rmsle

* aggregation testing

* add more tests

* mse

* more tests

* fix tests

* fix doctest

* fix codefactor

* fix import error

* fix doctest

* revert docfix

* test for model integration

* fix integration test

* fix psnr

* add warning/valueerror to embedding similarity

* fixed f scores

* disable some test

* fix tests

* fixing codefactor

* fix pep8

* changelog

* fix doctest

* cleaning test

* fix pickle error

* pickle fix

* fix pickle error

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* code cleanup + changes based on suggestions

* update based on suggestion

* update based on suggestions

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-01 15:37:51 +02:00
William Falcon ac2b0f0f06
ref: continue #3733 (#3767)
* ref: #3733 part 2

* ref: #3733 part 2
2020-10-01 09:25:33 -04:00
William Falcon 440f837f6d
ref: part a of #3733 (#3766)
* ref: part a of #3733

* ref: part a of #3733
2020-10-01 08:15:23 -04:00
Nicki Skafte 9a7d1a1876
[metrics] Accuracy num_classes error fix (#3764)
* change accuracy error to warning

* changelog
2020-10-01 13:00:42 +02:00
Lezwon Castelino 8be002ccc7
skip best_model_path if checkpoint_callback is None (#2962)
* skip best_model_path if checkpoint_callback is None

* removed test
2020-10-01 06:57:26 -04:00
GimmickNG e4e60e9b82
Add datamodule parameter to lr_find() (#3425)
* Add datamodule parameter to lr_find()

* Fixed missing import

* Move datamodule parameter to end

* Add datamodule parameter test with auto_lr_find

* Change test for datamodule parameter

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Fix lr_find documentation

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* formatting

* Add description to datamodule param in lr_find

* pep8: remove trailing whitespace on line 105

* added changelog

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: Nicki Skafte <nugginea@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-01 10:33:12 +02:00
William Falcon 7c61fc7c27
ref: fixes logging for eval steps (#3763)
* fixes logging for eval steps
2020-10-01 02:31:11 -04:00
Teddy Koker 5ec00ccd28
Added gradient clip test for native AMP (#3754)
* added gradient clip test for fp16

* pep8
2020-10-01 01:36:34 -04:00
William Falcon a38d108a68
add dist lib to enable syncing anything across devices (#3762)
* add dist lib to enable syncing anything across devices
2020-10-01 01:21:38 -04:00
William Falcon cf182e80fc
Finish Allow on_save_checkpoint... (#3688)
* Finish #3562

* Apply suggestions from code review

* Apply suggestions from code review

* fix tests

* Finish #3562

* Apply suggestions from code review

* Apply suggestions from code review

* fix tests

* fix structure

* fix structure

* make save_last test pass

* unnecessary global rank check

* fix test

* update test

* update test

* test

* test

* run save on all

* remove assert

* tracking saves

* check if fails

* test

* clean up

* adjust horovod test

* clean up

* remove unnecessary makedirs

* change

* undo

* debug

* debug

* debug

* debug

* mock

* undo debug code

* add extra assertions

* test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <adrian.waelchli@inf.unibe.ch>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-09-30 16:15:29 -04:00
ananthsub 1eb1d17e25
Add trainer attribute to datamodule (#3749)
* Split out changes from #3563 to make that PR easier to review. This formats the file according to the Black formatter

* Store a reference to the trainer on the datamodule

Fixes #3682

* Update data_connector.py

* Update data_connector.py

* Update test_datamodules.py

* Add attribute to datamodule for trainer
2020-10-01 00:41:19 +05:30