Commit Graph

791 Commits

Author SHA1 Message Date
William Falcon b34c7add23
Fixes #3668, #3887 as a bonus (#3888)
* Fixes #3668, #3887 as a bonus

* Fixes #3668, #3887 as a bonus
2020-10-05 21:30:41 -04:00
Nathan Raw 1954d7c87a
Write predictions in LightningModule instead of EvalResult (#3882)
* add self.write_prediction

* add self.write_prediction_dict to lightning module
2020-10-05 18:04:02 -04:00
Jean-Baptiste SCHIRATTI cea5f1f538
Fix for `load_from_checkpoint` (#2776)
* Fix.

* Fix #2550: allow to load model from checkpoint if self.save_hyperparameters() was not called.

* Fix? Cleaner way of not calling self.save_hyperparameters in EvalModelTemplate.

* Fix? `_load_model_state` cleanup

* Fix?

* Fix #2550: allow to load model from checkpoint if self.save_hyperparameters() was not called.

* Fix.

* Fix? Cleaner way of not calling self.save_hyperparameters in EvalModelTemplate.

* Fix? `_load_model_state` cleanup

* Fixed side effect in `test_load_model_from_checkpoint_extra_args`.

* Apply suggestions from code review

* fix

* try

* fixed missing arg in evalmodel

* fixed missing arg in evalmodel

* fix

* update

* fix loading

* add test

* prune

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-10-05 12:44:23 -04:00
Nrupatunga 7d47ed178b
[Bug-Fix]: properties `current_epoch` and `global_step` always the same between model and trainer (#3785)
* make current_epoch and global_step to be same as trainer, after model restore.

* remove assignment here

* test

* minor modification

* Update pytorch_lightning/core/lightning.py

type check, better clarity

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/core/lightning.py

type check, better clarity

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* comments for current_epoch and global_step properties

* Update tests/models/test_restore.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update comments according to the changes made

* Update tests/models/test_restore.py

* add current_epoch, global_step to jit ignore list

* Add comments to CHANGELOG

* Update CHANGELOG.md

* Update tests/models/test_restore.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-05 11:10:40 -04:00
Jirka Borovec 6ac0958166
fix init nan for checkpointing (#3863)
* add test for checkpoint nan

* fix

* pep
2020-10-05 07:36:12 -04:00
William Falcon b014223f72
Fixes #2678 - enables training_step to return None (#3862)
* Fixes #2678 - enables training_step to return None

* Fixes #2678 - enables training_step to return None
2020-10-05 07:33:46 -04:00
William Falcon d787208e76
Fixes #2792 (#3857)
2020-10-04 23:25:02 -04:00
Adrian Wälchli ab5e9496d0
refactor (#3851)
2020-10-04 23:23:58 -04:00
William Falcon f58c760409
Fixes #2551 (#3858)
2020-10-04 23:02:35 -04:00
William Falcon 97e62b38cf
Fixed #2143 and many more :) (#3855)
2020-10-04 22:18:49 -04:00
William Falcon d9656d166c
fixed model checkpoint frequency (#3852)
* fixed model checkpoint frequency

* fixed model checkpoint frequency

* fixed model checkpoint frequency

* fixed model checkpoint frequency

* merged
2020-10-04 21:49:20 -04:00
Adrian Wälchli e0f8505394
Mocking loggers (part 2, neptune) (#3617)
* mock neptune base tests

* neptune doctest

* remove extra

* mock loggers

* typo

* mock import

* neptune not compatible with multigpu

* add back experiment
2020-10-04 21:20:06 -04:00
William Falcon 2bca89a752
added tbptt test for logging (#3850)
* added tbptt test for logging

* added tbptt test for logging
2020-10-04 19:38:42 -04:00
William Falcon 00f0d19a61
fixes #3798 (#3849)
* fix #3798

* added tbptt test for logging
2020-10-04 19:36:51 -04:00
Adrian Wälchli cc9781a0ad
Deprecate early_stop_callback Trainer argument (part 2) (#3845)
* update tests with EarlyStopping default

* imports

* revert legacy tests

* fix test

* revert

* revert
2020-10-04 17:36:47 -04:00
Carlos Mocholí 89cc12311f
Fix tbptt_reduce_fx when non-floating tensors are logged (#3796)
* Add failing test

* force all tbptt vals to be floats for reduce

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-10-04 17:10:25 -04:00
Rohit Gupta d3696052cf
Add back sanity checks (#3846)
* Add back sanity checks

* pep
2020-10-04 17:05:26 -04:00
William Falcon 70e792344a
test selecting the correct backend. temp backends while slurm and TE are decoupled (#3848)
* test selecting the correct backend. temp backends while slurm and TE are decoupled

* test selecting the correct backend. temp backends while slurm and TE are decoupled
2020-10-04 15:44:50 -04:00
William Falcon 1aa9d39506
Eval epoch can now log independently (#3843)
* ref: routed epoch outputs to logger

* ref: routed epoch outputs to logger

* ref: routed epoch outputs to logger

* ref: routed epoch outputs to logger
2020-10-04 13:36:35 -04:00
Rohit Gupta a628d181ee
Fix val_progress_bar total with num_sanity_val_steps (#3751)
* Fix val_progress_bar total with num_sanity_val_steps

* chlog

* Fix val_progress_bar total with num_sanity_val_steps

* move test

* replaced with sanity flag and suggestions
2020-10-04 08:32:18 -04:00
Lezwon Castelino 4da240ea1b
added broadcast option to tpu (#3814)
* added broadcast option to tpu

* add device

* moved tpu broadcast to tpu_backend

* removed Lightning dist

* decode bytes

* pep8 fix

* fix bug

* test for broadcast

* updated changelog
2020-10-04 07:47:33 -04:00
William Falcon 66aef10239
verified epoch logging (#3830)
* ref: fix epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging

* verified epoch logging
2020-10-03 21:17:24 -04:00
William Falcon 35d1111994
[WIP] ref: decoupled ddp, ddp spawn (finish 3733) (#3819)
* ref: finish #3733

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* remove deprecated test

* Update pytorch_lightning/accelerators/ddp_backend.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* remove deprecated test

* remove deprecated test

* remove deprecated test

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-10-03 14:05:31 -04:00
William Falcon 3903cf63c6
ref: training flag tests (val_check_interval) (#3825)
* added test_val_check_interval tests

* added test_val_check_interval tests

* added test_val_check_interval tests
2020-10-03 14:05:01 -04:00
William Falcon 0fb8c54fda
remove deprecated test (#3820)
2020-10-03 13:21:10 -04:00
William Falcon d9bc95f83e
ref: bug fix with logging val epoch end + monitor (#3812)
* ref: fix metric err

* ref: fix metric err

* ref: fix metric err

* ref: merge

* ref: merge

* ref: merge

* ref: merge

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix
2020-10-03 12:33:29 -04:00
Jeff Yang 9942f3ebdf
Fix `on_train_batch_start` hook to end epoch early (#3700)
* init

* add test

* changelog and docs

* fix test

* Apply suggestion from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-02 21:46:46 +02:00
Jirka Borovec 62eabdd535
revert backend types (#3788)
* revert backend types

* todo

* todo
2020-10-02 06:18:44 -04:00
Jirka Borovec 1160270882
fix path in CI for release & python version in all dockers & duplicated badges (#3765)
* typo

* path

* check

* trigger

* fix conda

* pip ver

* fix cuda

* fix XLA

* fix xla

* ci

* docker

* BIULD

* unBIULD

* update

* py 3.8

* apex

* apex
2020-10-02 05:26:21 -04:00
Akihiro Nitta ebc1b23fa3
Use `raise .. from ..` to explicitly chain exceptions (#3750)
* Fix exception chaining

* names

* Change exception names for consistency

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Change exception names for consistency

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-10-01 21:45:44 +02:00
William Falcon e17712e5c3
part 5 of #3733 (#3774)
* ref: part 4 of #3733

* ref: part 4 of #3733

* ref: part 4 of #3733
2020-10-01 12:34:12 -04:00
William Falcon 622c5c3982
ref: part 4 of #3733 (#3773)
* ref: part 4 of #3733

* ref: part 4 of #3733

* ref: part 4 of #3733

* ref: part 4 of #3733
2020-10-01 11:26:58 -04:00
Nicki Skafte fe290280be
Metric aggregation testing (#3517)
* aggregation testing

* add more tests

* mse

* more tests

* fix tests

* fix doctest

* fix codefactor

* fix import error

* fix doctest

* revert docfix

* test for model integration

* fix integration test

* added test cases

* fix rmsle

* aggregation testing

* add more tests

* mse

* more tests

* fix tests

* fix doctest

* fix codefactor

* fix import error

* fix doctest

* revert docfix

* test for model integration

* fix integration test

* fix psnr

* add warning/valueerror to embedding similarity

* fixed f scores

* disable some test

* fix tests

* fixing codefactor

* fix pep8

* changelog

* fix doctest

* cleaning test

* fix pickle error

* pickle fix

* fix pickle error

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* code cleanup + changes based on suggestions

* update based on suggestion

* update based on suggestions

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-01 15:37:51 +02:00
William Falcon ac2b0f0f06
ref: continue #3733 (#3767)
* ref: #3733 part 2

* ref: #3733 part 2
2020-10-01 09:25:33 -04:00
William Falcon 440f837f6d
ref: part a of #3733 (#3766)
* ref: part a of #3733

* ref: part a of #3733
2020-10-01 08:15:23 -04:00
Nicki Skafte 9a7d1a1876
[metrics] Accuracy num_classes error fix (#3764)
* change accuracy error to warning

* changelog
2020-10-01 13:00:42 +02:00
GimmickNG e4e60e9b82
Add datamodule parameter to lr_find() (#3425)
* Add datamodule parameter to lr_find()

* Fixed missing import

* Move datamodule parameter to end

* Add datamodule parameter test with auto_lr_find

* Change test for datamodule parameter

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Fix lr_find documentation

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* formatting

* Add description to datamodule param in lr_find

* pep8: remove trailing whitespace on line 105

* added changelog

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: Nicki Skafte <nugginea@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-01 10:33:12 +02:00
Teddy Koker 5ec00ccd28
Added gradient clip test for native AMP (#3754)
* added gradient clip test for fp16

* pep8
2020-10-01 01:36:34 -04:00
William Falcon a38d108a68
add dist lib to enable syncing anything across devices (#3762)
* add dist lib to enable syncing anything across devices
2020-10-01 01:21:38 -04:00
William Falcon cf182e80fc
Finish Allow on_save_checkpoint... (#3688)
* Finish #3562

* Apply suggestions from code review

* Apply suggestions from code review

* fix tests

* Finish #3562

* Apply suggestions from code review

* Apply suggestions from code review

* fix tests

* fix structure

* fix structure

* make save_last test pass

* unnecessary global rank check

* fix test

* update test

* update test

* test

* test

* run save on all

* remove assert

* tracking saves

* check if fails

* test

* clean up

* adjust horovod test

* clean up

* remove unnecessary makedirs

* change

* undo

* debug

* debug

* debug

* debug

* mock

* undo debug code

* add extra assertions

* test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <adrian.waelchli@inf.unibe.ch>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-09-30 16:15:29 -04:00
Adrian Wälchli c73032e39d
Make ModelCheckpoint(save_top_k=-1) track the best models (#3735)
* fix topk=-1 tracking best

* update test

* clean up

* add changelog

* enable loading best topk in trainer.test()

* make trivial

* return right away

* make windows test path happy
2020-09-30 08:34:02 -04:00
Jirka Borovec 31a36f04df
define distributed as a type (#3740)
* define type

* miss

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* miss

* warn

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-09-30 08:33:01 -04:00
Adrian Wälchli 9405c880af
log/save_interval based on global step (#3667)
* log interval based on global step

* test

* test

* test

* test

* pep

* pep

* added changelog

* pep

* merge

* remove unused arg
2020-09-30 12:26:27 +02:00
William Falcon b3be8022bd
tests for val step flow and logging (#3731)
* ref: test val epoch end

* ref: test val epoch end

* ref: test val epoch end

* ref: test log dict

* ref: test log dict

* ref: test log dict

* ref: test log dict
2020-09-29 22:12:56 -04:00
ananthsub 3dcf7130c5
Support checkpoint hooks on data module (#3563)
* Split out changes from #3563 to make that PR easier to review. This formats the file according to the Black formatter

* Store a reference to the trainer on the datamodule

Fixes #3682

* Update data_connector.py

* Update data_connector.py

* Update test_datamodules.py

* Split out changes from #3563 to make that PR easier to review. This formats the file according to the Black formatter

* support checkpoint hooks for datamodule

refactor on_{save/load}_checkpoint to a separate hook class that both the lightning module and data module inherit
add spots in callback connector to call new datamodule hooks if available

* hooks formatting

* Update hooks.py

* Update checkpoint_connector.py

* Update lightning.py

* update based on upstream/master

checkout upstream/master

* Update checkpoint_connector.py

* add tests

* undo format revert

* Updated CHANGELOG.md

* add checkpoint hooks

* add Dict type

* import CheckpointHooks
2020-09-29 19:51:44 +02:00
William Falcon c14928a72a
ref: test val flow steps (#3723)
* ref: test val epoch end

* ref: test val epoch end

* ref: test val epoch end
2020-09-29 11:42:38 -04:00
Maxim Grechkin 7bb139816a
Add a more direct test of multi-gpu training working (#2084)
* Add a more direct test of multi-gpu training working

* Update tests/base/develop_pipelines.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-09-29 15:38:09 +02:00
Carlos Mocholí 3b2efe5b2a
Fix ModelCheckpoint period (#3630)
* Fix ModelCheckpoint period

* Remove comma

* Minor changes

* skip check

* Revert "skip check"

Already pushed to master

This reverts commit 00d9e77b81.

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-09-29 15:36:45 +02:00
William Falcon f42ea303c9
ref: enable self.log for eval loop metrics (#3715)
* ref: test val epoch end

* ref: test val epoch end

* ref: test val epoch end

* ref: test val epoch end

* ref: test val epoch end

* ref: test val epoch end
2020-09-29 02:00:28 -04:00
William Falcon c41ea86b35
ref: move backends back to individual files (1/5) (ddp_cpu) (#3712)
* ref: make each backend independent for easier debugging and independent debugging

* ref: make each backend independent for easier debugging and independent debugging

* ref: make each backend independent for easier debugging and independent debugging

* ref: make each backend independent for easier debugging and independent debugging

* ref: make each backend independent for easier debugging and independent debugging

* ref: make each backend independent for easier debugging and independent debugging

* ref: test val epoch end

* ref: test val epoch end
2020-09-29 01:59:18 -04:00