Commit Graph

231 Commits

Sean Naren f0ab74dc2f
Expose scaler in amp plugin (#4737) 2020-11-18 22:30:47 +00:00
chaton 4018237c30
[FEAT] Add lambda closure to manual_optimizer_step (#4618)
* added lambda_closure

* move to types

* add 2 new tests

* make example more complex

* add complex example to doc

* added more tests

* resolve doc

* typo

* update

* update tpu optimizer_step

* Apply suggestions from code review

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-12 19:22:06 +00:00
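For context on the feature: under manual optimization, the closure bundles the forward/backward pass so the optimizer step can re-run it. A minimal sketch, assuming the PL 1.0.x-era API from this PR (`manual_backward`, `manual_optimizer_step`, and its `optimizer_closure` keyword; exact argument names may differ in other versions):

```python
import torch
import pytorch_lightning as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()

        # The closure recomputes loss and gradients; optimizers such as
        # LBFGS call it several times per step.
        def closure():
            loss = self.layer(batch).sum()
            self.manual_backward(loss, opt)

        self.manual_optimizer_step(opt, optimizer_closure=closure)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# In this release, manual optimization is enabled on the Trainer:
# trainer = pl.Trainer(automatic_optimization=False)
```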
chaton 514cb22bd7
[Fix] Move log value to cpu. (#4592)
* move value to cpu to save memory

* update

* move to cpu

* try something

* update

* update

* add back out_dict.update({k: v})

* add move_metrics_to_cpu

* update

* Update pytorch_lightning/utilities/memory.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* resolve comments

* Update pytorch_lightning/core/step_result.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-10 21:13:41 +00:00
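The user-facing change here is the `move_metrics_to_cpu` flag named in the commit messages above. A minimal usage sketch (behaviour as described in the PR: logged values are moved off the GPU to save memory):

```python
import pytorch_lightning as pl

# Offloading logged metrics to CPU trades a small device-to-host copy
# for lower GPU memory usage, which matters when logging many values.
trainer = pl.Trainer(
    gpus=1,                    # PL 1.0.x-era device selection flag
    move_metrics_to_cpu=True,  # flag introduced in #4592
)
```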
chaton 7e08b0d710
[bug-fix] DDP and automatic_optimization=False (#4485)
* resolve bug

* add self._running_manual_optim

* update

* update tests

* update lightning module

* resolve bug

* update tests

* update

* resolve pep8

* update

* replace by `ddp_spawn`

* temporary fix

* update

* update

* move update to training_loop

* make both ddp_spawn

* introduce `manual_optimizer_step`

* update changelog

* added changelog in the wrong place

* add force_optimizer_step

* update docstring for tests

* update optimizer_step

* update zero_grad

* resolve flake8

* move update into manual_optimizer_step

* add zero_grad

* remove zero_grad tests

* remove manual_backward in AMP, it doesn't help

* update

* loosen tests

* update

* update doc

* add TODO

* Removed unnecessary get model from native amp

* Remove try except with pytest raise

* Add seed, clean up imports, remove try catch to reproduce error

* update code

* update test

* revert back

* formatting

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-10 19:44:51 +00:00
chaton 9c8701f2e2
[feat] Logging refactor 2/n - train (#4495)
* update logging

* solve more bugs

* replace Mapping by Dict

* update on comments

* resolve pep8

* Apply suggestions from code review

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* typo

* update for coverage

* update test

* update

* Update tests/models/test_hooks.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Update tests/models/test_hooks.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* update on comments

* remove deepcopy

* remove useless lookup

* another small optim

* extra optim

* remove latest optim, can be a source of bugs

* resolve bug

* add docstring

* optimize coverage

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/logging_tests/test_distributed_logging.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/evaluation_loop.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/logging/test_logger_connector.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/logging_tests/test_train_loop_logging_1_0.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* update

* update on comments

* update parity speed

* get it down to 0.65

* update

* 0.8 max_dif

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-11-05 22:27:04 +00:00
Rohit Gupta 360b3d8844
Disable training when limit_train_batches=0 (#4371)
* Disable training when limit_train_batches=0

* chlog

* pep

* limit_train_batches

* BoringModel

Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-11-03 12:10:35 +05:30
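As the title states, `limit_train_batches=0` now disables the training loop outright rather than iterating an empty schedule. A sketch:

```python
import pytorch_lightning as pl

# With this change, a zero limit means no training batches run at all
# (and, per the related PR below, no checkpoints are saved untrained).
trainer = pl.Trainer(limit_train_batches=0)
# trainer.fit(model)  # completes without executing any training_step
```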
Rohit Gupta ad2556b669
Disable saving checkpoints if not trained (#4372)
* Disable saving checkpoints if not trained

* chlog

* update test

* fix

Co-authored-by: chaton <thomas@grid.ai>
2020-11-03 11:38:32 +05:30
chaton 958aa1aee7
[test] Accumulated gradient optimization tests (#4477)
* adding tests

* wip

* update

* Update tests/trainer/test_trainer.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-02 23:44:11 +00:00
chaton ac3f7393fd
[FEAT] logging refactors 1/n (#4439)
* introducing new logging object

* typo

* typo

* Update pytorch_lightning/trainer/logging.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* Update pytorch_lightning/trainer/logging.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* update on comments

* update on comments

* add more docstrings

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* resolve on comments

* solve pyright

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* update on comments

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* update on comments

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-02 20:51:43 +00:00
chaton 102fa9ee7d
[BUGFIX] AMP + Precision unscale grad (#4441)
* move unscale within Native plugin

* remove gradient tracking from lightning backward

* forgot trainer.fit

* typo

* update

* cleanup

* set to 1.6

* typo

* skip if below 1.6 strict

* update changelog

* remove useless code

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* update changelog

* Update CHANGELOG.md

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-11-02 16:36:48 +00:00
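The fix's core idea follows the standard native-AMP recipe: unscale gradients before clipping so the clip threshold applies to true gradient magnitudes. A plain-PyTorch sketch (torch >= 1.6, matching the version gate in the commits above):

```python
import torch


def amp_training_step(model, optimizer, scaler, batch, clip_val=1.0):
    """Sketch of a native-AMP step with unscale-before-clip."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch).sum()
    scaler.scale(loss).backward()
    # Unscale first so clipping sees real gradient magnitudes; clipping
    # scaled gradients would apply the threshold at the wrong scale.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_val)
    scaler.step(optimizer)  # skips the update if gradients overflowed
    scaler.update()
```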
Justus Schock bbd81dfd55
Skips DDP parameter sync (#4301)
* ddp no-sync

* Update pytorch_lightning/trainer/training_loop.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update training_loop.py

* factor __enter__ and __exit__ out to separate context manager

* delete _updated_model_last_step

Co-authored-by: justusschock <justusschock@pc125.lfb.rwth-aachen.de>
Co-authored-by: Teddy Koker <teddy.koker@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-10-29 23:01:37 +05:30
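The mechanism behind this PR is `DistributedDataParallel.no_sync()`, which suppresses gradient all-reduce during intermediate accumulation steps. A standalone plain-PyTorch sketch of the idea (the model is assumed to already be DDP-wrapped; not the Lightning internals):

```python
import contextlib


def accumulated_steps(ddp_model, optimizer, batches, accumulate_grad_batches):
    """Sync gradients across ranks only on the last of every N steps."""
    optimizer.zero_grad()
    for i, batch in enumerate(batches, start=1):
        sync_now = i % accumulate_grad_batches == 0
        # Inside no_sync(), gradients accumulate locally with no
        # all-reduce; the sync happens on the step taken outside it.
        ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
        with ctx:
            loss = ddp_model(batch).sum() / accumulate_grad_batches
            loss.backward()
        if sync_now:
            optimizer.step()
            optimizer.zero_grad()
```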
Rohit Gupta b26c71eadf
Add optimizer hooks in callbacks (#4379)
* Add optimizer hooks in callbacks

* optimizer param

* update test

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-10-28 13:15:22 +01:00
chaton 3abfec8962
[HOTFIX] ModelCheckpoint - Don't increase current_epoch and global_step if not trained (#4291)
* add two tests w/wo tempdir

* resolve flake8

* this test is failing

* update bug report

* resolve bug and add test

* remove bug_report

* resolve flake8

* resolve bug

* resolve pep8

* resolve pep8

Co-authored-by: Teddy Koker <teddy.koker@gmail.com>
2020-10-23 11:17:50 +01:00
Sean Naren 9823f97a84
Protect functions that should not be accessed by the user (#4305) 2020-10-22 15:15:04 +01:00
Sean Naren 065cc94112
Fix bug comparing max_steps to global step which inits at 0 (#4278)
* Fix bug comparing max_steps to global step which inits at 0

* Added test to ensure accumulate grad batch works with max steps

* check fix with TODO test

* correct call counts

* Add check to ensure we've finished accumulation of this global step before exiting loop in conjunction with max steps

* Remove + 1 check in test as this was incorrect

* Update incorrect expected outputs in lr finder test

* Added brackets for clarity

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-22 13:58:59 +01:00
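The off-by-one at issue: `global_step` starts at 0 and counts completed optimizer steps, so after N steps it equals N and the stop condition must compare accordingly. A sketch of the corrected check (names illustrative, not the exact trainer internals):

```python
from typing import Optional


def should_stop(global_step: int, max_steps: Optional[int]) -> bool:
    # global_step is 0 before any optimizer step and equals N after N
    # completed steps, so stop once it reaches max_steps -- not before.
    return max_steps is not None and global_step >= max_steps
```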
Justus Schock 0ec4107697
Optimizer closure (#4190)
* closure for all optimizers

* rename hook and take care of alternating backwards

* add comment

* training_loop_fix

* closure whenever possible

* training_loop

* simple tests that count backward calls

* fix test to work with closure

* remove debugging statement

* better place

* check grads after backward

* start fixing manual optimization

* skip step when result returned by closure was None

* fix gradient clipping test to work with closure

* attribute dict result only for automatic optimization

* adjust backward calls in accelerator

* adjust where to call gradient clipping

* adjust backward calls in tests

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* pass kwargs to xla optimizer

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-21 19:34:29 +01:00
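The design mirrors PyTorch's own contract: `optimizer.step()` accepts an optional closure that re-evaluates the loss and runs backward, which optimizers such as LBFGS require. A plain-PyTorch sketch of that contract:

```python
def train_batch(model, optimizer, batch):
    def closure():
        optimizer.zero_grad()
        loss = model(batch).sum()
        loss.backward()
        return loss

    # All torch optimizers accept a closure; LBFGS calls it several
    # times per step, the others call it once and return its loss.
    return optimizer.step(closure)
```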
William Falcon 72f19768c8
remove duplicate metric vs step log for train loop (#4173)
* remove duplicate metric vs step log

* remove duplicate metric vs step log

* remove duplicate metric vs step log

* fix ddp index issue
2020-10-15 10:47:00 -04:00
Jirka Borovec f064682786
save initial arguments (#4163)
* save initial arguments

* typing

* chlog

* .
2020-10-15 08:30:49 -04:00
William Falcon bf2067a609
enabled manual returns (#4089) 2020-10-12 10:06:17 -04:00
William Falcon 0281b077d8
ref: decouple apex second attempt part 10/n (#4064)
* ref: decouple apex second attempt part 9/n

* ref: decouple apex second attempt part 9/n

* ref: decouple apex second attempt part 9/n
2020-10-10 20:05:05 -04:00
William Falcon 5ce9fc6bb3
ref: decouple apex second attempt part 7/n (#4061)
* ref: decouple apex second attempt part 7/n

* ref: decouple apex second attempt part 7/n

* ref: decouple apex second attempt part 7/n
2020-10-10 16:44:15 -04:00
William Falcon d1bbb449a3
ref: decouple apex second attempt part 5/n (#4058) 2020-10-10 14:35:25 -04:00
William Falcon ce2edf1192
ref: decouple apex second attempt part 4/n (#4056)
* ref: decouple apex second attempt part 4/n

* ref: decouple apex second attempt part 4/n

* Update lightning.py

* ref: decouple apex second attempt part 4/n
2020-10-10 12:19:22 -04:00
William Falcon 7285613974
ref: decouple apex second attempt part 2/n (#4054)
* ref: decouple apex second attempt part 2/n

* ref: decouple apex second attempt part 2/n
2020-10-10 10:24:20 -04:00
Nrupatunga fcfa587492
Bugfix/update trainer properties (#3975)
* make current_epoch and global_step the same as the trainer's after model restore.

* remove assignment here

* test

* minor modification

* merge with parent's master

* [bug-fix]: update trainer properties

* minor comment fix

* minor comment fix

* reset train loader in `on_train_epoch_start` hook

* makes sure the changes work

* minor change

* update changelog

* adding unit test for reload_dataloaders_every_epoch arg

* modified changelog, to add PR number

* revert imports

* changes to unit test

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-08 10:20:55 -04:00
William Falcon 048a816be3
added tests for the training epoch end (#3967) 2020-10-07 22:27:36 -04:00
William Falcon 4c0d063c86
outputs in __batch_end hooks (#3966)
* train_batch_end outputs

* added tests for the output hooks
2020-10-07 21:48:38 -04:00
William Falcon 2cf17a3718
Adds tests to make sure logging doesn't happen multiple times (#3899)
* Makes sure logging doesn't ever happen from non-root zero

* Makes sure logging doesn't ever happen from non-root zero

* Makes sure logging doesn't ever happen from non-root zero

* added bug report model

* fix local model

* fix local model

* fix local model

* fix local model
2020-10-06 12:43:51 -04:00
Teddy Koker 9600926619
Rename log_save_interval, row_log_interval (#3748)
* Rename row_log_interval -> log_every_n_steps
log_save_interval -> flush_logs_every_n_steps

* Changelog

* fixed title underline length

* typo

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* pep8 + deprecation test

* 'todo: remove in 1.1 comment'

* 1.1 -> 0.11

* log

* docs

* depr API

* add depr tests

* note

* miss

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-10-06 10:27:06 -04:00
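For reference, the rename maps the old Trainer arguments one-to-one onto the new ones; per the deprecation tests above, the old names are deprecated with a removal note rather than dropped immediately. A sketch:

```python
import pytorch_lightning as pl

# Before: pl.Trainer(row_log_interval=50, log_save_interval=100)
trainer = pl.Trainer(
    log_every_n_steps=50,          # was row_log_interval
    flush_logs_every_n_steps=100,  # was log_save_interval
)
```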
Nrupatunga 7d47ed178b
[Bug-Fix]: keep properties `current_epoch` and `global_step` always the same between model and trainer (#3785)
* make current_epoch and global_step the same as the trainer's after model restore.

* remove assignment here

* test

* minor modification

* Update pytorch_lightning/core/lightning.py

type check, better clarity

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/core/lightning.py

type check, better clarity

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* comments for current_epoch and global_step properties

* Update tests/models/test_restore.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update comments according to the changes made

* Update tests/models/test_restore.py

* add current_epoch, global_step to jit ignore list

* Add comments to CHANGELOG

* Update CHANGELOG.md

* Update tests/models/test_restore.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-05 11:10:40 -04:00
William Falcon b014223f72
Fixes #2678 - enables training_step to return None (#3862)
* Fixes #2678 - enables training_step to return None

* Fixes #2678 - enables training_step to return None
2020-10-05 07:33:46 -04:00
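In practice this means `training_step` can skip a batch by returning `None` instead of a loss. A minimal sketch (`compute_loss` is a hypothetical helper):

```python
import torch
import pytorch_lightning as pl


class SkippableModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        # Returning None skips this batch: no backward pass and no
        # optimizer step, rather than an error as before this fix.
        if not torch.isfinite(loss):
            return None
        return loss
```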
William Falcon d9656d166c
fixed model checkpoint frequency (#3852)
* fixed model checkpoint frequency

* fixed model checkpoint frequency

* fixed model checkpoint frequency

* fixed model checkpoint frequency

* merged
2020-10-04 21:49:20 -04:00
William Falcon d9bc95f83e
ref: bug fix with logging val epoch end + monitor (#3812)
* ref: fix metric err

* ref: fix metric err

* ref: fix metric err

* ref: merge

* ref: merge

* ref: merge

* ref: merge

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix
2020-10-03 12:33:29 -04:00
Jeff Yang 9942f3ebdf
Fix `on_train_batch_start` hook to end epoch early (#3700)
* init

* add test

* changelog and docs

* fix test

* Apply suggestion from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-02 21:46:46 +02:00
William Falcon ac2b0f0f06
ref: continue #3733 (#3767)
* ref: #3733 part 2

* ref: #3733 part 2
2020-10-01 09:25:33 -04:00
William Falcon a38d108a68
add dist lib to enable syncing anything across devices (#3762)
* add dist lib to enable syncing anything across devices
2020-10-01 01:21:38 -04:00
Adrian Wälchli 9405c880af
log/save_interval based on global step (#3667)
* log interval based on global step

* test

* test

* test

* test

* pep

* pep

* added changelog

* pep

* merge

* remove unused arg
2020-09-30 12:26:27 +02:00
William Falcon cdd7266cd8
ref: enable self.log from val step (#3701)
* .log in eval

* ref

* ref: enable self.log in val step
2020-09-28 10:49:07 -04:00
William Falcon ddd11075bd
[WIP] ref: deprecated results obj, added support for simpler comms (1/n) (#3681)
* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* fix global step err

* fix global step err

* fix global step err

* fix global step err

* fix global step err

* fix typing err

* fix str

* fix typing err
2020-09-27 23:19:46 -04:00
Adrian Wälchli f37e9e8a83
Fix global step increment on training_epoch_end (#3673)
* fix

* fix global step err

* fix global step err

* fix global step err

* fix global step err

* fix global step err

* fix global step err

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-09-27 20:19:51 -04:00
Carlos Mocholí ed12e422a4
Fix incorrect "Saving latest checkpoint" warning (#3588)
* Fix incorrect "Saving latest checkpoint" warning

* Replace warning with info. Run PyCharm's optimize imports

* Remove unused class variable. Refactor logic. Improve test

* Fix De Morgan's
2020-09-25 14:18:06 +02:00
William Falcon 21cfdf6874
ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax) (#3571)
* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* force crash when max_epochs < epochs in a checkpoint

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-09-20 22:58:43 -04:00
Adrian Wälchli 4ed96b2eb4
fix gradient norm tracking for row_log_interval > 1 (#3489)
* fix + test

* changelog

* Apply suggestions from code review

Co-authored-by: Tim Chard <timchard@hotmail.com>

* improve test

Co-authored-by: Tim Chard <timchard@hotmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-09-15 18:41:27 +02:00
William Falcon 09679aee32
Silenced some warnings. Verified DDP refactors (#3483)
* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* Update ddp_base_backend.py
2020-09-13 21:10:37 -04:00
William Falcon 59d8472548
ref: slurm connector 1/n (#3476)
* ref: slurm connector 1/n

* ref: slurm connector 1/n

* ref: slurm connector 1/n

* ref: slurm connector 1/n
2020-09-12 11:07:15 -04:00
William Falcon 4724cdf5e0
ref: checkpoint connector methods 3/n 2020-09-12 07:05:21 -04:00
William Falcon ef20310873
ref: move specific accelerator code x/n (#3457)
* ref: organize args x/n

* ref: move specific accelerator code x/n

* ref: move specific accelerator code x/n

* ref: move specific accelerator code x/n
2020-09-11 10:56:21 -04:00
William Falcon 70af47db84
ref: organize args 4/n (#3456) 2020-09-10 21:58:47 -04:00
Rohit Gupta a1ea681c47
Fix batch_outputs with optimizer frequencies (#3229)
* Fix batch_outputs with optimizer frequencies

* optimizers

* fix batch_outputs with optimizer frequencies

* clean test

* suggestion

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

* failing doctest

* failing doctest

* update doctest

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-09-10 23:01:20 +02:00
William Falcon 3281586ab4
ref: organize args 3/n (#3449)
* ref: organize args 3/n

* ref: organize args 3/n

* ref: organize args 3/n

* ref: organize args 3/n

* ref: organize args 3/n

* ref: organize args 3/n
2020-09-10 13:21:04 -04:00