Commit Graph

231 Commits

Sean Naren f0ab74dc2f
Expose scaler in amp plugin (#4737) 2020-11-18 22:30:47 +00:00
chaton 4018237c30
[FEAT] Add lambda closure to manual_optimizer_step (#4618)
* added lambda_closure

* move to types

* add 2 new tests

* make example more complex

* add complex example to doc

* added more tests

* resolve doc

* typo

* update

* update tpu optimizer_step

* Apply suggestions from code review

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-12 19:22:06 +00:00
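For context on the feature: under manual optimization, the closure bundles the forward/backward pass so the optimizer step can re-run it. A minimal sketch, assuming the PL 1.0.x-era API from this PR (`manual_backward`, `manual_optimizer_step`, and its `optimizer_closure` keyword; exact argument names may differ in other versions):

```python
import torch
import pytorch_lightning as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()

        # The closure recomputes loss and gradients; optimizers such as
        # LBFGS call it several times per step.
        def closure():
            loss = self.layer(batch).sum()
            self.manual_backward(loss, opt)

        self.manual_optimizer_step(opt, optimizer_closure=closure)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# In this release, manual optimization is enabled on the Trainer:
# trainer = pl.Trainer(automatic_optimization=False)
```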
chaton 514cb22bd7
[Fix] Move log value to cpu. (#4592)
* move value to cpu to save memory

* update

* move to cpu

* try something

* update

* update

* add back out_dict.update({k: v})

* add move_metrics_to_cpu

* update

* Update pytorch_lightning/utilities/memory.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* resolve comments

* Update pytorch_lightning/core/step_result.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-10 21:13:41 +00:00
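The user-facing change here is the `move_metrics_to_cpu` flag named in the commit messages above. A minimal usage sketch (behaviour as described in the PR: logged values are moved off the GPU to save memory):

```python
import pytorch_lightning as pl

# Offloading logged metrics to CPU trades a small device-to-host copy
# for lower GPU memory usage, which matters when logging many values.
trainer = pl.Trainer(
    gpus=1,                    # PL 1.0.x-era device selection flag
    move_metrics_to_cpu=True,  # flag introduced in #4592
)
```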
chaton 7e08b0d710
[bug-fix] DDP and automatic_optimization=False (#4485)
* resolve bug

* add self._running_manual_optim

* update

* update tests

* update lightning module

* resolve bug

* update tests

* update

* resolve pep8

* update

* replace by `ddp_spawn`

* temporary fix

* update

* update

* move update to training_loop

* make both ddp_spawn

* introduce `manual_optimizer_step`

* update changelog

* added changelog in the wrong place

* add force_optimizer_step

* update docstring for tests

* update optimizer_step

* update zero_grad

* resolve flake8

* move update into manual_optimizer_step

* add zero_grad

* remove zero_grad tests

* remove manual_backward in AMP, it doesn't help

* update

* loosen tests

* update

* update doc

* add TODO

* Removed unnecessary get model from native amp

* Remove try except with pytest raise

* Add seed, clean up imports, remove try catch to reproduce error

* update code

* update test

* revert back

* formatting

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-11-10 19:44:51 +00:00
chaton 9c8701f2e2
[feat] Logging refactor 2/n - train (#4495)
* update logging

* solve more bugs

* replace Mapping by Dict

* update on comments

* resolve pep8

* Apply suggestions from code review

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* typo

* update for coverage

* update test

* update

* Update tests/models/test_hooks.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Update tests/models/test_hooks.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* update on comments

* remove deepcopy

* remove useless lookup

* another small optim

* extra optim

* remove latest optim, can be a source of bugs

* resolve bug

* add docstring

* optimize coverage

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/logging_tests/test_distributed_logging.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/evaluation_loop.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/logging/test_logger_connector.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/logging_tests/test_train_loop_logging_1_0.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* update

* update on comments

* update parity speed

* get it down to 0.65

* update

* 0.8 max_dif

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-11-05 22:27:04 +00:00
Rohit Gupta 360b3d8844
Disable training when limit_train_batches=0 (#4371)
* Disable training when limit_train_batches=0

* chlog

* pep

* limit_train_batches

* BoringModel

Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
2020-11-03 12:10:35 +05:30
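As the title states, `limit_train_batches=0` now disables the training loop outright rather than iterating an empty schedule. A sketch:

```python
import pytorch_lightning as pl

# With this change, a zero limit means no training batches run at all
# (and, per the related PR below, no checkpoints are saved untrained).
trainer = pl.Trainer(limit_train_batches=0)
# trainer.fit(model)  # completes without executing any training_step
```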
Rohit Gupta ad2556b669
Disable saving checkpoints if not trained (#4372)
* Disable saving checkpoints if not trained

* chlog

* update test

* fix

Co-authored-by: chaton <thomas@grid.ai>
2020-11-03 11:38:32 +05:30
chaton 958aa1aee7
[test] Accumulated gradient optimization tests (#4477)
* adding tests

* wip

* update

* Update tests/trainer/test_trainer.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-02 23:44:11 +00:00
chaton ac3f7393fd
[FEAT] logging refactors 1/n (#4439)
* introducing new logging object

* typo

* typo

* Update pytorch_lightning/trainer/logging.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* Update pytorch_lightning/trainer/logging.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* update on comments

* update on comments

* add more docstrings

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* resolve on comments

* solve pyright

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* update on comments

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* update on comments

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
2020-11-02 20:51:43 +00:00
chaton 102fa9ee7d
[BUGFIX] AMP + Precision unscale grad (#4441)
* move unscale within Native plugin

* remove gradient tracking from lightning backward

* forgot trainer.fit

* typo

* update

* cleanup

* set to 1.6

* typo

* skip if below 1.6 strict

* update changelog

* remove useless code

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Update tests/plugins/test_amp_plugin.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* update changelog

* Update CHANGELOG.md

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Jeff Yang <ydcjeff@outlook.com>
2020-11-02 16:36:48 +00:00
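The fix's core idea follows the standard native-AMP recipe: unscale gradients before clipping so the clip threshold applies to true gradient magnitudes. A plain-PyTorch sketch (torch >= 1.6, matching the version gate in the commits above):

```python
import torch


def amp_training_step(model, optimizer, scaler, batch, clip_val=1.0):
    """Sketch of a native-AMP step with unscale-before-clip."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch).sum()
    scaler.scale(loss).backward()
    # Unscale first so clipping sees real gradient magnitudes; clipping
    # scaled gradients would apply the threshold at the wrong scale.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_val)
    scaler.step(optimizer)  # skips the update if gradients overflowed
    scaler.update()
```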
Justus Schock bbd81dfd55
Skips DDP parameter sync (#4301)
* ddp no-sync

* Update pytorch_lightning/trainer/training_loop.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update training_loop.py

* factor __enter__ and __exit__ out to separate context manager

* delete _updated_model_last_step

Co-authored-by: justusschock <justusschock@pc125.lfb.rwth-aachen.de>
Co-authored-by: Teddy Koker <teddy.koker@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-10-29 23:01:37 +05:30
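The mechanism behind this PR is `DistributedDataParallel.no_sync()`, which suppresses gradient all-reduce during intermediate accumulation steps. A standalone plain-PyTorch sketch of the idea (the model is assumed to already be DDP-wrapped; not the Lightning internals):

```python
import contextlib


def accumulated_steps(ddp_model, optimizer, batches, accumulate_grad_batches):
    """Sync gradients across ranks only on the last of every N steps."""
    optimizer.zero_grad()
    for i, batch in enumerate(batches, start=1):
        sync_now = i % accumulate_grad_batches == 0
        # Inside no_sync(), gradients accumulate locally with no
        # all-reduce; the sync happens on the step taken outside it.
        ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
        with ctx:
            loss = ddp_model(batch).sum() / accumulate_grad_batches
            loss.backward()
        if sync_now:
            optimizer.step()
            optimizer.zero_grad()
```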
Rohit Gupta b26c71eadf
Add optimizer hooks in callbacks (#4379)
* Add optimizer hooks in callbacks

* optimizer param

* update test

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-10-28 13:15:22 +01:00
chaton 3abfec8962
[HOTFIX] ModelCheckpoint - Don't increase current_epoch and global_step if not trained (#4291)
* add two tests w/wo tempdir

* resolve flake8

* this test is failing

* update bug report

* resolve bug and add test

* remove bug_report

* resolve flake8

* resolve bug

* resolve pep8

* resolve pep8

Co-authored-by: Teddy Koker <teddy.koker@gmail.com>
2020-10-23 11:17:50 +01:00
Sean Naren 9823f97a84
Protect functions that should not be accessed by the user (#4305) 2020-10-22 15:15:04 +01:00
Sean Naren 065cc94112
Fix bug comparing max_steps to global step which inits at 0 (#4278)
* Fix bug comparing max_steps to global step which inits at 0

* Added test to ensure accumulate grad batch works with max steps

* check fix with TODO test

* correct call counts

* Add check to ensure we've finished accumulation of this global step before exiting loop in conjunction with max steps

* Remove + 1 check in test as this was incorrect

* Update incorrect expected outputs in lr finder test

* Added brackets for clarity

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-22 13:58:59 +01:00
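The off-by-one at issue: `global_step` starts at 0 and counts completed optimizer steps, so after N steps it equals N and the stop condition must compare accordingly. A sketch of the corrected check (names illustrative, not the exact trainer internals):

```python
from typing import Optional


def should_stop(global_step: int, max_steps: Optional[int]) -> bool:
    # global_step is 0 before any optimizer step and equals N after N
    # completed steps, so stop once it reaches max_steps -- not before.
    return max_steps is not None and global_step >= max_steps
```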
Justus Schock 0ec4107697
Optimizer closure (#4190)
* closure for all optimizers

* rename hook and take care of alternating backwards

* add comment

* training_loop_fix

* closure whenever possible

* training_loop

* simple tests that count backward calls

* fix test to work with closure

* remove debugging statement

* better place

* check grads after backward

* start fixing manual optimization

* skip step when result returned by closure was None

* fix gradient clipping test to work with closure

* attribute dict result only for automatic optimization

* adjust backward calls in accelerator

* adjust where to call gradient clipping

* adjust backward calls in tests

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* pass kwargs to xla optimizer

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-21 19:34:29 +01:00
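The design mirrors PyTorch's own contract: `optimizer.step()` accepts an optional closure that re-evaluates the loss and runs backward, which optimizers such as LBFGS require. A plain-PyTorch sketch of that contract:

```python
def train_batch(model, optimizer, batch):
    def closure():
        optimizer.zero_grad()
        loss = model(batch).sum()
        loss.backward()
        return loss

    # All torch optimizers accept a closure; LBFGS calls it several
    # times per step, the others call it once and return its loss.
    return optimizer.step(closure)
```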
William Falcon 72f19768c8
remove duplicate metric vs step log for train loop (#4173)
* remove duplicate metric vs step log

* remove duplicate metric vs step log

* remove duplicate metric vs step log

* fix ddp index issue
2020-10-15 10:47:00 -04:00
Jirka Borovec f064682786
save initial arguments (#4163)
* save initial arguments

* typing

* chlog

* .
2020-10-15 08:30:49 -04:00
William Falcon bf2067a609
enabled manual returns (#4089) 2020-10-12 10:06:17 -04:00
William Falcon 0281b077d8
ref: decouple apex second attempt part 10/n (#4064)
* ref: decouple apex second attempt part 9/n

* ref: decouple apex second attempt part 9/n

* ref: decouple apex second attempt part 9/n
2020-10-10 20:05:05 -04:00
William Falcon 5ce9fc6bb3
ref: decouple apex second attempt part 7/n (#4061)
* ref: decouple apex second attempt part 7/n

* ref: decouple apex second attempt part 7/n

* ref: decouple apex second attempt part 7/n
2020-10-10 16:44:15 -04:00
William Falcon d1bbb449a3
ref: decouple apex second attempt part 5/n (#4058) 2020-10-10 14:35:25 -04:00
William Falcon ce2edf1192
ref: decouple apex second attempt part 4/n (#4056)
* ref: decouple apex second attempt part 4/n

* ref: decouple apex second attempt part 4/n

* Update lightning.py

* ref: decouple apex second attempt part 4/n
2020-10-10 12:19:22 -04:00
William Falcon 7285613974
ref: decouple apex second attempt part 2/n (#4054)
* ref: decouple apex second attempt part 2/n

* ref: decouple apex second attempt part 2/n
2020-10-10 10:24:20 -04:00
Nrupatunga fcfa587492
Bugfix/update trainer properties (#3975)
* make current_epoch and global_step the same as the trainer's after model restore.

* remove assignment here

* test

* minor modification

* merge with parent's master

* [bug-fix]: update trainer properties

* minor comment fix

* minor comment fix

* reset train loader in `on_train_epoch_start` hook

* makes sure the changes work

* minor change

* update changelog

* adding unit test for reload_dataloaders_every_epoch arg

* modified changelog, to add PR number

* revert imports

* changes to unit test

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-10-08 10:20:55 -04:00
William Falcon 048a816be3
added tests for the training epoch end (#3967) 2020-10-07 22:27:36 -04:00
William Falcon 4c0d063c86
outputs in __batch_end hooks (#3966)
* train_batch_end outputs

* added tests for the output hooks
2020-10-07 21:48:38 -04:00
William Falcon 2cf17a3718
Adds tests to make sure logging doesn't happen multiple times (#3899)
* Makes sure logging doesn't ever happen from non-root zero

* Makes sure logging doesn't ever happen from non-root zero

* Makes sure logging doesn't ever happen from non-root zero

* added bug report model

* fix local model

* fix local model

* fix local model

* fix local model
2020-10-06 12:43:51 -04:00
Teddy Koker 9600926619
Rename log_save_interval, row_log_interval (#3748)
* Rename row_log_interval -> log_every_n_steps
log_save_interval -> flush_logs_every_n_steps

* Changelog

* fixed title underline length

* typo

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* pep8 + deprecation test

* 'todo: remove in 1.1 comment'

* 1.1 -> 0.11

* log

* docs

* depr API

* add depr tests

* note

* miss

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-10-06 10:27:06 -04:00
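For reference, the rename maps the old Trainer arguments one-to-one onto the new ones; per the deprecation tests above, the old names are deprecated with a removal note rather than dropped immediately. A sketch:

```python
import pytorch_lightning as pl

# Before: pl.Trainer(row_log_interval=50, log_save_interval=100)
trainer = pl.Trainer(
    log_every_n_steps=50,          # was row_log_interval
    flush_logs_every_n_steps=100,  # was log_save_interval
)
```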
Nrupatunga 7d47ed178b
[Bug-Fix]: keep properties `current_epoch` and `global_step` always the same between model and trainer (#3785)
* make current_epoch and global_step the same as the trainer's after model restore.

* remove assignment here

* test

* minor modification

* Update pytorch_lightning/core/lightning.py

type check, better clarity

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/core/lightning.py

type check, better clarity

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* comments for current_epoch and global_step properties

* Update tests/models/test_restore.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update comments according to the changes made

* Update tests/models/test_restore.py

* add current_epoch, global_step to jit ignore list

* Add comments to CHANGELOG

* Update CHANGELOG.md

* Update tests/models/test_restore.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-05 11:10:40 -04:00
William Falcon b014223f72
Fixes #2678 - enables training_step to return None (#3862)
* Fixes #2678 - enables training_step to return None

* Fixes #2678 - enables training_step to return None
2020-10-05 07:33:46 -04:00
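In practice this means `training_step` can skip a batch by returning `None` instead of a loss. A minimal sketch (`compute_loss` is a hypothetical helper):

```python
import torch
import pytorch_lightning as pl


class SkippableModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        # Returning None skips this batch: no backward pass and no
        # optimizer step, rather than an error as before this fix.
        if not torch.isfinite(loss):
            return None
        return loss
```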
William Falcon d9656d166c
fixed model checkpoint frequency (#3852)
* fixed model checkpoint frequency

* fixed model checkpoint frequency

* fixed model checkpoint frequency

* fixed model checkpoint frequency

* merged
2020-10-04 21:49:20 -04:00
William Falcon d9bc95f83e
ref: bug fix with logging val epoch end + monitor (#3812)
* ref: fix metric err

* ref: fix metric err

* ref: fix metric err

* ref: merge

* ref: merge

* ref: merge

* ref: merge

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: decoupled ddp2

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix

* ref: clean up ddp before final fix
2020-10-03 12:33:29 -04:00
Jeff Yang 9942f3ebdf
Fix `on_train_batch_start` hook to end epoch early (#3700)
* init

* add test

* changelog and docs

* fix test

* Apply suggestion from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-10-02 21:46:46 +02:00
William Falcon ac2b0f0f06
ref: continue #3733 (#3767)
* ref: #3733 part 2

* ref: #3733 part 2
2020-10-01 09:25:33 -04:00
William Falcon a38d108a68
add dist lib to enable syncing anything across devices (#3762)
* add dist lib to enable syncing anything across devices
2020-10-01 01:21:38 -04:00
Adrian Wälchli 9405c880af
log/save_interval based on global step (#3667)
* log interval based on global step

* test

* test

* test

* test

* pep

* pep

* added changelog

* pep

* merge

* remove unused arg
2020-09-30 12:26:27 +02:00
William Falcon cdd7266cd8
ref: enable self.log from val step (#3701)
* .log in eval

* ref

* ref: enable self.log in val step
2020-09-28 10:49:07 -04:00
William Falcon ddd11075bd
[WIP] ref: deprecated results obj, added support for simpler comms (1/n) (#3681)
* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* ref: deprecated results obj, added support for simpler comms. Decouples logging from loops

* fix global step err

* fix global step err

* fix global step err

* fix global step err

* fix global step err

* fix typing err

* fix str

* fix typing err
2020-09-27 23:19:46 -04:00
Adrian Wälchli f37e9e8a83
Fix global step increment on training_epoch_end (#3673)
* fix

* fix global step err

* fix global step err

* fix global step err

* fix global step err

* fix global step err

* fix global step err

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-09-27 20:19:51 -04:00
Carlos Mocholí ed12e422a4
Fix incorrect "Saving latest checkpoint" warning (#3588)
* Fix incorrect "Saving latest checkpoint" warning

* Replace warning with info. Run PyCharm's optimize imports

* Remove unused class variable. Refactor logic. Improve test

* Fix De Morgan's
2020-09-25 14:18:06 +02:00
William Falcon 21cfdf6874
ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax) (#3571)
* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* ref: result 1/n (make monitor default to checkpoint_on to simplify result syntax)

* force crash when max_epochs < epochs in a checkpoint

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-09-20 22:58:43 -04:00
Adrian Wälchli 4ed96b2eb4
fix gradient norm tracking for row_log_interval > 1 (#3489)
* fix + test

* changelog

* Apply suggestions from code review

Co-authored-by: Tim Chard <timchard@hotmail.com>

* improve test

Co-authored-by: Tim Chard <timchard@hotmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2020-09-15 18:41:27 +02:00
William Falcon 09679aee32
Silenced some warnings. Verified DDP refactors (#3483)
* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* ref: ddp verify

* Update ddp_base_backend.py
2020-09-13 21:10:37 -04:00
William Falcon 59d8472548
ref: slurm connector 1/n (#3476)
* ref: slurm connector 1/n

* ref: slurm connector 1/n

* ref: slurm connector 1/n

* ref: slurm connector 1/n
2020-09-12 11:07:15 -04:00
William Falcon 4724cdf5e0
ref: checkpoint connector methods 3/n 2020-09-12 07:05:21 -04:00
William Falcon ef20310873
ref: move specific accelerator code x/n (#3457)
* ref: organize args x/n

* ref: move specific accelerator code x/n

* ref: move specific accelerator code x/n

* ref: move specific accelerator code x/n
2020-09-11 10:56:21 -04:00
William Falcon 70af47db84
ref: organize args 4/n (#3456) 2020-09-10 21:58:47 -04:00
Rohit Gupta a1ea681c47
Fix batch_outputs with optimizer frequencies (#3229)
* Fix batch_outputs with optimizer frequencies

* optimizers

* fix batch_outputs with optimizer frequencies

* clean test

* suggestion

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

* failing doctest

* failing doctest

* update doctest

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-09-10 23:01:20 +02:00
William Falcon 3281586ab4
ref: organize args 3/n (#3449)
* ref: organize args 3/n

* ref: organize args 3/n

* ref: organize args 3/n

* ref: organize args 3/n

* ref: organize args 3/n

* ref: organize args 3/n
2020-09-10 13:21:04 -04:00