lightning

Commit Graph

Author	SHA1	Message	Date
William Falcon	9a503de6af	Replace docs gifs with videos snippets so user can play at own speed (#2966 ) * update docs	2020-08-13 18:52:47 -04:00
Jeff Yang	07c023c32f	fix(docs): docstring for amp_backend (#2960 ) * fix(docs): docstring for amp_backend * fix(docs): early_stop_checkpoint -> early_stop_callback * docs Co-authored-by: ananyahjha93 <ananya@pytorchlightning.ai>	2020-08-13 23:25:56 +02:00
SiddhantRanade	88bfed371e	Fix enforce_datamodule_dataloader_override() for iterable datasets (#2957 ) This function has the if statement `if (train_dataloader or val_dataloaders) and datamodule:`. The issue is similar to that in https://github.com/PyTorchLightning/pytorch-lightning/pull/1560. The problem is that the `if(dl)` translates to `if(bool(dl))`, but there's no dataloader.__bool__ so bool() uses dataloader.__len__ > 0. But... dataloader.__len__ uses IterableDataset.__len__ for IterableDatasets for which __len__ is undefined. The fix is also the same, the `if dl` should be replaced by `if dl is not None`. Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>	2020-08-13 23:06:17 +02:00
William Falcon	2c935d048e	track batch size (#2954 )	2020-08-13 12:40:54 -04:00
Jirka Borovec	4354690e55	add apex test (#2921 ) * add apex test * rename * level * events * wrap * evt * miss * apex * apex * apex * apex * apex * apex * Update tests/models/test_amp.py Co-authored-by: William Falcon <waf2107@columbia.edu> * notes * notes Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-08-13 10:03:13 -04:00
William Falcon	6c5a0a172f	Resultd (#2947 ) * updated docs	2020-08-13 09:58:05 -04:00
Jirka Borovec	519b97effd	Clean save (#2933 ) * thr deterministic=True * clean * clean * Apply suggestions from code review Co-authored-by: Vadym Stupakov <vadim.stupakov@gmail.com> * Apply suggestions from code review Co-authored-by: Vadym Stupakov <vadim.stupakov@gmail.com>	2020-08-13 07:26:33 -04:00
William Falcon	a46130cdc1	add weighted average to results obj (#2930 ) * track batch size in result obj	2020-08-12 08:02:00 -04:00
Brendan Fahy	56396abe98	fix checkpointing to remote file paths (#2925 )	2020-08-12 06:31:17 -04:00
William Falcon	d13e5c9e53	document lightiningmodule better (#2920 ) * updated docs	2020-08-11 19:39:43 -04:00
Brendan Fahy	97e6f35b34	fix missing return statement. Do not normalize remote paths (#2894 ) * fix missing return statement. Do not normalize remote paths * Update pytorch_lightning/utilities/cloud_io.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Add some documentation that we now support s3 and hdfs paths * suggestion from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>	2020-08-09 22:38:43 +00:00
Uladzislau Sazanovich	e9846dd758	Add tracking of basic states in Trainer [wip - to-be-merged after v0.9] (#2541 ) * Add initial tracking of states in Trainer. * Add INTERRUPTED state, improve tests, move state switching from callback to a trainer. * Move part of a trainer state switching to a decorator. * Add documentation. * Fix docs, rename state enum, restore state to previous on exit if None, add tests for decorator only. * Fix callback typing. Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-08-09 06:24:09 -04:00
Brendan Fahy	6e77181ec7	Squashed commit of the following: (#2164 ) commit 29fb0506cd38a15c359e369cc8bc4435916b0c78 Author: Brendan Fahy <bmfahy@gmail.com> Date: Sat Aug 8 19:35:30 2020 +0000 fix checking for version for docs to build commit 467fd640db02275972c7111af031c86bb59333e9 Author: Brendan Fahy <bmfahy@gmail.com> Date: Sat Aug 8 18:56:05 2020 +0000 remove no local test commit a7cc9f88de00feec1a5406874d05313c42bd004c Author: Brendan Fahy <bmfahy@gmail.com> Date: Sat Aug 8 18:46:44 2020 +0000 fix commit 3fdbb729da79ae9348c83410a138666bad467951 Author: Brendan Fahy <bmfahy@gmail.com> Date: Sat Aug 8 18:23:30 2020 +0000 revert requirements commit 9b8686bd83e2bc243cf329e26f1c667c6949cf67 Author: Brendan Fahy <bmfahy@gmail.com> Date: Sat Aug 8 18:16:42 2020 +0000 make it a fixture commit eec74953d24c8b25268d3b6dde3cc4affdd5cb8f Author: Brendan Fahy <bmfahy@gmail.com> Date: Sat Aug 8 18:01:32 2020 +0000 fix up the testing commit 896d94a0e60083d52c81db2a036b7f1e015cad11 Author: Brendan Fahy <bmfahy@gmail.com> Date: Sat Aug 8 17:47:28 2020 +0000 fix some tests commit 6d22bde19767bf2b71dfd44839b01efdf6888f83 Merge: 6175d4e2 `6ebe0d72` Author: Brendan Fahy <bmfahy@gmail.com> Date: Sat Aug 8 10:20:47 2020 +0000 Merge remote-tracking branch 'origin/master' into tb_use_gfile commit 6175d4e26b15a43c412c26d501762cd0b570616a Author: Brendan Fahy <bmfahy@gmail.com> Date: Fri Aug 7 10:16:36 2020 +0000 Use tensorboard.compat.gfile to support remote writing	2020-08-09 06:08:44 -04:00
William Falcon	256059a1d0	tracks all outputs including TBPTT and multiple optimizers (#2890 ) * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update * pl 0.9 update	2020-08-09 06:00:15 -04:00
Adrian Wälchli	1bb268ad8a	Clarify what gpus=0 means in docs (#2876 ) * docs clarify what gpus=0 means * add example suggested by @ydcjeff	2020-08-08 11:50:08 -04:00
Adrian Wälchli	f798cffd02	save last model after saving top_k when save_last=True (#2881 ) * save_last should be last * changelog * seed, docs * retrigger ci * compare filenames * move constants * fix test * epoch, global step * improve test	2020-08-08 06:02:43 -04:00
Jirka Borovec	a6e7aa7796	allow using apex with any PT version (#2865 ) * wip * setup * type * name * wip * docs * imports * fix if * fix if * use_amp * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * fix tests * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * fix tests * todos Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-08-08 11:07:32 +02:00
Santiago Castro	fed0ac838b	Fix Trainer arg name in docs (#2879 ) * Fix Trainer arg name in docs * Fix a PR comment	2020-08-08 07:52:35 +02:00
Jirka Borovec	b7d72706c3	clean imports (#2867 ) * clean imports * miss	2020-08-08 00:33:51 +02:00
Jirka Borovec	f8c058215f	simplify tests & cleaning (#2588 ) * simplify * tmpdir * revert * clean * accel * types * test * edit test acc Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update test acc Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-08-07 23:22:05 +02:00
Iz Beltagy	2cc60c625e	fix set_epoch on TPUs (#2740 ) * fix https://github.com/PyTorchLightning/pytorch-lightning/issues/2622 * Update training_loop.py	2020-08-07 09:31:30 -04:00
William Falcon	f82d7feb6c	updated hooks (#2850 ) * modified hooks * modified hooks * modified hooks * modified hooks * modified hooks * modified hooks * modified hooks * modified hooks * modified hooks	2020-08-07 09:29:57 -04:00
ananthsub	b39f4798a6	Add support to Tensorboard logger for OmegaConf hparams (#2846 ) * Add support to Tensorboard logger for OmegaConf hparams Address https://github.com/PyTorchLightning/pytorch-lightning/issues/2844 We check if we can import omegaconf, and if the hparams are omegaconf instances. if so, we use OmegaConf.merge to preserve the typing, such that saving hparams to yaml actually triggers the OmegaConf branch * avalaible * chlog * test Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>	2020-08-07 09:13:21 -04:00
Rohit Gupta	a642349228	Support limit_mode_batches (int) for infinite dataloader (#2840 ) * Support limit_mode_batches(int) for infinite dataloader * flake8 * revert and update * add and update tests * pep8 * chlog * Update CHANGELOG.md Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Add suggestions by @awaelchli * docs * Apply suggestions from code review Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> * Apply suggestions from code review * fix * max * check * add and update tests * max * check * check * check * chlog * tests * update exception message * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>	2020-08-07 13:02:36 +02:00
Nima Sarang	793036d29c	Support returning python scalars in DP (#1935 ) * Override the default gather method to support scalars * add computing average of a list * bug: change if to elif * add some tests * change style * change documentation * use apply_to_collection in DP gather * use apply_to_collection in DP gather * fix warning msg * override gather method in DP * add tests for python scalars * add python scalars to docstring * Update message * override gather method in DP * formatting * chlog Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>	2020-08-07 09:18:29 +02:00
Nicki Skafte	9a402461da	Bugfix: Lr finder and hparams compatibility (#2821 ) * fix hparams lr finder bug * add tests for new functions * better tests * fix codefactor * fix styling * fix tests * fix codefactor * Apply suggestions from code review * modified hook Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>	2020-08-07 00:34:48 +02:00
William Falcon	b507c42c47	clarify batch hooks (#2842 ) * modified hook * modified hook * modified hook * modified hook * modified hook * modified hook * modified hook * modified hook * modified hook * modified hook * modified hook * modified hook * modified hook	2020-08-05 20:01:30 -04:00
Ananya Harsh Jha	a5f2b89ed0	updated sync bn (#2838 ) * updated sync bn * updated sync bn * updated sync bn * updated sync bn * updated sync bn * updated sync bn * updated sync bn * updated sync bn * added ddp_spawn test * updated test * clean * clean Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>	2020-08-06 01:12:11 +02:00
William Falcon	5d0f0325d8	Revert "Support limit_mode_batches (int) for infinite dataloader" (#2839 ) * Revert "Support limit_mode_batches (int) for infinite dataloader (#2787)" This reverts commit `de9c9f0864`. * Update training_tricks.py	2020-08-05 15:57:26 -04:00
Ruotian(RT) Luo	bef27c58ed	save apex scaler states (#2828 )	2020-08-05 13:43:50 -04:00
Ruotian(RT) Luo	6034d5e37d	fix apex gradient clipping (#2829 )	2020-08-05 13:42:21 -04:00
Ananya Harsh Jha	e31c520c21	add support for sync_bn (#2801 ) * initial commit for sync_bn * updated changelog * tests * tests * ddp tests hanging with script tests * updated trainer * updated params * test * passingtests * passing tests * passing tests * passing tests * tests * removed apex * doc * doc * doc * doc * docs * tests * tests * tests	2020-08-05 13:29:05 -04:00
Rohit Gupta	de9c9f0864	Support limit_mode_batches (int) for infinite dataloader (#2787 ) * Support limit_mode_batches(int) for infinite dataloader * flake8 * revert and update * add and update tests * pep8 * chlog * Update CHANGELOG.md Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Add suggestions by @awaelchli * docs * Apply suggestions from code review Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> * Apply suggestions from code review * fix * max * check Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>	2020-08-05 17:04:49 +00:00
Nicki Skafte	b2a7d7580c	Docs for auto_select_gpu (#2836 ) * added docs * Update docs/source/multi_gpu.rst Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * testcode change to example Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>	2020-08-05 12:28:33 +00:00
Nathan Raw	1c4244e1ff	🐛 fix dm prepare_data call (#2811 )	2020-08-03 19:39:01 -04:00
Rohit Gupta	6b9c548bab	docs update and follow up of #2789 (#2797 ) * docs update and follow up of #2789 * pep8 * Update trainer.py * Update trainer.py Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>	2020-08-03 19:57:21 +00:00
Vlad Lialin	ed8a01afb0	More clear docstring for `val_check_interval` (#2802 ) * More clear docstring for `val_check_interval` * Update trainer.py	2020-08-03 09:13:05 -04:00
William Falcon	a0c4365278	Gpu idx (#2796 ) * ddp refactor	2020-08-02 08:13:31 -04:00
Jirka Borovec	b01ad75700	missing chlogs (#2672 ) * missing * miss * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * miss * note * notes * update CI testing with pip upgrade (#2380) * try pt1.5 * cpu * upgrade * tpu * user * [blocked by #2380] freeze GPU PT 1.4 (#2780) * freeze * user Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-08-02 12:34:36 +02:00
bkhakshoor	96eb6ebacd	fix shell injection vulnerability in subprocess call (#2786 )	2020-08-01 23:25:57 -04:00
Rohit Gupta	8baec1a191	Fix shuffle for distributed sampler (#2789 ) * Fix shuffle for distributed sampler * add test * test * chlog * update test * update test * update test * assertions via callback * define callback outside for pickling * skip ddp test on windows Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-08-01 23:22:57 -04:00
Iz Beltagy	38fce2ea68	fix selecting GPUs using CUDA_VISIBLE_DEVICES (#2739 ) * fix https://github.com/PyTorchLightning/pytorch-lightning/issues/2407 * Update pytorch_lightning/trainer/distrib_data_parallel.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-08-01 23:21:15 -04:00
Nathan Raw	036bcea499	Call DataModule hooks implicitly in trainer (#2755 ) * ✨ call dm hooks in trainer implicitly * ✅ update tests * 📝 remove unused stage arg from dm docs * ✅ update tests * ✅ update tests * 🚧 include stage in datamodule.setup * 📝 docs * 📝 docs * added more dm tests * added more dm tests * 🐛 call dm.setup everywhere * 🔥 pickle tests now implied by accelerator tests * 🎨 set dm as attr of trainer * 🐛 . * 🚧 wip * add can prepare test * add can prepare test * verified setup in fit * fixed setup call * fixed setup call * fixed setup call Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-08-01 20:17:57 -04:00
siahuat0727	78a07e5f2d	Fix doc typo (#2773 )	2020-07-31 07:42:47 -04:00
Jirka Borovec	06e8910f06	pytorch 1.6 (#2745 ) * pt 1.6 * don't use the new zipfile serialization for now * quick flake8 fixes * remove unnecessary f * coalesce strings * remove comma * remove extra commas * Apply suggestions from code review Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com> * set _use_new_zipfile_serialization to False only for pytorch 1.6.0 * remove unnecessary comments * flake8 fixes * use pkg_resources instead of packaging * readme * format * version * chlog Co-authored-by: Peter Yu <peter@asapp.com> Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>	2020-07-31 11:18:32 +02:00
Jirka Borovec	949734489a	remove deprecated in v0.9 (#2760 ) * remove deprecated in v0.9 * data_loader * import * hook * args	2020-07-30 23:19:28 +02:00
Phil	2f0fb34496	Speed up gradient clipping and allow parameters on multiple devices. (#2767 ) The speed up is achieved by: - Moving the "where" out of the loop (and replacing with min for simplicity). - Replacing manual sum and pow with torch.norm. Even though this results in unnessecary computation (computing pow(root)) this is still a lot faster. - Preallocating the output gives a slight speed up. Note that calling .to for all parameters results in a small speed penalty (~4 ms in my case) but allows parameters on different devices. Overall this reduces the time used for gradient clipping from 206ms to 74 ms for my model (Resnet50 + few additional vars, all vars on GPU).	2020-07-30 11:53:24 -04:00
Tejasvi S Tomar	8ab5bcda3d	Misleading exception raised during batch scaling (#2223 ) * Misleading exception raised during batch scaling Use batch_size from `model.hparams.batch_size` instead of `model.batch_size` * Improvements considering #1896 * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-07-29 18:47:11 -04:00
Peter Yu	b7f613ba6d	Correct CWD for ddp subprocesses when using Hydra (#2719 ) * when hydra is enabled, set the cwd of subprocesses to the original cwd for ddp * move imports up * clean up imports	2020-07-28 16:33:28 -04:00
Iz Beltagy	c047676fae	fix https://github.com/PyTorchLightning/pytorch-lightning/issues/2635 (#2738 )	2020-07-28 16:29:46 -04:00
Stas Bekman	2bd39c66af	make the error message readable (#2729 ) * make the error message readable make the error message readable by adding spaces, fixing a type "his -> this", * cleanup * Update pytorch_lightning/trainer/auto_mix_precision.py * Apply suggestions from code review Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>	2020-07-28 16:28:22 -04:00
Jirka Borovec	0fe933e23d	fixing TPU tests (#2632 ) * init * rename * tpu_core_idx * idx 8 * idxs * @pl_multi_process_test * assert * assert * deamon * no close * imort * msg * use_single_gpu * dataset * idx * fix idx * dataset * format * add pickable * typo * apex * typo * wip * wip * wip * wip * wip * wip * wip * wip * docs * typo * tests * tests * tests * tests * tests * tests * tests * tests * tests * tests * tests * tests * tests * tests * tests * tests * tests * docs * docs * Apply suggestions from code review Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Apply suggestions from code review Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> * docs * Apply suggestions from code review Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>	2020-07-27 19:07:09 -04:00
Rohit Gupta	84c507c4df	Fix max_batches with fast_dev_run. (#2581 ) * Fix fast_dev_run to run for all val_dataloaders * fast_dev_run check * changelog * explicit * limit_batches with fast_dev_run in init * add test * whitespace and comment fix * comment and assertion * added tests * Fix fast_dev_run to run for all val_dataloaders * fast_dev_run check * changelog * explicit * limit_batches with fast_dev_run in init * add test * whitespace and comment fix * comment and assertion * added tests * added tests * added tests * added tests * update rtol * Revert "update rtol" This reverts commit `4320329540`. * added tests Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-07-27 17:56:55 -04:00
Adrian Wälchli	d03953260d	Fix weights_save_path when logger is used + simplify path handling + better docs (#2681 ) * fix weights_save path and drop ckpt_path * add tests * unused import * update docs * changelog * pep8 * fix horovod test * make backward compatible * perform same test for all loggers * fix for when logger=False and weights_save_path is set * update changelog * update docs * update tests * do not set save dir dynamically * remove duplicate test * remove duplicated tests * update tests * update tests * remove remaining ckpt_path references * move defaults to init as suggested by @Borda * test deprecation	2020-07-27 12:53:11 -04:00
William Falcon	afd7f67047	cpu backend (#2712 ) * cpu backend * cpu backend * cpu backend	2020-07-25 22:55:09 -04:00
William Falcon	0d96b2698a	refactor 4 (#2711 ) * refactor ddp_spawn	2020-07-25 22:06:18 -04:00
William Falcon	4dbd761a1c	refactor 3/n (#2709 ) * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator	2020-07-25 20:56:50 -04:00
William Falcon	b34217e410	Refactor 2/n (#2708 ) * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator	2020-07-25 17:31:34 -04:00
William Falcon	071e09fe38	refactor 1/n for v1.0.0 (#2704 ) * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator * reactor into gpu accelerator	2020-07-25 14:38:51 -04:00
Nathan Raw	9076551aec	Enable val/test loop disabling + datamodule tests (#2692 ) * 🎨 warn instead of error out on loaders * 🐛 test misconfiguration should still fail * 🚧 . * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj * updated docs with new result obj Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-07-25 12:57:40 -04:00
Rohit Gupta	cb0c6ad51a	fix setup call while testing (#2624 ) * fix setup call while testing * changelog * drop if condition * add test to check setup call * flake8 * update test to check model stage Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-07-24 23:57:31 -04:00
Rohit Gupta	8599b670c5	Fix logging interval (#2694 )	2020-07-24 20:59:00 -04:00
Nathan Raw	1caf8beb2c	Datamodule (#2668 ) * ✨ Add copy of pl_bolts datamodule to lightning * ✨ add datamodule to necessary init files * 🚧 add datamodule property to LightningModule * 🚧 . * 🎨 Let DataModule do its own thing * 🚧 add back setup and run both hooks implicitly * 🚧 . * 🐛 fix add_argparse_args * 💄 apply black formatting and isort * 📝 docstrings * 📝 . * 📝 . * 🐛 overwrite cls prepare_data instead of instance * 📝 . * ✅ add some tests * Update datamodule.py * Update datamodule.py * Update datamodule.py Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-07-24 11:42:15 -04:00
Ananya Harsh Jha	6780214b27	quick-fix for --gpus flag bug (#2674 ) * quick-fix for --gpus flag bug * warning added * warning added * set on_gpu using data_parallel_device_ids * self.on_gpu repositioned Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>	2020-07-24 08:26:05 +00:00
Travis Addair	1369012bc7	Horovod: adjust base LR used by schedulers to scale with the number of workers (#2626 ) * Horovod: Adjust base LR used by schedulers to match that of the optimizer after scaling by number of workers * Added unit test * Removed debug statements * Updated changelog * Apply suggestions from code review Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-07-23 12:14:57 -04:00
Adrian Wälchli	1e68968ed7	support num_sanity_val_steps=-1 (#2246 ) * support sanity_val_step=-1 * fix list size * simplification * simplify * add test for num_sanity_val_steps=-1 * update test * update docs * extend tests to multiple dataloaders * changelog * Update tests/trainer/test_trainer.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * improve test * refactor the sanity check decision * fix merge * Update trainer.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-07-23 07:07:03 -04:00
William Falcon	62ce00f96c	EvalResult support for val loop (PR 3/5) (#2651 ) * add EvalResult to support to val/test loops	2020-07-22 13:53:10 -04:00
William Falcon	6d10ac2ac8	Structured results (train loop only. val loop separate PR) (PR 2/5) (#2615 ) * r * r * r * patched optimizer closure with sr * patched optimizer closure with sr * patched optimizer closure with sr * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added train step structured result * added autoreduce for train step * added auto reduce on train * added auto reduce on train * added auto reduce on train * added auto reduce on train * added auto reduce on train * added auto reduce on train * added hooks * added hooks * added hooks * added hooks * added hooks * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * cache * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * Update pytorch_lightning/callbacks/early_stopping.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/callbacks/early_stopping.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/callbacks/early_stopping.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/callbacks/model_checkpoint.py * Update pytorch_lightning/core/step_result.py * finished tests for structured results on train epoch * finished tests for structured results on train epoch * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> * simple * finished tests for structured results on train epoch * simple * simple * revert * finished tests for structured results on train epoch * finished tests for structured results on train epoch * Update tests/base/deterministic_model.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * finished tests for structured results on train epoch * docstring typos * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * finished tests for structured results on train epoch * Update pytorch_lightning/core/step_result.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/overrides/data_parallel.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Jirka <jirka@pytorchlightning.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>	2020-07-20 19:00:20 -04:00
Jaesun Park	d14d3b2a34	Fix typo (#2646 )	2020-07-20 14:13:56 -04:00
ananthsub	ed581eb64f	Fix local rank zero casting (#2640 ) * Fix local rank zero casting The environment variable 'LOCAL_RANK' can be a string, causing the `if rank_zero_only.rank == 0` check to fail * Update distributed.py address comment	2020-07-18 20:12:06 -04:00
William Falcon	aaa1553e35	tests for val loop flow (#2605 ) * add tests for single scalar return from training * add tests for single scalar return from training * add tests for single scalar return from training * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only * fixing val step only	2020-07-14 14:20:45 -04:00
William Falcon	1d565e175d	add tests for single scalar return from training (#2587 ) * add tests for single scalar return from training * add tests for single scalar return from training * add tests for single scalar return from training * add tests for single scalar return from training * add tests for single scalar return from training	2020-07-11 17:43:00 -04:00
William Falcon	e068af9ea8	Ampt (#2572 ) * remove grad scaling tpu * remove grad scaling tpu * remove grad scaling tpu * remove grad scaling tpu * remove grad scaling tpu * remove grad scaling tpu * remove grad scaling tpu * remove grad scaling tpu * remove grad scaling tpu	2020-07-09 21:28:11 -04:00
William Falcon	f35337adba	Fixes .test() for ddp (#2570 ) * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint	2020-07-09 18:36:36 -04:00
William Falcon	b73812648f	don't pass tpu weights back on test (#2566 ) * enable none checkpoint * enable none checkpoint	2020-07-09 12:11:56 -04:00
William Falcon	4bbcfa04a3	.fit() returns last not best weights in ddp_spawn (#2565 ) * added base tests for tpu * added base tests for tpu * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint * enable none checkpoint	2020-07-09 11:36:21 -04:00
Jan Sellner	e1bc208f66	Fix default value in documentation (#2452 )	2020-07-09 07:16:50 -04:00
Hayden Housen	992a7e2a41	Start accumulate gradients schedule at epoch 0 (continued) (#2513 ) * Start accumulate gradients schedule at epoch 0 * Undo change in #2375 * Update test_trainer.py::test_gradient_accumulation_scheduling * Fix pep8 formatting * Remove 'Datasets/' folder * Split args for readability * Fix pep8 formatting	2020-07-09 07:11:07 -04:00
Espen Haugsdal	b3ebfec863	Fix argparse default value bug (#2526 ) * Add failing test for bug * Fix bug	2020-07-09 07:10:30 -04:00
William Falcon	11069c8784	Fix ddp tests + .test() (#2512 ) * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * fix deprecation warnings * added base tests for tpu * added base tests for tpu * Update pytorch_lightning/trainer/trainer.py Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com> * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu * added base tests for tpu Co-authored-by: Jirka <jirka@pytorchlightning.ai> Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com>	2020-07-07 12:24:56 -04:00
Jeremy Jordan	a91b06ed1e	fix worker warning (#2504 ) * fix worker warning * improve tests * suggestion Co-authored-by: Jirka <jirka@pytorchlightning.ai>	2020-07-06 15:45:43 +02:00
vr140	96b32bee04	[tiny] Fix training_dataloader usage to be train_dataloader instead. (#2521 ) Co-authored-by: Vijay Rajaram <vrajaram3@gatech.edu>	2020-07-06 10:44:44 +02:00
William Falcon	9924c76faa	Amp2 (#2505 ) * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang * fix tpu hang	2020-07-04 22:52:49 -04:00
Jirka Borovec	fc61c200c0	DDp interpreter (#2482 ) * interpreter * chlog	2020-07-03 13:23:30 -04:00
William Falcon	0697dd306d	Fixes #2455 (#2463 )	2020-07-02 07:18:58 -04:00
William Falcon	afdfba1dc6	removed auto val reduce (#2462 )	2020-07-02 07:04:18 -04:00
Adrian Wälchli	927f305f7e	Warn user when IterableDataset has __len__ defined (#2437 ) * add warning when getting checking len * added test * changelog * pep * do not show warning below 1.4 * try version parse * comments * xfail * Update requirements/base.txt Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/trainer/data_loading.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * version Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka <jirka@pytorchlightning.ai>	2020-07-01 07:53:19 -04:00
William Falcon	325852c6df	enabled no returns from eval (#2446 ) * enabled no returns from eval * fixed docs * fixed docs * fixed docs * fixed docs * fixed docs * fixed docs * fixed docs * fixed docs * fixed docs * fixed docs * fixed docs * fixed docs	2020-07-01 07:38:00 -04:00
Jirka Borovec	e268061614	Pure package & base tests (#2418 ) * base tests * pil * wip * wip * wip * ignore * ignore * win * link * win * cpu * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-06-30 19:35:54 -04:00
Adrian Wälchli	145670f893	fix logging on rank 0 only (#2425 ) * fix and test for ddp block logging rank > 0 * rename * use the dummy logger * dummy logger test * set the logger in model * decorator for rank zero experiment * simplify check * simplify * fix problem with None in checkpoint path * revert configure logger * unused import * offline * try rank 0 decorator in checkpoint * try fix test * imgs * add asserts to make sure log zero only saves checkpoints * add asserts to make sure log zero only saves checkpoints * add asserts to make sure log zero only saves checkpoints * add asserts to make sure log zero only saves checkpoints * add asserts to make sure log zero only saves checkpoints * fix tpu tests * fix tpu tests Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-06-30 18:09:16 -04:00
William Falcon	04e68f022f	fix tpu tests	2020-06-30 17:20:35 -04:00
William Falcon	fc26078e39	fix tpu tests	2020-06-30 17:20:18 -04:00
Oliver Neumann	1a54ed6ad9	Checking ipywidgets is installed for ensure tqdm working (#2417 ) * Adding importing ipywidgets before importing tqdm.auto to make sure ipywidgets is installed. * Updated CHANGELOG.md * Updated ipywidgets importing checks to @awaelchli comments. Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-06-30 16:59:35 -04:00
William Falcon	309ed75c5d	added reduce ddp results on eval (#2434 ) * added reduce ddp results on eval * added reduce ddp results on eval * added reduce ddp results on eval * added reduce ddp results on eval * added reduce ddp results on eval * added reduce ddp results on eval * added reduce ddp results on eval * added reduce ddp results on eval * added reduce ddp results on eval * added reduce ddp results on eval * added reduce ddp results on eval	2020-06-30 16:15:35 -04:00
William Falcon	e8bb4165b7	Fix apex scaling with decoupled backward (#2433 ) * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs * fix outputs	2020-06-30 14:51:39 -04:00
William Falcon	a42a0e16dd	Fixes train outputs (#2428 ) * fix outputs * fix outputs	2020-06-30 10:03:49 -04:00
William Falcon	593837e1da	fix amp wrong call	2020-06-29 06:46:19 -04:00
Adrian Wälchli	25ee51bc57	Continue Jeremy's early stopping PR #1504 (#2391 ) * add state_dict for early stopping * move best attr after monitor_op defined * improve early stopping and model checkpoint callbacks * fix formatting * fix attr init order * clean up setting of default_root_dir attr * logger needs default root dir set first * reorg trainer init * remove direct references to checkpoint callback * more fixes * more bugfixes * run callbacks at epoch end * update tests to use on epoch end * PR cleanup * address failing tests * refactor for homogeneity * fix merge conflict * separate tests * tests for early stopping bug regressions * small fixes * revert model checkpoint change * typo fix * fix tests * update train loop * cannot pass an int as default_save_path * refactor log message * fix test case * appease the linter * fix some doctests * move config to callback * fixes from rebase * fixes from rebase * chlog * docs * reformat * formatting * fix * fix * fixes from rebase * add new test for patience * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update tests/callbacks/test_early_stopping.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * fix formatting * remove enable_early_stop attribute * add state_dict for early stopping * move best attr after monitor_op defined * improve early stopping and model checkpoint callbacks * fix formatting * fix attr init order * clean up setting of default_root_dir attr * logger needs default root dir set first * reorg trainer init * remove direct references to checkpoint callback * more fixes * more bugfixes * run callbacks at epoch end * update tests to use on epoch end * PR cleanup * address failing tests * refactor for homogeneity * fix merge conflict * separate tests * tests for early stopping bug regressions * small fixes * revert model checkpoint change * typo fix * fix tests * update train loop * fix test case * appease the linter * fix some doctests * move config to callback * fixes from rebase * fixes from rebase * chlog * docs * reformat * formatting * fix * fix * fixes from rebase * add new test for patience * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update tests/callbacks/test_early_stopping.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * fix formatting * remove enable_early_stop attribute * fix test with new epoch indexing * fix progress bar totals * fix off by one error (see #2289) epoch starts at 0 now * added missing imports * fix hpc_save folderpath * fix formatting * fix tests * small fixes from a rebase * fix * tmpdir * tmpdir * tmpdir * wandb * fix merge conflict * add back evaluation after training * test_resume_early_stopping_from_checkpoint TODO * undo the horovod check * update changelog * remove a duplicate test from merge error * try fix dp_resume test * add the logger fix from master * try remove default_root_dir * try mocking numpy * try import numpy in docs test * fix wandb test * pep 8 fix * skip if no amp * dont mock when doctesting * install extra * fix the resume ES test * undo conf.py changes * revert remove comet pickle from test * Update CHANGELOG.md Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update weights_loading.rst * Update weights_loading.rst * Update weights_loading.rst * renamed flag * renamed flag * revert the None check in logger experiment name/version * add the old comments * _experiment * test chckpointing on DDP * skip the ddp test on windows * cloudpickle * renamed flag * renamed flag * parentheses for clarity * apply suggestion max epochs Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jeremy Jordan <jtjordan@ncsu.edu> Co-authored-by: Jirka <jirka@pytorchlightning.ai> Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-06-28 21:36:46 -04:00
William Falcon	66ffbaddf5	updates teardown to account for ddp (#2389 ) * remove warnings * remove warnings * added doc lines * added doc lines	2020-06-28 07:01:04 -04:00
Adrian Wälchli	d910cc5200	docs: dont mock imports when running sphinx doctest (#2396 ) * skip if no amp * dont mock when doctesting * install extra	2020-06-27 23:31:06 -04:00
Jirka Borovec	51711c265a	fix loading model with kwargs (#2387 ) * test * fix * fix	2020-06-27 16:38:03 -04:00
William Falcon	90f641af0d	fixes logger crash on ddp (#2388 ) * remove warnings * remove warnings * remove warnings * remove warnings * remove warnings * remove warnings * remove warnings * remove warnings * remove warnings * remove warnings	2020-06-27 15:08:22 -04:00
Jirka Borovec	0be78d13aa	native amp (#2373 ) * native amp * typo * imports * apex	2020-06-26 21:45:13 -04:00
Jirka Borovec	f1c96930b1	repair CI for Win (#2358 ) * no cov * no cov * ReduceOp * group * reduce_op.sum * Update sklearns.py * formatting * horovod * Apply suggestions from code review * horovod * horovod * horovod * horovod * ci * print * ci * timeout * timeout * time * fix * distributed cpu * pipes * time * cpu * spawn * spawn * spawn * tp * separate * os * os * npm * Fix load_from_checkpoint() not working with URL on Windows * Update CHANGELOG * Update CHANGELOG.md Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com> * Apply suggestions from code review * fix * fix meta tags creating empty lines * pyright * node * fix httpserver address * drop tutils.default_trainer_options * imports * Better fix for load_from_checkpoint() not working with absolute path on Windows (#2294) * Fix load_from_checkpoint() not working with URL on Windows * Update CHANGELOG * Update CHANGELOG.md Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com> * drop duplicate Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> Co-authored-by: airium <airium@outlook.com> Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: AIRIUM <38249940+airium@users.noreply.github.com>	2020-06-26 21:38:25 -04:00
Jirka Borovec	a5f45787ea	fix get dataloader size (#2375 ) * get dataloader size * pyright	2020-06-26 15:38:48 -04:00
William Falcon	cbb2427f0d	changed apex level (#2362 )	2020-06-25 18:54:32 -04:00
William Falcon	0a092f6683	making optimization steps for hooks (#2363 ) *simplified optimizer step and zero grad overriding	2020-06-25 16:02:16 -04:00
William Falcon	d22181714a	fix 2333 (#2360 )	2020-06-25 11:10:17 -04:00
William Falcon	c275e1fc91	swaps lr sched order (#2356 ) * swaps lr sched order * Update optimizers.py * added amdim encoder choice	2020-06-25 09:21:41 -04:00
William Falcon	598f5140c5	refactor training loop (#2336 ) * refactoring training epoch * refactored training epoch * refactored training epoch * refactored training epoch * refactored training epoch * refactored training epoch * fixes slurm weights saving * fixes slurm weights saving	2020-06-23 23:38:22 -04:00
William Falcon	c09b2ffb91	test (#2341 ) * fixes rank zero issue	2020-06-23 21:57:45 -04:00
William Falcon	a915280427	fixes slurm weights saving (#2339 )	2020-06-23 20:16:34 -04:00
Lezwon Castelino	9446390779	fix TPU parsing and TPU tests (#2094 ) * added tpu params test * added tests * removed xla imports * added test cases for TPU * fix pep 8 issues * refactorings and comments * add message to MisconfigurationException Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * test if device is set correctly * added TPU device check removed mark.spawn * removed device selection * remove xla_device call * readded spawn due to test failures * add TODO for tpu check * Apply suggestions from code review * Apply suggestions from code review * flake8 * added tpu args to cli tests * added support for tpu_core selection via cli * fixed flake formatting * replaced default_save_path with default_root_dir * added check for data type for tpu_cores * fixed flake indent * protected * protected * added tpu params test * added tests * removed xla imports * test if device is set correctly * added support for tpu_core selection via cli * replaced default_save_path with default_root_dir * added check for data type for tpu_cores * chlog * fixed tpu cores error * rebased with latest changes * flake fix * Update pytorch_lightning/trainer/distrib_parts.py added suggesstion Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka <jirka@pytorchlightning.ai>	2020-06-23 12:06:57 -04:00
Adrian Wälchli	e085e93dd3	Add missing test for "multiple dataloader + percent_check fix" (#2226 ) * Init fix num_batches * Fix num_batches in case of multiple dataloaders * Apply suggestions from code review * Changes based on suggestions * Flake8 * Add test to check num_batches * generalize dataloader percent check test * fix formatting * remove hparams * tests * CHANGELOG * Update CHANGELOG.md * max_batches can be int * conflict and rebase * add back the test fix fix message 0.0 works Revert "fix message" This reverts commit 839cacf8b8610f4e697e654ef6f3d2501bf23984. * update changelog * Update CHANGELOG.md * Fix num batches in case of multiple dataloaders and percent_check (#1920) * git conflict Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * missing union * doc update suggestion by @rohitgr7 * extend test * changelog * docs add note about multiple loaders * update changelog * remove unused variable Co-authored-by: rohitgr7 <rohitgr1998@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-23 11:21:24 -04:00
Adrian Wälchli	bdee1cd106	update docs for "overfit_batches" (#2324 ) * update docs * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-23 11:19:38 -04:00
William Falcon	0f073819d3	refactored training_batch + tests to verify correctness (#2328 ) * refactored training_bath * refactored training_bath * refactored training_bath * refactored training_bath * refactored training_bath * refactored training_bath * refactored training_bath * refactored training_bath * refactored training_bath * refactored training_bath * refactored training_bath	2020-06-23 11:17:10 -04:00
Jirka Borovec	4b90b79080	check omegaconf gpus (#2273 ) * check omegaconf gpus * test * test * Apply suggestions from code review Co-authored-by: Omry Yadan <omry@fb.com> Co-authored-by: Omry Yadan <omry@fb.com>	2020-06-19 23:42:11 -04:00
Jirka Borovec	f278ac42c8	Revert/Fix: epoch indexing from 1, to be from 0 (#2289 ) * Revert "deprecated: epoch indexing from 1 (#2206)" This reverts commit `f94b919b` * chlog * grad index * Apply suggestions from code review * tests * fix * test	2020-06-19 23:39:53 -04:00
thschaaf	554fb4754c	Bugfix/_has_len (#2293 ) * deal with NotImplementedError raised by torchtext * deal with NotImplementedError raised by torchtext * Added tests for dataloader which raise NotImplementedError in __len__() * Fixed some typos Co-authored-by: Thomas Schaaf <tschaaf@cs.cmu.edu>	2020-06-19 23:38:15 -04:00
Jirka Borovec	e0b7fed92e	deprecated Trainer proc_rank (#2269 ) * deprecated * test	2020-06-19 15:46:27 -04:00
William Falcon	8d51279703	[refactor results 1] - refactor backward (#2276 ) * move backward * refactor backward to remove 16 bit from user override * refactor backward to remove 16 bit from user override * Update pytorch_lightning/core/hooks.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-19 15:44:44 -04:00
Jirka Borovec	54acc79f31	continue 0.8.x (#2264 ) * cleaning * docs * docs * types * mixins * mixins * docs * typo	2020-06-19 11:00:46 -04:00
William Falcon	e8f58b5ed6	Merge branch 'master' of https://github.com/PyTorchLightning/pytorch-lightning	2020-06-19 01:11:29 -04:00
William Falcon	3c8c2e3deb	fix missing arg	2020-06-19 01:11:22 -04:00
William Falcon	a6f94a6f43	remove tpu barrier (#2260 )	2020-06-19 00:57:00 -04:00
William Falcon	57d5f6e74a	Barrier (#2257 ) * remove barriers * remove barriers * remove barriers * remove barriers * remove barriers * remove barriers * remove barriers * remove barriers * remove barriers * remove barriers	2020-06-19 00:42:20 -04:00
William Falcon	b5a2f1ec44	fix setup and on fit calls (#2252 )	2020-06-18 21:45:09 -04:00
William Falcon	b7fc092bf4	made fx public (#2247 ) * made fx public * made fx public * made fx public	2020-06-18 20:20:29 -04:00
William Falcon	68a1e52292	added barrier (#2245 ) * added barrier * blank line * added barrier * added barrier * made fx public Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-18 20:15:02 -04:00
Jirka Borovec	a2d3ee80ad	final cleanup for v0.8.0 (#2181 ) * final clean for v0.8.0 * chlog * chlog * date * rename stage * date * missing	2020-06-18 07:21:44 -04:00
William Falcon	476911d60c	Pid port + duplicate rank_zero logging (#2231 ) * init the port using a seed that matches process id for ddp * init the port using a seed that matches process id for ddp * init the port using a seed that matches process id for ddp * init the port using a seed that matches process id for ddp * init the port using a seed that matches process id for ddp * init the port using a seed that matches process id for ddp * init the port using a seed that matches process id for ddp Co-authored-by: Zhaofeng Wu <zfw7@cs.washington.edu>	2020-06-18 00:19:06 -04:00
William Falcon	15cf6a86c2	Tpu logging (#2230 ) * add tpu view * add tpu view * add tpu view * add tpu view * add tpu view	2020-06-17 22:45:09 -04:00
William Falcon	34816e9ec4	adds setup+teardown hook (#2229 ) * allow regression metrics to import * allow regression metrics to import * allow regression metrics to import * allow regression metrics to import * allow regression metrics to import * allow regression metrics to import * allow regression metrics to import * allow regression metrics to import * allow regression metrics to import	2020-06-17 19:49:58 -04:00
William Falcon	2411c3be70	replace train_percent_check with limit_train_batches (#2220 ) * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * drop train_percent_check * chlog * deprecated * deprecated * deprecated * tests * tests * Apply suggestions from code review * tests * hydra support * tests * hydra support * hydra support * hydra support * tests * typo * typo * Update test_dataloaders.py * docs * docs * docs * docs Co-authored-by: Jirka <jirka@pytorchlightning.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-17 13:42:28 -04:00
Nilansh Rajput	25c7465591	Enable gradients at train start (#2200 ) * Enable gradients at train start * Update training_loop.py Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-06-17 10:52:58 -04:00
William Falcon	2330d32531	hydra ddp support + generalized ddp gpu flags (#2212 ) * hydra ddp support * hydra ddp support * hydra ddp support * hydra support * hydra support * hydra support	2020-06-17 10:33:25 -04:00
William Falcon	04c794ca72	[WIP] Rename overfit_pct to overfit_batches (and fix) and val_percent_check and test_percent_check (and fix) (#2213 ) * fixed percent check for val/test * fixed percent check for val/test * fixed percent check for val/test * fixed percent check for val/test * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * overfit_pct now uses train loaders for val and test and does not shuffle * add on fit_start on fit_end hooks * add on fit_start on fit_end hooks * add on fit_start on fit_end hooks Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-17 08:03:28 -04:00
William Falcon	97dfd3a80a	Revert "Misleading exception raised during batch scaling (#1973 )" (#2219 ) This reverts commit `f8103f9c7d`.	2020-06-17 08:01:53 -04:00
Tejasvi S Tomar	f8103f9c7d	Misleading exception raised during batch scaling (#1973 ) * Misleading exception raised during batch scaling Use batch_size from `model.hparams.batch_size` instead of `model.batch_size` * Improvements considering #1896 * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-17 08:01:04 -04:00
William Falcon	e1f238a097	add on fit_start on fit_end hooks (#2217 ) * add on fit_start on fit_end hooks * add on fit_start on fit_end hooks * add on fit_start on fit_end hooks	2020-06-17 07:37:16 -04:00
Jirka Borovec	e289e45120	test: save hparams to yaml (#2198 ) * save hparams to yaml * import * resolves * req * Update requirements/base.txt Co-authored-by: Omry Yadan <omry@fb.com> Co-authored-by: Omry Yadan <omry@fb.com>	2020-06-16 06:34:55 -04:00
Jirka Borovec	f94b919b96	deprecated: epoch indexing from 1 (#2206 ) * epoch indexing from 1 * chlog * fix tests * fix tests * self.min_epochs	2020-06-16 06:33:41 -04:00
Adrian Wälchli	7dc58bd286	Refactor model summary + generalize example input array (#1773 ) * squash variant a variant b add test revert rename add changelog docs move changelog entry to top use hooks wip wipp layer summary clean up, refactor type hints rename remove obsolete code rename unused imports simplify formatting of table and increase readability doctest superclass object update examples print unknown sizes more docs and doctest testing unknown layers add rnn test remove main restore train mode test device wip device constant simplify model forward transfer return summary object in method extend tests fix summary for empty module extend tests refactor and added hook variant a variant b add test revert rename add changelog docs move changelog entry to top remove hardcoded string simplify test unknown shapes and all others comments for tests fix hparams attribute * update default * unused import * clean up * replace hardcoded strings * fix doctest * fix top/full * black * fix rnn test * fix rnn * update debugging docs update docs typo update docs update docs * add changelog * extract constant * setter and getter * move parity models to test folder * parameterize mode	2020-06-15 17:05:58 -04:00
Adrian Wälchli	22d9464e56	HenryJia: auto-move data decorator (#1905 ) * First attempt at auto-moving data for inference * Correct my copypaste errors * Correct for if device is CPU * Get rid of the WIP code I accidentally added * Add tests * Make tests more foolproof * Make sure we stick with pep8 formatting * Clarify docs a little * Apply suggestions from code review * Get everything working again hopefully * refactor and added hook variant a variant b add test revert rename add changelog docs * move changelog entry to top * Move data transfer to utilities * Add back in warnings for autotransfer * Get rid of the test code I ended up accidentally commiting again * Add docs any changelog * Correct PR number in Changelog * Correct changelog * Update data.py * Update test_cpu.py * make a decorator * type hint * changelog * changelog * remove old function * import * test for decorator * fix test * remove old test * doctest * apply decorator directly * convert doctest to code block * prevent side effects in tests * fix merge * update forward docs * update docs * added docs in section "deployment / prediction" * update changelog Co-authored-by: Hengjian Jia <henryjia18@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-06-15 17:04:32 -04:00
Peter Yu	37e7582486	Add ckpt_path option to LightningModule.test() (#2190 ) * Add ckpt_path option to LightningModule.test() If ckpt_path is "best" (default), it loads the best weights saved by ModelCheckpoint for the test loop. If ckpt_path is a path to a checkpoint file, it loads the weights from the file for the test loop. If ckpt_path is None, it uses the weights from the end of training for the test loop. If model parameter is set, ckpt_path is ignored. * Update test_set.rst Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-06-15 08:02:37 -04:00
William Falcon	48a76a785d	Performance docs (#2191 ) * add workers fix * add workers fix	2020-06-15 08:02:19 -04:00
Simon-Martin Schröder	fd1693e289	Handle KeyboardInterrupt during training (#2134 ) * Handle KeyboardInterrupt during training Fixes #2079. * chlog * Fix whitespace * Update callback_hook.py * Update base.py * Update training_loop.py * Update test_trainer.py * Update CHANGELOG.md Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update CHANGELOG.md * on_keyboard_interrupt Co-authored-by: Jirka <jirka@pytorchlightning.ai> Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-06-15 12:35:26 +02:00
William Falcon	bd3a1f7dd4	Fast defaults (#2185 ) * log row might be a bottleneck depending on network. 50 unblocks this and is small enough for small datasets * log row might be a bottleneck depending on network. 50 unblocks this and is small enough for small datasets	2020-06-14 20:17:49 -04:00
Jirka Borovec	c0903b800d	past checkpoints (#2160 ) * past checkpoints * omegaConf save * enforce type * resolve=True Co-authored-by: Omry Yadan <omry@fb.com> * test omegaconf * tests * test past Co-authored-by: Omry Yadan <omry@fb.com>	2020-06-14 11:36:45 -04:00
William Falcon	5fd01b0e68	Finish Ananthsub patch 1 (enable prepare_data from correct processes). clarify local vs global rank (#2166 ) * [trainer] Call prepare_data once per node in DDP/DDP2 training * refactored DDP routes * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * renamed proc_rank to local_rank * spawn message * spawn message * spawn message * fixes * fixes * fixes * fixes * fixes * Update trainer.py Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>	2020-06-13 12:00:14 -04:00
William Falcon	9df2b2090d	sets default ddp mode to spawn (#2168 ) * set ddp_spawn as default * spawn message * spawn message * spawn message * spawn message * spawn message * spawn message * spawn message * spawn message	2020-06-13 03:47:45 -04:00
Jirka Borovec	2674976f2c	remove deprecated API for v0.8 (#2073 ) * remove deprecated API * chlog * times * missed * formatting check * missing * missing * miss * fix docs build error * fix pep whitespace error * docs * wip * amp_level * amp_level Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-06-12 14:37:52 -04:00
Udit Arora	08573d0f7e	Fix some pyright member access errors in training module (#2121 ) * Fix pyright member access errors in training module * Fix Trainer instantiation error due to inheritence order * Add GH workflow for pyright * Fix more pyright errors in trainer module * Add pyrightconfig and setup python environment in type-check workflow * Exclude pyrightconfig.json * suggestions Co-authored-by: Jirka <jirka@pytorchlightning.ai>	2020-06-12 17:23:18 +02:00
Peter Yu	06cd849538	Allow loading checkpoints from urls (#1667 ) * allow loading checkpoints from urls * tmpdir_server fixture * test cases for loading checkpoints from url * dir => root_dir * default map_location to None * test case for resume_from_checkpoint * changelog * doc update * monkeypatch TORCH_HOME to avoid caching * Use a threading server with random ports so that it is easier to clean up * test fixes * pep8 fix * ThreadingHTTPServer support in 3.6 * pep8 fix * fix changelog * separate tests for urls * typo Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-11 17:12:48 -04:00
Pattarawat Chormai	3be557dc5b	document: fix callback signature (#2113 )	2020-06-09 07:10:44 -04:00
William Falcon	479ab49d03	temporarily fixes early stopping bug (#2119 ) * fixes early stopping bug * fixes early stopping bug * fixes early stopping bug * fixes early stopping bug * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * fixe docs * added test	2020-06-08 19:28:26 -04:00
William Falcon	3260e59b27	Adds back the slow spawn ddp implementation that people want (#2115 ) * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * training batch clean up * adding spawn * adding spawn * adding spawn * adding spawn * adding spawn * adding spawn * adding spawn * adding spawn	2020-06-08 17:55:25 -04:00
William Falcon	0bd7780adc	Fixes CPU and hanging GPU crash (#2118 ) * training batch clean up * training batch clean up * training batch clean up	2020-06-08 16:30:20 -04:00
Jirka Borovec	d2967d9305	update hparams, allow OmegaConf (#2047 ) * DictConf * inits * Apply suggestions from code review Co-authored-by: Omry Yadan <omry@fb.com> * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * atrib * wip * wip * wip * added hparams test * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * Update test_hparams.py * added hparams test * added hparams test * pep8 * pep8 * pep8 * docs * wip * wip * clean * review @omry * Update docs/source/hyperparameters.rst Co-authored-by: Omry Yadan <omry@fb.com> Co-authored-by: Omry Yadan <omry@fb.com> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-06-08 07:19:34 -04:00
Wah Loon Keng	6e993c608b	correct trainer.fit production example (#2068 ) trainer.fit uses the parameter `val_dataloaders` but in the documentation it is `val_dataloader`, which is invalid.	2020-06-04 11:24:12 -04:00
Adrian Wälchli	8211256c46	data transfer model hook (+ refactor) (#1756 ) * refactor and added hook variant a variant b add test revert rename add changelog docs * resolve merge duplication * overridden typo * fix test * tpu id * raise if TPU not available * re-use apply_to_collection function for parsing collections * comment * make utility function available to user * documentation * move changelog entry to top * fix tpu transfer call * fix call * remove hardcoded string * improve test * call model hook by default * Apply suggestions from code review * rename utility function Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-02 21:45:19 -04:00
Devashish Shankar	ade3f36b7a	Raise an error when lightning replaces an existing sampler (#2020 ) * Raise an error when lightning replaces an existing sampler Currently, Trainer replaces the existing sampler with DistributedSampler if running distributing training and `replace_sampler_ddp=True` (default behaviour). If a user has configured an existing sampler, this would lead to widely different results if running a distributed vs non-distributed training. This PR fixes this by raising an Error if user has configured a sampler and uses `replace_sampler_ddp=True`. The recommended behavior from now on is to either remove the sampler or set `replace_sampler_ddp=False` * Fix tests * Simpler fix * Fix tests * Make inner method protected * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-02 18:52:04 -04:00
Ivan Nazarov	e85a646a41	Mistake in parameters' grad norm tracking (#2012 ) * fix grad norm formula * grad-norm tracker test * fixed seed and explicit rtol in grad norm tracking test * a docstring for grad-norms and forced cast to float of norm_type * support for inf-norm * renamed the grad norm test * docs * fixed language in docstring * Apply suggestions from code review Co-authored-by: Jirka <jirka@pytorchlightning.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-02 18:51:09 -04:00
Adrian Wälchli	a699003e67	Update/merge multi-gpu docs (#2021 ) * merge multi-gpu docs * extend slurm docs * update links to elastic * format docs and type hints in distrib parts * reference multi-gpu/slurm in trainer args docs * fix doctest * typo * doctest * Apply suggestions from code review Co-authored-by: Lucas Vazquez <lucasgouvaz@gmail.com> * wall time * Update docs/source/slurm.rst Co-authored-by: Lucas Vazquez <lucasgouvaz@gmail.com> * fix title * update docs for weights summary * update changelog Co-authored-by: Lucas Vazquez <lucasgouvaz@gmail.com>	2020-06-02 18:50:08 -04:00
Lezwon Castelino	943c4b20af	slow tpu train (#2033 ) * use parallel loader * Revert "use parallel loader" This reverts commit ed6e7583 * select tpu id for pl * condition if tpu_id is None * added info to changelog * Revert "condition if tpu_id is None" This reverts commit `1fb6e586` * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-06-02 18:48:05 -04:00
William Falcon	82a20296e3	Replaces ddp .spawn with subprocess (#2029 ) * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * replace ddp spawn with subprocess * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix * hot fix	2020-06-01 11:00:32 -04:00
Fabio Natanael Kepler	8b9b923ca8	Keep track of the best model's path saved by ModelCheckpoint (#1799 ) * Add an additional attribute to ModelCheckpoint to keep track of the best model's path Currently, only the best metric value is directly tracked. This new attribute will help in uses cases where the trained model needs to be used or tracked right after training. * Add small description and usage example to docs * Fix PEP8 issues * Fix doctest example * Fix expected output in doctest * Apply suggestions from code review * Show example as code block instead of doctest * Apply suggestions from code review * Update CHANGELOG.md * Rename `ModelCheckpoint.best` to `ModelCheckpoint.best_model_score` Also rename `ModelCheckpoint.best_model` (added in this PR) to `ModelCheckpoint.best_model_path`, for consistency, and `kth_best_model` to `kth_best_model_path`. * Update pytorch_lightning/trainer/training_io.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Add warning when loading checkpoint from an old version Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-05-31 08:47:13 -04:00
Yassine Alouini	d8dc0a7228	Few typo correction (#2011 )	2020-05-31 00:39:56 -04:00
Justus Schock	ceecf1cea9	Graceful shutdown on python interpreter exit (#1631 ) * Fraceful shutdown on python interpreter exit * Update CHANGELOG.md * Update training_loop.py * Update training_loop.py * Update CHANGELOG.md Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com> * pep8, move to constant * Update training_loop.py * Update training_loop.py * Update training_loop.py * pep8, move to constant * pep8 * timeout Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2020-05-29 16:20:04 +02:00
Loïc Grobol	c3cf33d1de	Fix root node resolution (#1954 )	2020-05-27 22:50:37 -04:00
Mateusz Pieniak	3af4994d5a	Removing unecessary early stopping calls (#1863 ) * Removing unecessary early stopping calls * Update CHANGELOG.md Co-authored-by: Mateusz Pieniak <mateusz.pieniak@evidenceprime.com> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-05-26 19:06:06 -04:00
Jirka Borovec	5e8c5abf63	fix default arg (#1927 ) * fix default * formatting errors * update * flake8	2020-05-26 19:04:42 -04:00
Jirka Borovec	ca815698f5	Revert "Remove unused param tpu_core_idx (#1948 )" (#1963 ) This reverts commit `d0ec11b9d6`.	2020-05-26 19:02:51 -04:00
William Falcon	460ab5485e	Gen ddp support (#1961 ) * updated docs * added mixed * added mixed	2020-05-26 19:02:30 -04:00
Rohit Gupta	d0ec11b9d6	Remove unused param tpu_core_idx (#1948 )	2020-05-25 16:04:53 -04:00
Adrian Wälchli	34237cfcaf	handle unknown args passed to Trainer.from_argparse_args (#1932 ) * filter valid args * error on unknown manual args * added test * changelog * update docs and doctest * simplify * doctest * doctest * doctest * better test with mock check for init call * fstring * extend test * skip test on 3.6 not working Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-05-25 16:01:29 -04:00
William Falcon	f46a7bae77	updated docs (#1941 )	2020-05-25 15:59:32 -04:00
Federico Baldassarre	65b4352930	early stopping checks on_validation_end (#1458 ) * Fixes PyTorchLightning/pytorch-lightning#490 `EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`. * Highlighted that ES callback runs on val epochs in docstring * Updated EarlyStopping in rst doc * Update early_stopping.py * Update early_stopping.rst * Update early_stopping.rst * Update early_stopping.rst * Update early_stopping.rst * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update docs/source/early_stopping.rst * fix doctest indentation warning * Train loop calls early_stop.on_validation_end * chlog Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Jirka <jirka@pytorchlightning.ai>	2020-05-25 17:33:00 +00:00
Adrian Wälchli	8ca8336ce5	protect progress bar callback (#1855 ) * wip protected progress bar settings * remove callback attr from LRfinder * whitespace * changelog	2020-05-25 07:49:23 -04:00
Nicki Skafte	a34eb9e169	Fix logger bug and prepare data bug (#1933 ) * tests, fix logger bug and prepare data bug * add CHANGELOG.md Co-authored-by: Nicki Skafte <nugginea@gmail.com>	2020-05-25 07:43:56 -04:00
William Falcon	caa9c6760b	replace Hparams by init args (#1896 ) * remove the need for hparams * remove the need for hparams * remove the need for hparams * remove the need for hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * replace self.hparams * fixed * fixed * fixed * fixed * fixed * fixed * fixed * fixed * fixed * fixed * fixed * fixed * fixed * fixed * finished moco * basic * testing * todo * recurse * hparams * persist * hparams * chlog * tests * tests * tests * tests * tests * tests * review * saving * tests * tests * tests * docs * finished moco * hparams * review * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * hparams * overwrite * transform * transform * transform * transform * cleaning * cleaning * tests * examples * examples * examples * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * chp key * tests * Apply suggestions from code review * class * updated docs * updated docs * updated docs * updated docs * save * wip * fix * flake8 Co-authored-by: Jirka <jirka@pytorchlightning.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-05-24 18:59:08 -04:00
Nicki Skafte	8f6b7a2b4f	Fix user warning produced by apex + scheduler combination (#1873 ) * fix user error produced by apex + scheduler combination * add changelog * added reinit to every configure_apex call * fix styling Co-authored-by: Nicki Skafte <nugginea@gmail.com>	2020-05-22 07:19:37 -04:00
Maxim Grechkin	98f7842970	Allow dataloaders without sampler field present (#1907 ) * Allow dataloaders without sampler field present Sometimes we have a custom dataloader that doesn't have a sampler, better to check that the field is there before reading it. * chlog Co-authored-by: Jirka <jirka@pytorchlightning.ai>	2020-05-20 20:57:12 +00:00
Kevin Trebing	3459a54667	Changed order of `update_learning_rates()` and `run_training_teardown()`. (#1891 )	2020-05-19 13:16:26 -04:00
Rohit Gupta	ac76dfcf62	Remove NaNs from loss in LRFinder (#1862 ) * Remove NaNs from loss in LRFinder * np.isfinite * chlog * add test * chlog Co-authored-by: Jirka <jirka@pytorchlightning.ai>	2020-05-19 08:39:19 +02:00
Ashraful Islam	981169cacc	add warning for shuffling in test/val (#1865 )	2020-05-18 09:53:02 -04:00
Lezwon Castelino	7c7e50ca47	Allow user to select individual TPU core to train on (#1729 ) * added tpu_id added tpu_id to mixins * train on individual tpu * parallel loader if tpu_id is None * removed progress_bar_refresh_rate * chlog * replaced num_tpu_cores with tpu_cores * set tpu_id to None if int * changed num_tpu_cores to tpu_cores in docs * updated docs * updated __init__.py removed self.tpu_id for ParallelLoader * Update pytorch_lightning/trainer/__init__.py * check if tpu_cores is a list Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * xla device conditional * num_tpu_cores deprecation * removed duplicate warning * fixed pep8 error * Revert "removed duplicate warning" This reverts commit `8adb0a9b` * deprecated api update * fixed recursion error * fixed tests * fixed flake errors * removed current_tpu_index * Update CHANGELOG.md * Update trainer.py Co-authored-by: Jirka <jirka.borovec@seznam.cz> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-05-17 16:30:54 -04:00
Fabio Natanael Kepler	8c4c7b105e	Fix `save_weights_only` flag in ModelCheckpoint (#1780 ) * Add flag to `dump_checkpoint` for only including weights `ModelCheckpoint` then passes `self.save_weights_only` to the save function. * Fix tests and add changelog entry * Add check and descriptive message when training state is restored from a weights only checkpoint Also add a test for making sure `ModelCheckpoint.save_weights_only` works as expected. * Fix weights-only test to properly match expected exception * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-05-17 09:24:17 -04:00
Adrian Wälchli	769a459d27	remove extra kwargs from Trainer init (#1820 ) * remove kwargs * remove useless test * rename unknown trainer flag * trainer inheritance and test * blank line * test for unknown arg * changelog	2020-05-17 09:14:54 -04:00
Rohit Gupta	56d521a317	Fix test configuration check and testing (#1804 ) * Fix test configuration check and testing * Fix test configuration check and testing * Remove check_testing_configuration during test * Fix docstring * fix function name * remove conflicts	2020-05-17 08:22:44 -04:00
Adrian Wälchli	4cdebf9a64	remove obsolete self._device in Trainer (#1849 ) * remove unused device attribute * dtype * move on_gpu to model	2020-05-17 08:20:51 -04:00
William Falcon	b84b02400a	enable any dict and namespace in hparams (#1847 )	2020-05-15 15:08:16 -04:00
Justus Schock	c05077fae3	Enable non-blocking for gpu device transfer (#1843 ) * Update distrib_parts.py * Update CHANGELOG.md	2020-05-14 17:56:40 -04:00
Jirka Borovec	bee0392c37	extend arg parser (#1842 ) * extend arg parser * flake8 * tests * example * fix test	2020-05-14 17:56:11 -04:00
Nicki Skafte	88f816ed06	dummy logger (#1836 ) Co-authored-by: Nicki Skafte <nugginea@gmail.com>	2020-05-14 10:34:11 -04:00
William Falcon	53d9316a56	fixes ddp bugs (#1819 ) * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug * debug	2020-05-13 19:17:04 -04:00
William Falcon	648d516668	✨ Use store_true for bool args (#1822 ) * ✨ Use store_true for bool args * debug Co-authored-by: Nate Raw <nxr9266@g.rit.edu>	2020-05-13 19:12:06 -04:00
Ashwin Bharambe	0e71705a0a	[checkpoint logic] Fix bug which doesn't account for NoneType for `model.hparams` (#1817 ) The intention of the code is to output a warning message when `hparams` is null or not set. Instead the code now fatals when `model.hparams = None`. Prevent that.	2020-05-13 17:14:11 -04:00
Nicki Skafte	663b90035c	Bugfix: accumulation and suggestion for learning rate finder (#1801 ) * fix suggestion being too naive * fix accumulation error and added new tests * fix styling * update CHANGELOG.md * update based on review * fix tests * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review * Apply suggestions from code review Co-authored-by: Nicki Skafte <nugginea@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-05-13 14:40:44 -04:00
Ashwin Bharambe	aefc5314bc	[ddp] Support multi-node distributed execution under torchelastic (#1811 ) The changes are quite local and limited in nature -- viz., checking for some indicator environment variables. We check for (SLURM_LOCALID, NODE_RANK, GROUP_RANK) in order. If multiple are found set, a warning is logged. This patch also fixes a minor bug with comparing the `WORLD_SIZE` environment variable. This can be a string type.	2020-05-13 14:06:59 -04:00
So Uchida	22d7d03118	Replace meta_tags.csv with hparams.yaml (#1271 ) * Add support for hierarchical dict * Support nested Namespace * Add docstring * Migrate hparam flattening to each logger * Modify URLs in CHANGELOG * typo * Simplify the conditional branch about Namespace Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com> * Update CHANGELOG.md Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com> * added examples section to docstring * renamed _dict -> input_dict * mata_tags.csv -> hparams.yaml * code style fixes * add pyyaml * remove unused import * create the member NAME_HPARAMS_FILE * improve tests * Update tensorboard.py * pass the local test w/o relavents of Horovod * formatting * update dependencies * fix dependencies * Apply suggestions from code review * add savings * warn * docstrings * tests * Apply suggestions from code review * saving * Apply suggestions from code review * use default * remove logging * typo fixes * update docs * update CHANGELOG * clean imports * add blank lines * Update pytorch_lightning/core/lightning.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/core/lightning.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * back to namespace * add docs * test fix * update dependencies * add space Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2020-05-13 15:05:15 +02:00
Jirka Borovec	10ce1c0256	device property (#1791 ) * device property * add/copy properties * inherit * rename * Apply suggestions from code review Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> * dtype * prop * pt api Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>	2020-05-12 23:18:39 -04:00
Adrian Wälchli	8978794730	add missing flag (#1805 )	2020-05-12 17:06:38 -04:00
Oliver Neumann	9059d21042	Missing profiler attribute in add_argparse_args() ArgumentParser (#1794 ) * Fixed typing annotation by adding boolean type. After that Profiler flag will be added to argparse. * Updated CHANGELOG.md * Updated git_init_arguments_and_types() to pass doctests. * Added doctest example to add_argparse_parser()	2020-05-12 08:53:26 -04:00
kumuji	619f984c36	Option to provide seed to random generators to ensure reproducibility (#1572 ) * Option to provide seed to random generators to ensure reproducibility I added small function in utilities which imports torch, numpy, python random and sets seed for all of the libraries to ensure reproducibility of results. * Apply recommendations from core contributors on seeding 1. Moved the seeding code to another file 2. Make deterministic as a parameter for trainer class 3. Add assertions for seeding numpy 4. Added warnings 5. torch.manual_seed should be enough for seeding torch * Revert "Apply recommendations from core contributors on seeding" This reverts commit a213c8e6882eec8a9e7408b9418926d2db7c5461. * Revert "Revert "Apply recommendations from core contributors on seeding"" This reverts commit 59b2da53c62878de7aab0aa3feb3115e105eea06. * Change in test, for correct seeding * Allow seed equal to 0 * Allow seed to be uint32.max * Added deterministic to benchmarks * Cuda manual seed as in benchmark seeding * Seeding should be done before model initialization * cuda manual_seed is not necessary * Fixing seed test_cpu_lbfgs On some seeds seems like lbfgs doesn't converge. So I fixed the seed during testing. * rebasing issue with old reproducibility.py * Improved documentation and ability to seed before initializing Train class * Change in docs * Removed seed from trainer, update for documentation * Typo in the docs * Added seed_everything to _all_ * Fixing old changes * Model initialization should be earlier then Trainer * Update pytorch_lightning/trainer/__init__.py From Example to testcode Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Fixing according to the contributors suggestions * Moving horovod deterministic to Trainer class * deterministic flag affects horovod docs update * Improved static typing * Added deterministic to test runners of horovod It is failing on some versions, not very predictable * static seeds for horovod tests * Change for reset_seed function in tests * Seeding horovod using reset_seed from tutils * Update pytorch_lightning/trainer/__init__.py * chlog * Update trainer.py * change "testcode" to "Example" in trainer init documentation * Update pytorch_lightning/trainer/seed.py, first line in comment Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka <jirka.borovec@seznam.cz> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-05-12 07:53:20 -04:00
William Falcon	10b16dbfab	made ddp the default if no backend specified with multiple GPUs (#1789 ) * made ddp the default if no backend specified with multiple GPUs * fix * spawn Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2020-05-12 06:54:23 -04:00
Travis Addair	acab068c74	Join Horovod workers at the end of trainer.fit() to prevent race conditions following training (#1786 ) * Join Horovod workers at the end of trainer.fit() to prevent race conditions following training * flake8 * flake8 Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2020-05-12 09:15:25 +00:00
William Falcon	7b60d49432	fixed native amp + ddp (#1788 ) * fixed native amp + ddp * fixed native amp + ddp	2020-05-12 00:25:06 -04:00
Jeremy Jordan	1df0d2dc97	set logger level for package (#1718 ) * move logging config to trainer class init * alternate logging config	2020-05-12 00:14:35 -04:00
William Falcon	4b30ef6480	Device (#1790 ) * added self.device * added docs	2020-05-12 00:09:48 -04:00
Kevin Chen	de1fdd8d3b	Removed test_dataloader call in check_testing_model_configuration (#1670 ) * Removed test_dataloader call * Check if test_dataloader is actually overriden * Fixed method spelling * Replaced lambdas * Replaced None with super method * Fixed testpass	2020-05-12 00:08:07 -04:00
William Falcon	5bb6b41b78	dataloaders with fast_dev_run (#1787 ) * dataloaders with fast_dev_run * dataloaders with fast_dev_run * dataloaders with fast_dev_run * fix * pep 8	2020-05-11 23:32:44 -04:00
Fabio Natanael Kepler	d120f97896	Fix saving native AMP scaler state (#1777 ) Saving was introduced in #1561.	2020-05-11 21:38:37 -04:00
William Falcon	eeb411144f	enable fast_dev_run without a validation loop (#1779 ) * fix val dataloader * Update evaluation_loop.py	2020-05-11 11:30:22 -04:00
Alexander Kreuzer	ee17c7c9c8	Fixed error message and test docstring (#1698 ) training_dataloader -> train_dataloader Co-authored-by: Alexander Kreuzer <alexander.kreuzer@sap.com>	2020-05-10 13:16:16 -04:00
Nicki Skafte	4970927ec8	Feature: auto scale batch size (#1638 ) * auto batch finder * fix styling * add description * add different modes * fix copy paste error * better organised code * fix styling * add tests * fix * fix * add some documentation * added CHANGELOG.md * some documentation * update based on review * Update trainer.py * Update docs/source/training_tricks.rst Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update tests/trainer/test_trainer_tricks.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update tests/trainer/test_trainer_tricks.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * use EvalModelTemplate * param tests * rename * wrap params * rename function * rename * rename param * fix * abs * rename * refactor code * add docs * try * arg * loop * exept * loop * drop bool * docs * docs * added check and test for passing dataloader to fit * styling fix * update based on review Co-authored-by: Nicki Skafte <nugginea@gmail.com> Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2020-05-09 08:28:36 -04:00
Adrian Wälchli	25bbd059df	Also update progress_bar in training_epoch_end (#1724 ) * update prog. bar metrics on train epoch end * changelog * wip test * more thorough testing * comments * update docs * move test Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2020-05-08 23:31:56 -04:00
Yuri Brovman	3a642601e8	added warning for None dataloader (#1745 ) * added warning for None dataloader * fixed variable style * updated warning message * remove unused import Co-authored-by: ybrovman <ybrovman@ebay.com>	2020-05-07 09:26:41 -04:00
Shunta Komatsu	f656882942	Fix typo (#1750 )	2020-05-07 09:25:54 -04:00
Pavel Grunt	b9364f96b1	lr_finder: Fix typo in docstring (#1746 )	2020-05-06 12:39:22 -04:00
Peter Yu	851866333c	Attach version_ to checkpoint path only if version is int (#1748 )	2020-05-06 12:38:32 -04:00
Yuri Brovman	35bbe178bd	fix _reset_eval_dataloader() for IterableDataset (#1560 ) * removed if dl from _reset_eval_dataloader() * changed to if dl != None to be more safe * hints from pep8speaks Co-authored-by: ybrovman <ybrovman@ebay.com>	2020-05-05 14:09:48 -04:00
Tian Wang	d6a0375974	Fixing logic (#1734 )	2020-05-05 14:07:26 -04:00
Jirka Borovec	2a2f303ae9	Tests: refactor trainer dataloaders (#1690 ) * refactor default model * drop redundant seeds * refactor dataloaders tests * fix multiple * fix conf * flake8 * Apply suggestions from code review Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-05-05 12:31:15 -04:00
Travis Addair	f90afa29b8	Fix disabling progress bar on non-zero ranks using Horovod backend (#1709 ) * Fix Horovod backend to disable progress bar on all ranks except 0 * Add join barriers * Added changelog * Make protected and add verbosity * Refactor to disable progress bar callback in train * Removed vebose setting * Add cache check for Horovod * Test run again * Updated comment * Always skip cache for Horovod * Only reinstall when necessary * Added separate step * Fixed spacing * Skip Python 3.8	2020-05-04 13:02:57 -04:00
Nicki Skafte	e865b046b1	Bugfix/lr finder (#1676 ) * fix early stopping bug * allow val dataloader * update CHANGELOG.md * fix early stopping bug * allow val dataloader * update CHANGELOG.md Co-authored-by: Nicki Skafte <nugginea@gmail.com>	2020-05-04 11:38:51 -04:00
Adrian Wälchli	d28b145393	Update type hints for multiple dataloaders in .fit() and .test() (#1723 ) * update typehints * change log	2020-05-04 08:24:34 -04:00
Adrian Wälchli	e6b34ef90d	[WIP] Reduction when batch size < num gpus (#1609 ) * reduce if <= num_gpus * add test with explanation * chlog * fix changelog Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>	2020-05-02 11:01:44 -04:00
Travis Addair	2950f66983	Fix Horovod distributed backend to set the root_gpu property (#1669 ) * params * drop acc * Fix Horovod distributed backend to set the root_gpu * Fixed test * Fixed tests * Fixed lint * Set root_gpu during initialization * chlog Co-authored-by: Jirka <jirka.borovec@seznam.cz>	2020-05-01 14:13:35 -04:00
Nathan Breitsch	3eac6cfd4f	Don't convert namedtuple to tuple (#1589 ) * Don't convert namedtuple to tuple * Test namedtuples sent to device correctly	2020-04-30 08:04:50 -04:00
William Falcon	d40425d257	added warning to crash (#1625 ) * added warning to crash * formatting Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>	2020-04-30 08:04:18 -04:00
Jacob Zhong	f9c9e39ab8	Add log output for slurm (#1657 ) * add log output for slurm * change log levels * formatting Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2020-04-30 07:58:03 -04:00
Peter Yu	8d564b5e38	call on_load_checkpoint() when resuming from checkpoint (#1666 )	2020-04-30 07:57:24 -04:00
Dan Campbell	981758fd04	Only call load_spawn_weights if COLAB_GPU or KAGGLE_URL_BASE environment variables are set	2020-04-27 19:46:29 -04:00
William Falcon	4c4d0e6bc6	Merge pull request #1635 from PyTorchLightning/pkl Fixes CPU DDP breaking change and DDP change	2020-04-27 07:42:42 -04:00
William Falcon	cebc74f7bc	ddp pickle	2020-04-27 07:19:12 -04:00
William Falcon	2181ad1bc7	ddp pickle	2020-04-27 07:12:04 -04:00
William Falcon	d2989f76e6	ddp pickle	2020-04-27 07:08:34 -04:00
William Falcon	d5dff384eb	ddp pickle	2020-04-27 07:07:03 -04:00
William Falcon	4aac5568a6	ddp pickle	2020-04-27 07:05:29 -04:00
William Falcon	6e86b59d21	ddp pickle	2020-04-27 06:51:42 -04:00
Justus Schock	f91b131ba2	Remove warning	2020-04-27 11:41:03 +02:00
William Falcon	879d879985	fix hparams issue	2020-04-26 17:27:45 -04:00
William Falcon	4755ded863	Clean up Argparse interface with trainer (#1606 ) * fixed distutil parsing * fixed distutil parsing * Apply suggestions from code review * log * fixed distutil parsing * fixed distutil parsing * fixed distutil parsing * fixed distutil parsing * doctest * fixed hparams section * fixed hparams section * fixed hparams section * formatting Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>	2020-04-26 09:20:06 -04:00
William Falcon	b620d86c54	diable val and test shuffling (#1600 ) * diable val and test shuffling * diable val and test shuffling * diable val and test shuffling * diable val and test shuffling * log * condition * shuffle * refactor Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>	2020-04-25 16:45:20 -04:00
William Falcon	791ba91dec	slurm job id (#1605 )	2020-04-25 16:01:15 -04:00
Jirka Borovec	58a467dd68	model checkpint on rank_zero_only & global rank state (#1408 ) * try delete in async or DDP us0-ecase * changelog * add model chekpoint rank * simple delete * flake8 * use global rank * chnagelog * fix review * fix import * proposal * proposal * proposal * improve proposal (fix problems with method call self) * cleaning Co-authored-by: Adrian Wälchli <adrian.waelchli@students.unibe.ch> Co-authored-by: William Falcon <waf2107@columbia.edu>	2020-04-24 17:21:00 -04:00
William Falcon	d0faf97893	fixed dataset stuff + docs (#1599 ) * Fixed dataset docs and disabled auto-sampler for iterable dataset	2020-04-24 16:51:26 -04:00
Jirka Borovec	570b2c7aeb	fix depreated call (#1596 ) * fix parity * update deprecated call	2020-04-24 14:45:43 -04:00
William Falcon	cd15bfc3ce	fixed new amp bugs (#1593 )	2020-04-24 09:29:39 -04:00

... 3 4 5 6 7 ...

744 Commits