Commit Graph

744 Commits

Author SHA1 Message Date
William Falcon 9a503de6af
Replace docs gifs with videos snippets so user can play at own speed (#2966)
* update docs
2020-08-13 18:52:47 -04:00
Jeff Yang 07c023c32f
fix(docs): docstring for amp_backend (#2960)
* fix(docs): docstring for amp_backend

* fix(docs): early_stop_checkpoint -> early_stop_callback

* docs

Co-authored-by: ananyahjha93 <ananya@pytorchlightning.ai>
2020-08-13 23:25:56 +02:00
SiddhantRanade 88bfed371e
Fix enforce_datamodule_dataloader_override() for iterable datasets (#2957)
This function contains the check `if (train_dataloader or val_dataloaders) and datamodule:`.

The issue is similar to the one fixed in https://github.com/PyTorchLightning/pytorch-lightning/pull/1560: `if dl` translates to `if bool(dl)`, and since `DataLoader` defines no `__bool__`, `bool()` falls back to `len(dl) > 0`. For an `IterableDataset`, however, `DataLoader.__len__` delegates to the dataset's `__len__`, which is undefined, so the truth test raises.

The fix is also the same: replace `if dl` with `if dl is not None`.
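
A minimal sketch of the failure mode and the fix (the dataset class here is illustrative):

```python
from torch.utils.data import DataLoader, IterableDataset

class StreamDataset(IterableDataset):
    # typical iterable dataset: no __len__ defined
    def __iter__(self):
        return iter(range(10))

dl = DataLoader(StreamDataset())

# `if dl:` calls bool(dl); DataLoader has no __bool__, so Python falls
# back to len(dl) > 0, and len() raises for an IterableDataset:
try:
    if dl:
        pass
except TypeError as err:
    print(err)

# The fix only tests for presence, never truthiness:
if dl is not None:
    print("dataloader was provided")
```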

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2020-08-13 23:06:17 +02:00
William Falcon 2c935d048e
track batch size (#2954) 2020-08-13 12:40:54 -04:00
Jirka Borovec 4354690e55
add apex test (#2921)
* add apex test

* rename

* level

* events

* wrap

* evt

* miss

* apex (repeated ×6)

* Update tests/models/test_amp.py

Co-authored-by: William Falcon <waf2107@columbia.edu>

* notes

* notes

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-08-13 10:03:13 -04:00
William Falcon 6c5a0a172f
Resultd (#2947)
* updated docs
2020-08-13 09:58:05 -04:00
Jirka Borovec 519b97effd
Clean save (#2933)
* thr
deterministic=True

* clean

* clean

* Apply suggestions from code review

Co-authored-by: Vadym Stupakov <vadim.stupakov@gmail.com>

* Apply suggestions from code review

Co-authored-by: Vadym Stupakov <vadim.stupakov@gmail.com>
2020-08-13 07:26:33 -04:00
William Falcon a46130cdc1
add weighted average to results obj (#2930)
* track batch size in result obj
2020-08-12 08:02:00 -04:00
Brendan Fahy 56396abe98
fix checkpointing to remote file paths (#2925) 2020-08-12 06:31:17 -04:00
William Falcon d13e5c9e53
document lightiningmodule better (#2920)
* updated docs
2020-08-11 19:39:43 -04:00
Brendan Fahy 97e6f35b34
fix missing return statement. Do not normalize remote paths (#2894)
* fix missing return statement. Do not normalize remote paths

* Update pytorch_lightning/utilities/cloud_io.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Add some documentation that we now support s3 and hdfs paths

* suggestion from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2020-08-09 22:38:43 +00:00
Uladzislau Sazanovich e9846dd758
Add tracking of basic states in Trainer [wip - to-be-merged after v0.9] (#2541)
* Add initial tracking of states in Trainer.

* Add INTERRUPTED state, improve tests, move state switching from callback to a trainer.

* Move part of a trainer state switching to a decorator.

* Add documentation.

* Fix docs, rename state enum, restore state to previous on exit if None, add tests for decorator only.

* Fix callback typing.

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-08-09 06:24:09 -04:00
Brendan Fahy 6e77181ec7
Squashed commit of the following: (#2164)
commit 29fb0506cd38a15c359e369cc8bc4435916b0c78
Author: Brendan Fahy <bmfahy@gmail.com>
Date:   Sat Aug 8 19:35:30 2020 +0000

    fix checking for version for docs to build

commit 467fd640db02275972c7111af031c86bb59333e9
Author: Brendan Fahy <bmfahy@gmail.com>
Date:   Sat Aug 8 18:56:05 2020 +0000

    remove no local test

commit a7cc9f88de00feec1a5406874d05313c42bd004c
Author: Brendan Fahy <bmfahy@gmail.com>
Date:   Sat Aug 8 18:46:44 2020 +0000

    fix

commit 3fdbb729da79ae9348c83410a138666bad467951
Author: Brendan Fahy <bmfahy@gmail.com>
Date:   Sat Aug 8 18:23:30 2020 +0000

    revert requirements

commit 9b8686bd83e2bc243cf329e26f1c667c6949cf67
Author: Brendan Fahy <bmfahy@gmail.com>
Date:   Sat Aug 8 18:16:42 2020 +0000

    make it a fixture

commit eec74953d24c8b25268d3b6dde3cc4affdd5cb8f
Author: Brendan Fahy <bmfahy@gmail.com>
Date:   Sat Aug 8 18:01:32 2020 +0000

    fix up the testing

commit 896d94a0e60083d52c81db2a036b7f1e015cad11
Author: Brendan Fahy <bmfahy@gmail.com>
Date:   Sat Aug 8 17:47:28 2020 +0000

    fix some tests

commit 6d22bde19767bf2b71dfd44839b01efdf6888f83
Merge: 6175d4e2 6ebe0d72
Author: Brendan Fahy <bmfahy@gmail.com>
Date:   Sat Aug 8 10:20:47 2020 +0000

    Merge remote-tracking branch 'origin/master' into tb_use_gfile

commit 6175d4e26b15a43c412c26d501762cd0b570616a
Author: Brendan Fahy <bmfahy@gmail.com>
Date:   Fri Aug 7 10:16:36 2020 +0000

    Use tensorboard.compat.gfile to support remote writing
2020-08-09 06:08:44 -04:00
William Falcon 256059a1d0
tracks all outputs including TBPTT and multiple optimizers (#2890)
* pl 0.9 update (repeated ×21)
2020-08-09 06:00:15 -04:00
Adrian Wälchli 1bb268ad8a
Clarify what gpus=0 means in docs (#2876)
* docs clarify what gpus=0 means

* add example suggested by @ydcjeff
2020-08-08 11:50:08 -04:00
Adrian Wälchli f798cffd02
save last model after saving top_k when save_last=True (#2881)
* save_last should be last

* changelog

* seed, docs

* retrigger ci

* compare filenames

* move constants

* fix test

* epoch, global step

* improve test
2020-08-08 06:02:43 -04:00
Jirka Borovec a6e7aa7796
allow using apex with any PT version (#2865)
* wip

* setup

* type

* name

* wip

* docs

* imports

* fix if

* fix if

* use_amp

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* fix tests

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* fix tests

* todos

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-08 11:07:32 +02:00
Santiago Castro fed0ac838b
Fix Trainer arg name in docs (#2879)
* Fix Trainer arg name in docs

* Fix a PR comment
2020-08-08 07:52:35 +02:00
Jirka Borovec b7d72706c3
clean imports (#2867)
* clean imports

* miss
2020-08-08 00:33:51 +02:00
Jirka Borovec f8c058215f
simplify tests & cleaning (#2588)
* simplify

* tmpdir

* revert

* clean

* accel

* types

* test

* edit test acc

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update test acc

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-07 23:22:05 +02:00
Iz Beltagy 2cc60c625e
fix set_epoch on TPUs (#2740)
* fix https://github.com/PyTorchLightning/pytorch-lightning/issues/2622

* Update training_loop.py
2020-08-07 09:31:30 -04:00
William Falcon f82d7feb6c
updated hooks (#2850)
* modified hooks (repeated ×9)
2020-08-07 09:29:57 -04:00
ananthsub b39f4798a6
Add support to Tensorboard logger for OmegaConf hparams (#2846)
* Add support to Tensorboard logger for OmegaConf hparams

Address https://github.com/PyTorchLightning/pytorch-lightning/issues/2844

We check whether omegaconf can be imported and whether the hparams are OmegaConf instances; if so, we use OmegaConf.merge to preserve the typing, so that saving hparams to YAML actually triggers the OmegaConf branch.
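
A minimal sketch of that branch, assuming omegaconf is installed (the helper name is illustrative, not the logger's actual internals):

```python
try:
    from omegaconf import Container, OmegaConf
    OMEGACONF_AVAILABLE = True
except ImportError:
    OMEGACONF_AVAILABLE = False

def save_hparams_to_yaml(path, hparams):
    # Keep OmegaConf containers typed so OmegaConf.save can round-trip
    # them; plain dicts fall through to the ordinary YAML writer.
    if OMEGACONF_AVAILABLE and isinstance(hparams, Container):
        OmegaConf.save(hparams, path, resolve=True)
        return
    import yaml
    with open(path, "w") as fp:
        yaml.dump(dict(hparams), fp)
```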

* avalaible

* chlog

* test

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-08-07 09:13:21 -04:00
Rohit Gupta a642349228
Support limit_mode_batches (int) for infinite dataloader (#2840)
* Support limit_mode_batches(int) for infinite dataloader

* flake8

* revert and update

* add and update tests

* pep8

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Add suggestions by @awaelchli

* docs

* Apply suggestions from code review

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* Apply suggestions from code review

* fix

* max

* check

* add and update tests

* max

* check

* check

* check

* chlog

* tests

* update exception message

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-08-07 13:02:36 +02:00
Nima Sarang 793036d29c
Support returning python scalars in DP (#1935)
* Override the default gather method to support scalars

* add computing average of a list

* bug: change if to elif

* add some tests

* change style

* change documentation

* use apply_to_collection in DP gather

* use apply_to_collection in DP gather

* fix warning msg

* override gather method in DP

* add tests for python scalars

* add python scalars to docstring

* Update message

* override gather method in DP

* formatting

* chlog

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-08-07 09:18:29 +02:00
Nicki Skafte 9a402461da
Bugfix: Lr finder and hparams compatibility (#2821)
* fix hparams lr finder bug

* add tests for new functions

* better tests

* fix codefactor

* fix styling

* fix tests

* fix codefactor

* Apply suggestions from code review

* modified hook

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2020-08-07 00:34:48 +02:00
William Falcon b507c42c47
clarify batch hooks (#2842)
* modified hook (repeated ×13)
2020-08-05 20:01:30 -04:00
Ananya Harsh Jha a5f2b89ed0
updated sync bn (#2838)
* updated sync bn (repeated ×8)

* added ddp_spawn test

* updated test

* clean

* clean

Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-08-06 01:12:11 +02:00
William Falcon 5d0f0325d8
Revert "Support limit_mode_batches (int) for infinite dataloader" (#2839)
* Revert "Support limit_mode_batches (int) for infinite dataloader (#2787)"

This reverts commit de9c9f0864.

* Update training_tricks.py
2020-08-05 15:57:26 -04:00
Ruotian(RT) Luo bef27c58ed
save apex scaler states (#2828) 2020-08-05 13:43:50 -04:00
Ruotian(RT) Luo 6034d5e37d
fix apex gradient clipping (#2829) 2020-08-05 13:42:21 -04:00
Ananya Harsh Jha e31c520c21
add support for sync_bn (#2801)
* initial commit for sync_bn

* updated changelog

* tests

* tests

* ddp tests hanging with script tests

* updated trainer

* updated params

* test

* passing tests (repeated ×4)

* tests

* removed apex

* doc

* doc

* doc

* doc

* docs

* tests

* tests

* tests
2020-08-05 13:29:05 -04:00
Rohit Gupta de9c9f0864
Support limit_mode_batches (int) for infinite dataloader (#2787)
* Support limit_mode_batches(int) for infinite dataloader

* flake8

* revert and update

* add and update tests

* pep8

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Add suggestions by @awaelchli

* docs

* Apply suggestions from code review

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* Apply suggestions from code review

* fix

* max

* check

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-08-05 17:04:49 +00:00
Nicki Skafte b2a7d7580c
Docs for auto_select_gpu (#2836)
* added docs

* Update docs/source/multi_gpu.rst

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* testcode change to example

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2020-08-05 12:28:33 +00:00
Nathan Raw 1c4244e1ff
🐛 fix dm prepare_data call (#2811) 2020-08-03 19:39:01 -04:00
Rohit Gupta 6b9c548bab
docs update and follow up of #2789 (#2797)
* docs update and follow up of #2789

* pep8

* Update trainer.py

* Update trainer.py

Co-authored-by: edenlightning <66261195+edenlightning@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2020-08-03 19:57:21 +00:00
Vlad Lialin ed8a01afb0
More clear docstring for `val_check_interval` (#2802)
* More clear docstring for `val_check_interval`

* Update trainer.py
2020-08-03 09:13:05 -04:00
William Falcon a0c4365278
Gpu idx (#2796)
* ddp refactor
2020-08-02 08:13:31 -04:00
Jirka Borovec b01ad75700
missing chlogs (#2672)
* missing

* miss

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* miss

* note

* notes

* update CI testing with pip upgrade (#2380)

* try pt1.5

* cpu

* upgrade

* tpu

* user

* [blocked by #2380] freeze GPU PT 1.4 (#2780)

* freeze

* user

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-02 12:34:36 +02:00
bkhakshoor 96eb6ebacd
fix shell injection vulnerability in subprocess call (#2786) 2020-08-01 23:25:57 -04:00
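For context on the fix in #2786: the standard remedy for this class of bug is to pass an argument vector instead of a shell-interpolated string (illustrative example, not the actual call site):

```python
import subprocess

user_input = "some value; rm -rf /tmp/x"  # attacker-controlled

# Vulnerable: shell=True lets metacharacters in user_input execute.
# subprocess.check_output(f"echo {user_input}", shell=True)

# Safe: an argument vector is never parsed by a shell.
print(subprocess.check_output(["echo", user_input]))
```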
Rohit Gupta 8baec1a191
Fix shuffle for distributed sampler (#2789)
* Fix shuffle for distributed sampler
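
The underlying detail, sketched with DistributedSampler's real signature (num_replicas/rank are passed explicitly here only so the snippet runs without initializing torch.distributed):

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

train_ds = TensorDataset(torch.arange(100))
val_ds = TensorDataset(torch.arange(20))

# When a DistributedSampler is auto-injected, the train loader should
# get shuffle=True and the eval loaders shuffle=False, rather than a
# single hard-coded value for both.
train_sampler = DistributedSampler(train_ds, num_replicas=2, rank=0, shuffle=True)
val_sampler = DistributedSampler(val_ds, num_replicas=2, rank=0, shuffle=False)
```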

* add test

* test

* chlog

* update test

* update test

* update test

* assertions via callback

* define callback outside for pickling

* skip ddp test on windows

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-01 23:22:57 -04:00
Iz Beltagy 38fce2ea68
fix selecting GPUs using CUDA_VISIBLE_DEVICES (#2739)
* fix https://github.com/PyTorchLightning/pytorch-lightning/issues/2407

* Update pytorch_lightning/trainer/distrib_data_parallel.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-08-01 23:21:15 -04:00
Nathan Raw 036bcea499
Call DataModule hooks implicitly in trainer (#2755)
*  call dm hooks in trainer implicitly

*  update tests

* 📝 remove unused stage arg from dm docs

*  update tests

*  update tests

* 🚧 include stage in datamodule.setup

* 📝 docs

* 📝 docs

* added more dm tests

* added more dm tests

* 🐛 call dm.setup everywhere

* 🔥 pickle tests now implied by accelerator tests

* 🎨 set dm as attr of trainer

* 🐛 .

* 🚧 wip

* add can prepare test

* add can prepare test

* verified setup in fit

* fixed setup call

* fixed setup call

* fixed setup call

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-08-01 20:17:57 -04:00
siahuat0727 78a07e5f2d
Fix doc typo (#2773) 2020-07-31 07:42:47 -04:00
Jirka Borovec 06e8910f06
pytorch 1.6 (#2745)
* pt 1.6

* don't use the new zipfile serialization for now

* quick flake8 fixes

* remove unnecessary f

* coalesce strings

* remove comma

* remove extra commas

* Apply suggestions from code review

Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* set _use_new_zipfile_serialization to False only for pytorch 1.6.0

* remove unnecessary comments

* flake8 fixes

* use pkg_resources instead of packaging

* readme

* format

* version

* chlog

Co-authored-by: Peter Yu <peter@asapp.com>
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>
2020-07-31 11:18:32 +02:00
Jirka Borovec 949734489a
remove deprecated in v0.9 (#2760)
* remove deprecated in v0.9

* data_loader

* import

* hook

* args
2020-07-30 23:19:28 +02:00
Phil 2f0fb34496
Speed up gradient clipping and allow parameters on multiple devices. (#2767)
The speed up is achieved by:
- Moving the "where" out of the loop (and replacing it with min for simplicity).
- Replacing the manual sum and pow with torch.norm. Even though this results
  in unnecessary computation (computing pow(root)), it is still a lot faster.
- Preallocating the output, which gives a slight speed up.

Note that calling .to for all parameters results in a small speed
penalty (~4 ms in my case) but allows parameters on different devices.

Overall this reduces the time spent on gradient clipping from 206 ms to
74 ms for my model (ResNet50 plus a few additional vars, all on GPU).
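
A minimal sketch of the norm-based version described above (illustrative, not the library's exact code):

```python
import torch

def clip_grad_norm_(parameters, max_norm: float, eps: float = 1e-6):
    # Per-parameter L2 norms via torch.norm instead of a manual
    # sum/pow loop; .to() tolerates grads living on different devices.
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return
    device = grads[0].device
    total_norm = torch.norm(
        torch.stack([torch.norm(g.detach()).to(device) for g in grads])
    )
    # min() replaces the old torch.where: the scale is capped at 1.
    clip_coef = torch.clamp(max_norm / (total_norm + eps), max=1.0)
    for g in grads:
        g.mul_(clip_coef.to(g.device))
```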
2020-07-30 11:53:24 -04:00
Tejasvi S Tomar 8ab5bcda3d
Misleading exception raised during batch scaling (#2223)
* Misleading exception raised during batch scaling

Use batch_size from `model.hparams.batch_size` instead of `model.batch_size`
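
A hedged sketch of the lookup order implied by the fix (the helper is hypothetical):

```python
def lookup_batch_size(model):
    # Batch scaling should read the value the user actually tuned:
    # prefer hparams.batch_size, fall back to a plain attribute, and
    # otherwise raise a non-misleading error.
    hparams = getattr(model, "hparams", None)
    if hparams is not None and hasattr(hparams, "batch_size"):
        return hparams.batch_size
    if hasattr(model, "batch_size"):
        return model.batch_size
    raise AttributeError(
        "model must define `batch_size` as an attribute or in `hparams` "
        "for batch-size scaling to work"
    )
```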

* Improvements considering #1896

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-07-29 18:47:11 -04:00
Peter Yu b7f613ba6d
Correct CWD for ddp subprocesses when using Hydra (#2719)
* when hydra is enabled, set the cwd of subprocesses to the original cwd for ddp
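
A sketch of the launch logic, assuming Hydra ≥ 1.0 (`get_original_cwd` and `HydraConfig.initialized` are Hydra's public API; the launcher function is illustrative):

```python
import subprocess

from hydra.core.hydra_config import HydraConfig
from hydra.utils import get_original_cwd

def launch_ddp_worker(command):
    # Hydra chdirs into its run directory, so child processes must be
    # spawned from the original cwd or relative paths break.
    cwd = get_original_cwd() if HydraConfig.initialized() else None
    return subprocess.Popen(command, cwd=cwd)
```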

* move imports up

* clean up imports
2020-07-28 16:33:28 -04:00
Iz Beltagy c047676fae
fix https://github.com/PyTorchLightning/pytorch-lightning/issues/2635 (#2738) 2020-07-28 16:29:46 -04:00
Stas Bekman 2bd39c66af
make the error message readable (#2729)
* make the error message readable

make the error message readable by adding spaces and fixing a typo ("his" -> "this")

* cleanup

* Update pytorch_lightning/trainer/auto_mix_precision.py

* Apply suggestions from code review

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
2020-07-28 16:28:22 -04:00
Jirka Borovec 0fe933e23d
fixing TPU tests (#2632)
* init

* rename

* tpu_core_idx

* idx 8

* idxs

* @pl_multi_process_test

* assert

* assert

* deamon

* no close

* imort

* msg

* use_single_gpu

* dataset

* idx

* fix idx

* dataset

* format

* add pickable

* typo

* apex

* typo

* wip (repeated ×8)

* docs

* typo

* tests (repeated ×17)

* docs

* docs

* Apply suggestions from code review

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Apply suggestions from code review

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* docs

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-07-27 19:07:09 -04:00
Rohit Gupta 84c507c4df
Fix max_batches with fast_dev_run. (#2581)
* Fix fast_dev_run to run for all val_dataloaders

* fast_dev_run check

* changelog

* explicit

* limit_batches with fast_dev_run in init

* add test

* whitespace and comment fix

* comment and assertion

* added tests

* (the bullet sequence above, from "Fix fast_dev_run to run for all val_dataloaders" through "added tests", repeats here after a rebase)

* added tests (repeated ×3 more)

* update rtol

* Revert "update rtol"

This reverts commit 4320329540.

* added tests

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-07-27 17:56:55 -04:00
Adrian Wälchli d03953260d
Fix weights_save_path when logger is used + simplify path handling + better docs (#2681)
* fix weights_save path and drop ckpt_path

* add tests

* unused import

* update docs

* changelog

* pep8

* fix horovod test

* make backward compatible

* perform same test for all loggers

* fix for when logger=False and weights_save_path is set

* update changelog

* update docs

* update tests

* do not set save dir dynamically

* remove duplicate test

* remove duplicated tests

* update tests

* update tests

* remove remaining ckpt_path references

* move defaults to init as suggested by @Borda

* test deprecation
2020-07-27 12:53:11 -04:00
William Falcon afd7f67047
cpu backend (#2712)
* cpu backend

* cpu backend

* cpu backend
2020-07-25 22:55:09 -04:00
William Falcon 0d96b2698a
refactor 4 (#2711)
* refactor ddp_spawn
2020-07-25 22:06:18 -04:00
William Falcon 4dbd761a1c
refactor 3/n (#2709)
* refactor into gpu accelerator (repeated ×21)
2020-07-25 20:56:50 -04:00
William Falcon b34217e410
Refactor 2/n (#2708)
* refactor into gpu accelerator (repeated ×7)
2020-07-25 17:31:34 -04:00
William Falcon 071e09fe38
refactor 1/n for v1.0.0 (#2704)
* refactor into gpu accelerator (repeated ×12)
2020-07-25 14:38:51 -04:00
Nathan Raw 9076551aec
Enable val/test loop disabling + datamodule tests (#2692)
* 🎨 warn instead of error out on loaders

* 🐛 test misconfiguration should still fail

* 🚧 .

* updated docs with new result obj (repeated ×35)

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-07-25 12:57:40 -04:00
Rohit Gupta cb0c6ad51a
fix setup call while testing (#2624)
* fix setup call while testing

* changelog

* drop if condition

* add test to check setup call

* flake8

* update test to check model stage

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-07-24 23:57:31 -04:00
Rohit Gupta 8599b670c5
Fix logging interval (#2694) 2020-07-24 20:59:00 -04:00
Nathan Raw 1caf8beb2c
Datamodule (#2668)
*  Add copy of pl_bolts datamodule to lightning

*  add datamodule to necessary init files

* 🚧 add datamodule property to LightningModule

* 🚧 .

* 🎨 Let DataModule do its own thing

* 🚧 add back setup and run both hooks implicitly

* 🚧 .

* 🐛 fix add_argparse_args

* 💄 apply black formatting and isort

* 📝 docstrings

* 📝 .

* 📝 .

* 🐛 overwrite cls prepare_data instead of instance

* 📝 .

*  add some tests

* Update datamodule.py

* Update datamodule.py

* Update datamodule.py

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-07-24 11:42:15 -04:00
Ananya Harsh Jha 6780214b27
quick-fix for --gpus flag bug (#2674)
* quick-fix for --gpus flag bug

* warning added

* warning added

* set on_gpu using data_parallel_device_ids

* self.on_gpu repositioned

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2020-07-24 08:26:05 +00:00
Travis Addair 1369012bc7
Horovod: adjust base LR used by schedulers to scale with the number of workers (#2626)
* Horovod: Adjust base LR used by schedulers to match that of the optimizer after scaling by number of workers
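
A sketch of that adjustment (assuming horovod is installed; the function is illustrative):

```python
import horovod.torch as hvd

def scale_lrs_for_horovod(optimizer, schedulers):
    # The optimizer LR is multiplied by the worker count, so each
    # scheduler's recorded base_lrs must be scaled identically or the
    # first scheduler step silently resets LR to the unscaled value.
    world_size = hvd.size()
    for group in optimizer.param_groups:
        group["lr"] *= world_size
    for scheduler in schedulers:
        scheduler.base_lrs = [lr * world_size for lr in scheduler.base_lrs]
```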

* Added unit test

* Removed debug statements

* Updated changelog

* Apply suggestions from code review

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-07-23 12:14:57 -04:00
Adrian Wälchli 1e68968ed7
support num_sanity_val_steps=-1 (#2246)
* support sanity_val_step=-1

* fix list size

* simplification

* simplify

* add test for num_sanity_val_steps=-1

* update test

* update docs

* extend tests to multiple dataloaders

* changelog

* Update tests/trainer/test_trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* improve test

* refactor the sanity check decision

* fix merge

* Update trainer.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-07-23 07:07:03 -04:00
William Falcon 62ce00f96c
EvalResult support for val loop (PR 3/5) (#2651)
* add EvalResult to support to val/test loops
2020-07-22 13:53:10 -04:00
William Falcon 6d10ac2ac8
Structured results (train loop only. val loop separate PR) (PR 2/5) (#2615)
* r (repeated ×3)

* patched optimizer closure with sr (repeated ×3)

* added train step structured result (repeated ×20)

* added autoreduce for train step

* added auto reduce on train (repeated ×6)

* added hooks (repeated ×5)

* finished tests for structured results on train epoch (repeated ×85)

* cache

* finished tests for structured results on train epoch (repeated ×5)

* Update pytorch_lightning/callbacks/early_stopping.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/callbacks/early_stopping.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/callbacks/early_stopping.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

* Update pytorch_lightning/core/step_result.py

* finished tests for structured results on train epoch

* finished tests for structured results on train epoch

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* simple

* finished tests for structured results on train epoch

* simple

* simple

* revert

* finished tests for structured results on train epoch

* finished tests for structured results on train epoch

* Update tests/base/deterministic_model.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* finished tests for structured results on train epoch

* docstring typos

* finished tests for structured results on train epoch (repeated ×17)

* Update pytorch_lightning/core/step_result.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/overrides/data_parallel.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2020-07-20 19:00:20 -04:00
Jaesun Park d14d3b2a34
Fix typo (#2646) 2020-07-20 14:13:56 -04:00
ananthsub ed581eb64f
Fix local rank zero casting (#2640)
* Fix local rank zero casting

The environment variable 'LOCAL_RANK' is always read as a string, causing the `if rank_zero_only.rank == 0` check to fail without an explicit cast to int
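
A sketch of the defensive cast (this mirrors the described behavior of the `rank_zero_only` utility; exact internals may differ):

```python
import os
from functools import wraps

def _local_rank() -> int:
    # os.environ values are strings; '0' == 0 is False in Python,
    # so cast before comparing against the integer rank.
    return int(os.environ.get("LOCAL_RANK", 0))

def rank_zero_only(fn):
    @wraps(fn)
    def wrapped(*args, **kwargs):
        if _local_rank() == 0:
            return fn(*args, **kwargs)
    return wrapped
```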

* Update distributed.py

address comment
2020-07-18 20:12:06 -04:00
William Falcon aaa1553e35
tests for val loop flow (#2605)
* add tests for single scalar return from training (repeated ×3)

* fixing val step only (repeated ×22)
2020-07-14 14:20:45 -04:00
William Falcon 1d565e175d
add tests for single scalar return from training (#2587)
* add tests for single scalar return from training (repeated ×5)
2020-07-11 17:43:00 -04:00
William Falcon e068af9ea8
Ampt (#2572)
* remove grad scaling tpu (repeated ×9)
2020-07-09 21:28:11 -04:00
William Falcon f35337adba
Fixes .test() for ddp (#2570)
* enable none checkpoint (repeated ×63)
2020-07-09 18:36:36 -04:00
William Falcon b73812648f
don't pass tpu weights back on test (#2566)
* enable none checkpoint

* enable none checkpoint
2020-07-09 12:11:56 -04:00
William Falcon 4bbcfa04a3
.fit() returns last not best weights in ddp_spawn (#2565)
* added base tests for tpu

* added base tests for tpu

* enable none checkpoint (repeated ×10)
2020-07-09 11:36:21 -04:00
Jan Sellner e1bc208f66
Fix default value in documentation (#2452) 2020-07-09 07:16:50 -04:00
Hayden Housen 992a7e2a41
Start accumulate gradients schedule at epoch 0 (continued) (#2513)
* Start accumulate gradients schedule at epoch 0

* Undo change in #2375

* Update test_trainer.py::test_gradient_accumulation_scheduling

* Fix pep8 formatting

* Remove 'Datasets/' folder

* Split args for readability

* Fix pep8 formatting
2020-07-09 07:11:07 -04:00
Espen Haugsdal b3ebfec863
Fix argparse default value bug (#2526)
* Add failing test for bug

* Fix bug
2020-07-09 07:10:30 -04:00
William Falcon 11069c8784
Fix ddp tests + .test() (#2512)
* added base tests for tpu (repeated ×179)

* fix deprecation warnings

* added base tests for tpu

* added base tests for tpu

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com>

* added base tests for tpu (repeated ×67)

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com>
2020-07-07 12:24:56 -04:00
Jeremy Jordan a91b06ed1e
fix worker warning (#2504)
* fix worker warning

* improve tests

* suggestion

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-07-06 15:45:43 +02:00
vr140 96b32bee04
[tiny] Fix training_dataloader usage to be train_dataloader instead. (#2521)
Co-authored-by: Vijay Rajaram <vrajaram3@gatech.edu>
2020-07-06 10:44:44 +02:00
William Falcon 9924c76faa
Amp2 (#2505)
* fix tpu hang (repeated ×40)
2020-07-04 22:52:49 -04:00
Jirka Borovec fc61c200c0
DDp interpreter (#2482)
* interpreter

* chlog
2020-07-03 13:23:30 -04:00
William Falcon 0697dd306d
Fixes #2455 (#2463) 2020-07-02 07:18:58 -04:00
William Falcon afdfba1dc6
removed auto val reduce (#2462) 2020-07-02 07:04:18 -04:00
Adrian Wälchli 927f305f7e
Warn user when IterableDataset has __len__ defined (#2437)
* add warning when checking len

* added test

* changelog

* pep

* do not show warning below 1.4

* try version parse

* comments

* xfail

* Update requirements/base.txt

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/data_loading.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* version

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-07-01 07:53:19 -04:00
William Falcon 325852c6df
enabled no returns from eval (#2446)
* enabled no returns from eval

* fixed docs (repeated ×12)
2020-07-01 07:38:00 -04:00
Jirka Borovec e268061614
Pure package & base tests (#2418)
* base tests

* pil

* wip

* wip

* wip

* ignore

* ignore

* win

* link

* win

* cpu

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-06-30 19:35:54 -04:00
Adrian Wälchli 145670f893
fix logging on rank 0 only (#2425)
* fix and test for ddp block logging rank > 0

* rename

* use the dummy logger

* dummy logger test

* set the logger in  model

* decorator for rank zero experiment

* simplify check

* simplify

* fix problem with None in checkpoint path

* revert configure logger

* unused import

* offline

* try rank 0 decorator in checkpoint

* try fix test

* imgs

* add asserts to make sure log zero only saves checkpoints (repeated ×5)

* fix tpu tests

* fix tpu tests

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-30 18:09:16 -04:00
William Falcon 04e68f022f fix tpu tests 2020-06-30 17:20:35 -04:00
William Falcon fc26078e39 fix tpu tests 2020-06-30 17:20:18 -04:00
Oliver Neumann 1a54ed6ad9
Checking ipywidgets is installed for ensure tqdm working (#2417)
* Import ipywidgets before importing tqdm.auto to make sure ipywidgets is installed.
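
The guarded import, sketched (tqdm.auto needs ipywidgets to render notebook progress bars; probing it first allows a clean console fallback):

```python
# Fall back to the plain console bar when ipywidgets is missing.
try:
    import ipywidgets  # noqa: F401
    from tqdm.auto import tqdm
except ImportError:
    from tqdm import tqdm

for _ in tqdm(range(3), desc="demo"):
    pass
```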

* Updated CHANGELOG.md

* Updated ipywidgets importing checks to @awaelchli comments.

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-30 16:59:35 -04:00
William Falcon 309ed75c5d
added reduce ddp results on eval (#2434)
* added reduce ddp results on eval (repeated ×11)
2020-06-30 16:15:35 -04:00
William Falcon e8bb4165b7
Fix apex scaling with decoupled backward (#2433)
* fix outputs (repeated ×19)
2020-06-30 14:51:39 -04:00
William Falcon a42a0e16dd
Fixes train outputs (#2428)
* fix outputs

* fix outputs
2020-06-30 10:03:49 -04:00
William Falcon 593837e1da fix amp wrong call 2020-06-29 06:46:19 -04:00
Adrian Wälchli 25ee51bc57
Continue Jeremy's early stopping PR #1504 (#2391)
* add state_dict for early stopping

* move best attr after monitor_op defined

* improve early stopping and model checkpoint callbacks

* fix formatting

* fix attr init order

* clean up setting of default_root_dir attr

* logger needs default root dir set first

* reorg trainer init

* remove direct references to checkpoint callback

* more fixes

* more bugfixes

* run callbacks at epoch end

* update tests to use on epoch end

* PR cleanup

* address failing tests

* refactor for homogeneity

* fix merge conflict

* separate tests

* tests for early stopping bug regressions

* small fixes

* revert model checkpoint change

* typo fix

* fix tests

* update train loop

* cannot pass an int as default_save_path

* refactor log message

* fix test case

* appease the linter

* fix some doctests

* move config to callback

* fixes from rebase

* fixes from rebase

* chlog

* docs

* reformat

* formatting

* fix

* fix

* fixes from rebase

* add new test for patience

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/callbacks/test_early_stopping.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* fix formatting

* remove enable_early_stop attribute

* (the full bullet sequence above, from "add state_dict for early stopping" through "remove enable_early_stop attribute", repeats here after a rebase)

* fix test with new epoch indexing

* fix progress bar totals

* fix off by one error (see #2289) epoch starts at 0 now

* added missing imports

* fix hpc_save folderpath

* fix formatting

* fix tests

* small fixes from a rebase

* fix

* tmpdir

* tmpdir

* tmpdir

* wandb

* fix merge conflict

* add back evaluation after training

* test_resume_early_stopping_from_checkpoint TODO

* undo the horovod check

* update changelog

* remove a duplicate test from merge error

* try fix dp_resume test

* add the logger fix from master

* try remove default_root_dir

* try mocking numpy

* try import numpy in docs test

* fix wandb test

* pep 8 fix

* skip if no amp

* dont mock when doctesting

* install extra

* fix the resume ES test

* undo conf.py changes

* revert remove comet pickle from test

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update weights_loading.rst

* Update weights_loading.rst

* Update weights_loading.rst

* renamed flag

* renamed flag

* revert the None check in logger experiment name/version

* add the old comments

* _experiment

* test chckpointing on DDP

* skip the ddp test on windows

* cloudpickle

* renamed flag

* renamed flag

* parentheses for clarity

* apply suggestion max epochs

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jeremy Jordan <jtjordan@ncsu.edu>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-28 21:36:46 -04:00
William Falcon 66ffbaddf5
updates teardown to account for ddp (#2389)
* remove warnings

* remove warnings

* added doc lines

* added doc lines
2020-06-28 07:01:04 -04:00
Adrian Wälchli d910cc5200
docs: dont mock imports when running sphinx doctest (#2396)
* skip if no amp

* dont mock when doctesting

* install extra
2020-06-27 23:31:06 -04:00
Jirka Borovec 51711c265a
fix loading model with kwargs (#2387)
* test

* fix

* fix
2020-06-27 16:38:03 -04:00
William Falcon 90f641af0d
fixes logger crash on ddp (#2388)
* remove warnings
2020-06-27 15:08:22 -04:00
Jirka Borovec 0be78d13aa
native amp (#2373)
* native amp

* typo

* imports

* apex
2020-06-26 21:45:13 -04:00
Jirka Borovec f1c96930b1
repair CI for Win (#2358)
* no cov

* no cov

* ReduceOp

* group

* reduce_op.sum

* Update sklearns.py

* formatting

* horovod

* Apply suggestions from code review

* horovod

* ci

* print

* ci

* timeout

* timeout

* time

* fix

* distributed cpu

* pipes

* time

* cpu

* spawn

* spawn

* spawn

* tp

* separate

* os

* os

* npm

* Fix load_from_checkpoint() not working with URL on Windows

* Update CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* Apply suggestions from code review

* fix

* fix meta tags creating empty lines

* pyright

* node

* fix httpserver address

* drop tutils.default_trainer_options

* imports

* Better fix for load_from_checkpoint() not working with absolute path on Windows (#2294)

* Fix load_from_checkpoint() not working with URL on Windows

* Update CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* drop duplicate

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: airium <airium@outlook.com>
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: AIRIUM <38249940+airium@users.noreply.github.com>
2020-06-26 21:38:25 -04:00
Jirka Borovec a5f45787ea
fix get dataloader size (#2375)
* get dataloader size

* pyright
2020-06-26 15:38:48 -04:00
William Falcon cbb2427f0d
changed apex level (#2362) 2020-06-25 18:54:32 -04:00
William Falcon 0a092f6683
making optimization steps for hooks (#2363)
* simplified optimizer step and zero grad overriding
2020-06-25 16:02:16 -04:00
William Falcon d22181714a
fix 2333 (#2360) 2020-06-25 11:10:17 -04:00
William Falcon c275e1fc91
swaps lr sched order (#2356)
* swaps lr sched order

* Update optimizers.py

* added amdim encoder choice
2020-06-25 09:21:41 -04:00
William Falcon 598f5140c5
refactor training loop (#2336)
* refactoring training epoch

* refactored training epoch

* refactored training epoch

* refactored training epoch

* refactored training epoch

* refactored training epoch

* fixes slurm weights saving

* fixes slurm weights saving
2020-06-23 23:38:22 -04:00
William Falcon c09b2ffb91
test (#2341)
* fixes rank zero issue
2020-06-23 21:57:45 -04:00
William Falcon a915280427
fixes slurm weights saving (#2339) 2020-06-23 20:16:34 -04:00
Lezwon Castelino 9446390779
fix TPU parsing and TPU tests (#2094)
* added tpu params test

* added tests

* removed xla imports

* added test cases for TPU

* fix pep 8 issues

* refactorings and comments

* add message to MisconfigurationException

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* test if device is set correctly

* added TPU device check
removed mark.spawn

* removed device selection

* remove xla_device call

* readded spawn due to test failures

* add TODO for tpu check

* Apply suggestions from code review

* Apply suggestions from code review

* flake8

* added tpu args to cli tests

* added support for tpu_core selection via cli

* fixed flake formatting

* replaced default_save_path with default_root_dir

* added check for data type for tpu_cores

* fixed flake indent

* protected

* chlog

* fixed tpu cores error

* rebased with latest changes

* flake fix

* Update pytorch_lightning/trainer/distrib_parts.py

added suggestion

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-06-23 12:06:57 -04:00
Adrian Wälchli e085e93dd3
Add missing test for "multiple dataloader + percent_check fix" (#2226)
* Init fix num_batches

* Fix num_batches in case of multiple dataloaders

* Apply suggestions from code review

* Changes based on suggestions

* Flake8

* Add test to check num_batches

* generalize dataloader percent check test

* fix formatting

* remove hparams

* tests

* CHANGELOG

* Update CHANGELOG.md

* max_batches can be int

* conflict and rebase

* add back the test


fix


fix message


0.0 works


Revert "fix message"

This reverts commit 839cacf8b8610f4e697e654ef6f3d2501bf23984.

* update changelog

* Update CHANGELOG.md

* Fix num batches in case of multiple dataloaders and percent_check (#1920)

* git conflict

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* missing union

* doc update suggestion by @rohitgr7

* extend test

* changelog

* docs add note about multiple loaders

* update changelog

* remove unused variable

Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-23 11:21:24 -04:00
Adrian Wälchli bdee1cd106
update docs for "overfit_batches" (#2324)
* update docs

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-23 11:19:38 -04:00
William Falcon 0f073819d3
refactored training_batch + tests to verify correctness (#2328)
* refactored training_batch
2020-06-23 11:17:10 -04:00
Jirka Borovec 4b90b79080
check omegaconf gpus (#2273)
* check omegaconf gpus

* test

* test

* Apply suggestions from code review

Co-authored-by: Omry Yadan <omry@fb.com>

Co-authored-by: Omry Yadan <omry@fb.com>
2020-06-19 23:42:11 -04:00
Jirka Borovec f278ac42c8
Revert/Fix: epoch indexing from 1, to be from 0 (#2289)
* Revert "deprecated: epoch indexing from 1 (#2206)"

This reverts commit f94b919b

* chlog

* grad index

* Apply suggestions from code review

* tests

* fix

* test
2020-06-19 23:39:53 -04:00
thschaaf 554fb4754c
Bugfix/_has_len (#2293)
* deal with NotImplementedError raised by torchtext

* deal with NotImplementedError raised by torchtext

* Added tests for dataloader which raise NotImplementedError in __len__()

* Fixed some typos

Co-authored-by: Thomas Schaaf <tschaaf@cs.cmu.edu>
2020-06-19 23:38:15 -04:00
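A minimal sketch of the guard pattern this fix relies on (a standalone helper, not the library's exact `_has_len`): treat a raised NotImplementedError the same as a missing length.

    def has_len(dataloader) -> bool:
        # Some torchtext iterators raise NotImplementedError from __len__, and
        # IterableDatasets may define no __len__ at all; treat both cases as
        # "no usable length" instead of letting the error propagate.
        try:
            return len(dataloader) > 0
        except (TypeError, NotImplementedError):
            return False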
Jirka Borovec e0b7fed92e
deprecated Trainer proc_rank (#2269)
* deprecated

* test
2020-06-19 15:46:27 -04:00
William Falcon 8d51279703
[refactor results 1] - refactor backward (#2276)
* move backward

* refactor backward to remove 16 bit from user override

* refactor backward to remove 16 bit from user override

* Update pytorch_lightning/core/hooks.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-19 15:44:44 -04:00
Jirka Borovec 54acc79f31
continue 0.8.x (#2264)
* cleaning

* docs

* docs

* types

* mixins

* mixins

* docs

* typo
2020-06-19 11:00:46 -04:00
William Falcon e8f58b5ed6 Merge branch 'master' of https://github.com/PyTorchLightning/pytorch-lightning 2020-06-19 01:11:29 -04:00
William Falcon 3c8c2e3deb fix missing arg 2020-06-19 01:11:22 -04:00
William Falcon a6f94a6f43
remove tpu barrier (#2260) 2020-06-19 00:57:00 -04:00
William Falcon 57d5f6e74a
Barrier (#2257)
* remove barriers
2020-06-19 00:42:20 -04:00
William Falcon b5a2f1ec44
fix setup and on fit calls (#2252) 2020-06-18 21:45:09 -04:00
William Falcon b7fc092bf4
made fx public (#2247)
* made fx public
2020-06-18 20:20:29 -04:00
William Falcon 68a1e52292
added barrier (#2245)
* added barrier

* blank line

* made fx public

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-18 20:15:02 -04:00
Jirka Borovec a2d3ee80ad
final cleanup for v0.8.0 (#2181)
* final clean for v0.8.0

* chlog

* chlog

* date

* rename stage

* date

* missing
2020-06-18 07:21:44 -04:00
William Falcon 476911d60c
Pid port + duplicate rank_zero logging (#2231)
* init the port using a seed that matches process id for ddp

Co-authored-by: Zhaofeng Wu <zfw7@cs.washington.edu>
2020-06-18 00:19:06 -04:00
William Falcon 15cf6a86c2
Tpu logging (#2230)
* add tpu view
2020-06-17 22:45:09 -04:00
William Falcon 34816e9ec4
adds setup+teardown hook (#2229)
* allow regression metrics to import
2020-06-17 19:49:58 -04:00
William Falcon 2411c3be70
replace train_percent_check with limit_train_batches (#2220)
* drop train_percent_check

* chlog

* deprecated

* deprecated

* deprecated

* tests

* tests

* Apply suggestions from code review

* tests

* hydra support

* tests

* hydra support

* hydra support

* hydra support

* tests

* typo

* typo

* Update test_dataloaders.py

* docs

* docs

* docs

* docs

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-17 13:42:28 -04:00
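A short migration sketch for the renamed flag (the values are illustrative, not defaults):

    from pytorch_lightning import Trainer

    # before: Trainer(train_percent_check=0.1)
    trainer = Trainer(limit_train_batches=0.1)  # float: fraction of train batches
    trainer = Trainer(limit_train_batches=100)  # int: absolute number of batches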
Nilansh Rajput 25c7465591
Enable gradients at train start (#2200)
* Enable gradients at train start

* Update training_loop.py

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-17 10:52:58 -04:00
William Falcon 2330d32531
hydra ddp support + generalized ddp gpu flags (#2212)
* hydra ddp support

* hydra support
2020-06-17 10:33:25 -04:00
William Falcon 04c794ca72
[WIP] Rename overfit_pct to overfit_batches (and fix) and val_percent_check and test_percent_check (and fix) (#2213)
* fixed percent check for val/test

* overfit_pct now uses train loaders for val and test and does not shuffle

* add on fit_start on fit_end hooks

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-17 08:03:28 -04:00
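A migration sketch for the rename described above (the value is illustrative):

    from pytorch_lightning import Trainer

    # before: Trainer(overfit_pct=0.01)
    trainer = Trainer(overfit_batches=0.01)
    # while overfitting, val and test reuse the train loader and
    # shuffling is disabled, so the same batches repeat every epoch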
William Falcon 97dfd3a80a
Revert "Misleading exception raised during batch scaling (#1973)" (#2219)
This reverts commit f8103f9c7d.
2020-06-17 08:01:53 -04:00
Tejasvi S Tomar f8103f9c7d
Misleading exception raised during batch scaling (#1973)
* Misleading exception raised during batch scaling

Use batch_size from `model.hparams.batch_size` instead of `model.batch_size`

* Improvements considering #1896

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-17 08:01:04 -04:00
William Falcon e1f238a097
add on fit_start on fit_end hooks (#2217)
* add on fit_start on fit_end hooks
2020-06-17 07:37:16 -04:00
Jirka Borovec e289e45120
test: save hparams to yaml (#2198)
* save hparams to yaml

* import

* resolves

* req

* Update requirements/base.txt

Co-authored-by: Omry Yadan <omry@fb.com>

Co-authored-by: Omry Yadan <omry@fb.com>
2020-06-16 06:34:55 -04:00
Jirka Borovec f94b919b96
deprecated: epoch indexing from 1 (#2206)
* epoch indexing from 1

* chlog

* fix tests

* fix tests

* self.min_epochs
2020-06-16 06:33:41 -04:00
Adrian Wälchli 7dc58bd286
Refactor model summary + generalize example input array (#1773)
* squash

variant a


variant b


add test


revert rename


add changelog


docs


move changelog entry to top


use hooks


wip


wipp


layer summary


clean up, refactor


type hints


rename


remove obsolete code


rename


unused imports


simplify formatting of table and increase readability


doctest


superclass object


update examples


print unknown sizes


more docs and doctest


testing


unknown layers


add rnn test


remove main


restore train mode


test device wip


device


constant


simplify model forward transfer


return summary object in method


extend tests


fix summary for empty module


extend tests


refactor and added hook


variant a


variant b


add test


revert rename


add changelog


docs


move changelog entry to top


remove hardcoded string


simplify


test unknown shapes and all others


comments for tests


fix hparams attribute

* update default

* unused import

* clean up

* replace hardcoded strings

* fix doctest

* fix top/full

* black

* fix rnn test

* fix rnn

* update debugging docs


update docs


typo


update docs


update docs

* add changelog

* extract constant

* setter and getter

* move parity models to test folder

* parameterize mode
2020-06-15 17:05:58 -04:00
Adrian Wälchli 22d9464e56
HenryJia: auto-move data decorator (#1905)
* First attempt at auto-moving data for inference

* Correct my copypaste errors

* Correct for if device is CPU

* Get rid of the WIP code I accidentally added

* Add tests

* Make tests more foolproof

* Make sure we stick with pep8 formatting

* Clarify docs a little

* Apply suggestions from code review

* Get everything working again hopefully

* refactor and added hook


variant a


variant b


add test


revert rename


add changelog


docs

* move changelog entry to top

* Move data transfer to utilities

* Add back in warnings for autotransfer

* Get rid of the test code I ended up accidentally committing again

* Add docs any changelog

* Correct PR number in Changelog

* Correct changelog

* Update data.py

* Update test_cpu.py

* make a decorator

* type hint

* changelog

* changelog

* remove old function

* import

* test for decorator

* fix test

* remove old test

* doctest

* apply decorator directly

* convert doctest to code block

* prevent side effects in tests

* fix merge

* update forward docs

* update docs

* added docs in section "deployment / prediction"

* update changelog

Co-authored-by: Hengjian Jia <henryjia18@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-15 17:04:32 -04:00
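A minimal usage sketch, assuming the `auto_move_data` name and import path introduced by this PR:

    import torch
    from pytorch_lightning import LightningModule
    from pytorch_lightning.core.decorators import auto_move_data

    class LitModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(4, 2)

        @auto_move_data  # inputs are moved to the module's device before forward runs
        def forward(self, x):
            return self.layer(x)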
Peter Yu 37e7582486
Add ckpt_path option to LightningModule.test() (#2190)
* Add ckpt_path option to LightningModule.test()

If ckpt_path is "best" (default), it loads the best weights saved by ModelCheckpoint for the test loop.
If ckpt_path is a path to a checkpoint file, it loads the weights from the file for the test loop.
If ckpt_path is None, it uses the weights from the end of training for the test loop.
If model parameter is set, ckpt_path is ignored.

* Update test_set.rst

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-15 08:02:37 -04:00
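A usage sketch of the three modes described above (`model` stands in for any LightningModule):

    from pytorch_lightning import Trainer

    trainer = Trainer(max_epochs=1)
    trainer.fit(model)                        # model: placeholder LightningModule
    trainer.test()                            # ckpt_path="best" is the default
    trainer.test(ckpt_path="some/file.ckpt")  # load weights from an explicit file
    trainer.test(ckpt_path=None)              # keep the end-of-training weights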
William Falcon 48a76a785d
Performance docs (#2191)
* add workers fix

* add workers fix
2020-06-15 08:02:19 -04:00
Simon-Martin Schröder fd1693e289
Handle KeyboardInterrupt during training (#2134)
* Handle KeyboardInterrupt during training

Fixes #2079.

* chlog

* Fix whitespace

* Update callback_hook.py

* Update base.py

* Update training_loop.py

* Update test_trainer.py

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update CHANGELOG.md

* on_keyboard_interrupt

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-06-15 12:35:26 +02:00
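A minimal sketch of a user callback hooking the new `on_keyboard_interrupt` event (the callback class itself is hypothetical):

    from pytorch_lightning.callbacks import Callback

    class InterruptHandler(Callback):
        def on_keyboard_interrupt(self, trainer, pl_module):
            # runs when Ctrl-C stops training gracefully instead of crashing
            print("training interrupted; loggers and checkpoints remain usable")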
William Falcon bd3a1f7dd4
Fast defaults (#2185)
* log row might be a bottleneck depending on network. 50 unblocks this and is small enough for small datasets

* log row might be a bottleneck depending on network. 50 unblocks this and is small enough for small datasets
2020-06-14 20:17:49 -04:00
Jirka Borovec c0903b800d
past checkpoints (#2160)
* past checkpoints

* omegaConf save

* enforce type

* resolve=True

Co-authored-by: Omry Yadan <omry@fb.com>

* test omegaconf

* tests

* test past

Co-authored-by: Omry Yadan <omry@fb.com>
2020-06-14 11:36:45 -04:00
William Falcon 5fd01b0e68
Finish Ananthsub patch 1 (enable prepare_data from correct processes). clarify local vs global rank (#2166)
* [trainer] Call prepare_data once per node in DDP/DDP2 training

* refactored DDP routes

* renamed proc_rank to local_rank

* spawn message

* fixes

* Update trainer.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2020-06-13 12:00:14 -04:00
William Falcon 9df2b2090d
sets default ddp mode to spawn (#2168)
* set ddp_spawn as default

* spawn message
2020-06-13 03:47:45 -04:00
Jirka Borovec 2674976f2c
remove deprecated API for v0.8 (#2073)
* remove deprecated API

* chlog

* times

* missed

* formatting check

* missing

* missing

* miss

* fix docs build error

* fix pep whitespace error

* docs

* wip

* amp_level

* amp_level

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-06-12 14:37:52 -04:00
Udit Arora 08573d0f7e
Fix some pyright member access errors in training module (#2121)
* Fix pyright member access errors in training module

* Fix Trainer instantiation error due to inheritence order

* Add GH workflow for pyright

* Fix more pyright errors in trainer module

* Add pyrightconfig and setup python environment in type-check workflow

* Exclude pyrightconfig.json

* suggestions

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-06-12 17:23:18 +02:00
Peter Yu 06cd849538
Allow loading checkpoints from urls (#1667)
* allow loading checkpoints from urls

* tmpdir_server fixture

* test cases for loading checkpoints from url

* dir => root_dir

* default map_location to None

* test case for resume_from_checkpoint

* changelog

* doc update

* monkeypatch TORCH_HOME to avoid caching

* Use a threading server with random ports so that it is easier to clean up

* test fixes

* pep8 fix

* ThreadingHTTPServer support in 3.6

* pep8 fix

* fix changelog

* separate tests for urls

* typo

Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-11 17:12:48 -04:00
Pattarawat Chormai 3be557dc5b
document: fix callback signature (#2113) 2020-06-09 07:10:44 -04:00
William Falcon 479ab49d03
temporarily fixes early stopping bug (#2119)
* fixes early stopping bug

* fix docs

* added test
2020-06-08 19:28:26 -04:00
William Falcon 3260e59b27
Adds back the slow spawn ddp implementation that people want (#2115)
* training batch clean up

* adding spawn
2020-06-08 17:55:25 -04:00
William Falcon 0bd7780adc
Fixes CPU and hanging GPU crash (#2118)
* training batch clean up
2020-06-08 16:30:20 -04:00
Jirka Borovec d2967d9305
update hparams, allow OmegaConf (#2047)
* DictConf

* inits

* Apply suggestions from code review

Co-authored-by: Omry Yadan <omry@fb.com>

* wip

* atrib

* added hparams test

* Update test_hparams.py

* pep8

* docs

* wip

* wip

* clean

* review @omry

* Update docs/source/hyperparameters.rst

Co-authored-by: Omry Yadan <omry@fb.com>

Co-authored-by: Omry Yadan <omry@fb.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-08 07:19:34 -04:00
Wah Loon Keng 6e993c608b
correct trainer.fit production example (#2068)
trainer.fit uses the parameter `val_dataloaders` but in the documentation it is `val_dataloader`, which is invalid.
2020-06-04 11:24:12 -04:00
Adrian Wälchli 8211256c46
data transfer model hook (+ refactor) (#1756)
* refactor and added hook


variant a


variant b


add test


revert rename


add changelog


docs

* resolve merge duplication

* overridden typo

* fix test

* tpu id

* raise if TPU not available

* re-use apply_to_collection function for parsing collections

* comment

* make utility function available to user

* documentation

* move changelog entry to top

* fix tpu transfer call

* fix call

* remove hardcoded string

* improve test

* call model hook by default

* Apply suggestions from code review

* rename utility function

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-02 21:45:19 -04:00
Devashish Shankar ade3f36b7a
Raise an error when lightning replaces an existing sampler (#2020)
* Raise an error when lightning replaces an existing sampler

Currently, Trainer replaces the existing sampler with DistributedSampler
if running distributed training and `replace_sampler_ddp=True` (default
behaviour). If a user has configured an existing sampler, this would
lead to widely different results if running a distributed vs
non-distributed training.

This PR fixes this by raising an Error if user has configured a sampler
and uses `replace_sampler_ddp=True`. The recommended behavior from now
on is to either remove the sampler or set `replace_sampler_ddp=False`

* Fix tests

* Simpler fix

* Fix tests

* Make inner method protected

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-02 18:52:04 -04:00
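A sketch of the two configurations recommended above (gpus/backend values are illustrative):

    from pytorch_lightning import Trainer

    # either drop the custom sampler and let Lightning inject DistributedSampler:
    trainer = Trainer(gpus=2, distributed_backend="ddp")
    # or keep the custom sampler and opt out of the replacement explicitly:
    trainer = Trainer(gpus=2, distributed_backend="ddp", replace_sampler_ddp=False)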
Ivan Nazarov e85a646a41
Mistake in parameters' grad norm tracking (#2012)
* fix grad norm formula

* grad-norm tracker test

* fixed seed and explicit rtol in grad norm tracking test

* a docstring for grad-norms and forced cast to float of norm_type

* support for inf-norm

* renamed the grad norm test

* docs

* fixed language in docstring

* Apply suggestions from code review

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-02 18:51:09 -04:00
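A sketch of the corrected formula as a standalone function (not the library's exact code): the total norm is the p-norm of the vector of per-parameter gradient p-norms.

    import torch

    def total_grad_norm(model: torch.nn.Module, norm_type: float = 2.0) -> torch.Tensor:
        # p-norm over the per-parameter gradient p-norms; norm_type=float('inf')
        # reduces to the max over parameters, matching the inf-norm support above
        norm_type = float(norm_type)
        norms = [p.grad.detach().norm(norm_type)
                 for p in model.parameters() if p.grad is not None]
        return torch.stack(norms).norm(norm_type) if norms else torch.tensor(0.0)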
Adrian Wälchli a699003e67
Update/merge multi-gpu docs (#2021)
* merge multi-gpu docs

* extend slurm docs

* update links to elastic

* format docs and type hints in distrib parts

* reference multi-gpu/slurm in trainer args docs

* fix doctest

* typo

* doctest

* Apply suggestions from code review

Co-authored-by: Lucas Vazquez <lucasgouvaz@gmail.com>

* wall time

* Update docs/source/slurm.rst

Co-authored-by: Lucas Vazquez <lucasgouvaz@gmail.com>

* fix title

* update docs for weights summary

* update changelog

Co-authored-by: Lucas Vazquez <lucasgouvaz@gmail.com>
2020-06-02 18:50:08 -04:00
Lezwon Castelino 943c4b20af
slow tpu train (#2033)
* use parallel loader

* Revert "use parallel loader"

This reverts commit ed6e7583

* select tpu id for pl

* condition if tpu_id is None

* added info to changelog

* Revert "condition if tpu_id is None"

This reverts commit 1fb6e586

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-02 18:48:05 -04:00
William Falcon 82a20296e3
Replaces ddp .spawn with subprocess (#2029)
* replace ddp spawn with subprocess

* hot fix
2020-06-01 11:00:32 -04:00
Fabio Natanael Kepler 8b9b923ca8
Keep track of the best model's path saved by ModelCheckpoint (#1799)
* Add an additional attribute to ModelCheckpoint to keep track of the best model's path

Currently, only the best metric value is directly tracked. This new attribute will help in use cases where the trained model needs to be used or tracked right after training.

* Add small description and usage example to docs

* Fix PEP8 issues

* Fix doctest example

* Fix expected output in doctest

* Apply suggestions from code review

* Show example as code block instead of doctest

* Apply suggestions from code review

* Update CHANGELOG.md

* Rename `ModelCheckpoint.best` to `ModelCheckpoint.best_model_score`

Also rename `ModelCheckpoint.best_model` (added in this PR) to `ModelCheckpoint.best_model_path`, for consistency, and `kth_best_model` to `kth_best_model_path`.

* Update pytorch_lightning/trainer/training_io.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Add warning when loading checkpoint from an old version

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-05-31 08:47:13 -04:00
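A usage sketch with the renamed attributes (the monitored key is illustrative):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint

    ckpt_cb = ModelCheckpoint(monitor="val_loss")
    trainer = Trainer(checkpoint_callback=ckpt_cb)
    # after trainer.fit(model):
    print(ckpt_cb.best_model_score)  # best monitored metric value
    print(ckpt_cb.best_model_path)   # path of the checkpoint that achieved it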
Yassine Alouini d8dc0a7228
Few typo correction (#2011) 2020-05-31 00:39:56 -04:00
Justus Schock ceecf1cea9
Graceful shutdown on python interpreter exit (#1631)
* Graceful shutdown on Python interpreter exit

* Update CHANGELOG.md

* Update training_loop.py

* Update training_loop.py

* Update CHANGELOG.md

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* pep8, move to constant

* Update training_loop.py

* Update training_loop.py

* Update training_loop.py

* pep8, move to constant

* pep8

* timeout

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2020-05-29 16:20:04 +02:00
Loïc Grobol c3cf33d1de
Fix root node resolution (#1954) 2020-05-27 22:50:37 -04:00
Mateusz Pieniak 3af4994d5a
Removing unnecessary early stopping calls (#1863)
* Removing unnecessary early stopping calls

* Update CHANGELOG.md

Co-authored-by: Mateusz Pieniak <mateusz.pieniak@evidenceprime.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-05-26 19:06:06 -04:00
Jirka Borovec 5e8c5abf63
fix default arg (#1927)
* fix default

* formatting errors

* update

* flake8
2020-05-26 19:04:42 -04:00
Jirka Borovec ca815698f5
Revert "Remove unused param tpu_core_idx (#1948)" (#1963)
This reverts commit d0ec11b9d6.
2020-05-26 19:02:51 -04:00
William Falcon 460ab5485e
Gen ddp support (#1961)
* updated docs

* added mixed

* added mixed
2020-05-26 19:02:30 -04:00
Rohit Gupta d0ec11b9d6
Remove unused param tpu_core_idx (#1948) 2020-05-25 16:04:53 -04:00
Adrian Wälchli 34237cfcaf
handle unknown args passed to Trainer.from_argparse_args (#1932)
* filter valid args

* error on unknown manual args

* added test

* changelog

* update docs and doctest

* simplify

* doctest

* doctest

* doctest

* better test with mock check for init call

* fstring

* extend test

* skip test on 3.6 not working

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-05-25 16:01:29 -04:00
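A sketch of the intended flow; manually passed kwargs must now be valid Trainer arguments:

    from argparse import ArgumentParser
    from pytorch_lightning import Trainer

    parser = ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    args = parser.parse_args([])
    # unknown manual kwargs now raise; valid ones still override parsed args
    trainer = Trainer.from_argparse_args(args, max_epochs=3)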
William Falcon f46a7bae77
updated docs (#1941) 2020-05-25 15:59:32 -04:00
Federico Baldassarre 65b4352930
early stopping checks on_validation_end (#1458)
* Fixes PyTorchLightning/pytorch-lightning#490

`EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. 
In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`.

* Highlighted that ES callback runs on val epochs in docstring

* Updated EarlyStopping in rst doc

* Update early_stopping.py

* Update early_stopping.rst

* Update early_stopping.rst

* Update early_stopping.rst

* Update early_stopping.rst

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update docs/source/early_stopping.rst

* fix doctest indentation warning

* Train loop calls early_stop.on_validation_end

* chlog

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-05-25 17:33:00 +00:00
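A sketch of the combination this fixes (flag values are illustrative):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor="val_loss", patience=3, strict=True)
    # ES now fires on validation end, so it no longer warns or errors when
    # validation runs only every few training epochs:
    trainer = Trainer(early_stop_callback=early_stop, check_val_every_n_epoch=2)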
Adrian Wälchli 8ca8336ce5
protect progress bar callback (#1855)
* wip protected progress bar settings

* remove callback attr from LRfinder

* whitespace

* changelog
2020-05-25 07:49:23 -04:00
Nicki Skafte a34eb9e169
Fix logger bug and prepare data bug (#1933)
* tests, fix logger bug and prepare data bug

* add CHANGELOG.md

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
2020-05-25 07:43:56 -04:00
William Falcon caa9c6760b
replace Hparams by init args (#1896)
* remove the need for hparams

* replace self.hparams

* fixed

* finished moco

* basic

* testing

* todo

* recurse

* hparams

* persist

* hparams

* chlog

* tests

* review

* saving

* tests

* docs

* finished moco

* hparams

* review

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* hparams

* overwrite

* transform

* transform

* transform

* transform

* cleaning

* cleaning

* tests

* examples

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chp key

* tests

* Apply suggestions from code review

* class

* updated docs

* save

* wip

* fix

* flake8

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-05-24 18:59:08 -04:00
Nicki Skafte 8f6b7a2b4f
Fix user warning produced by apex + scheduler combination (#1873)
* fix user error produced by apex + scheduler combination

* add changelog

* added reinit to every configure_apex call

* fix styling

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
2020-05-22 07:19:37 -04:00
Maxim Grechkin 98f7842970
Allow dataloaders without sampler field present (#1907)
* Allow dataloaders without sampler field present

Sometimes we have a custom dataloader that doesn't have a sampler; it's better to check that the field is there before reading it.

* chlog

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-05-20 20:57:12 +00:00
Kevin Trebing 3459a54667
Changed order of `update_learning_rates()` and `run_training_teardown()`. (#1891) 2020-05-19 13:16:26 -04:00
Rohit Gupta ac76dfcf62
Remove NaNs from loss in LRFinder (#1862)
* Remove NaNs from loss in LRFinder

* np.isfinite

* chlog

* add test

* chlog

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-05-19 08:39:19 +02:00
Ashraful Islam 981169cacc
add warning for shuffling in test/val (#1865) 2020-05-18 09:53:02 -04:00
Lezwon Castelino 7c7e50ca47
Allow user to select individual TPU core to train on (#1729)
* added tpu_id

added tpu_id to mixins

* train on individual tpu

* parallel loader if tpu_id is None

* removed progress_bar_refresh_rate

* chlog

* replaced num_tpu_cores with tpu_cores

* set tpu_id to None if int

* changed num_tpu_cores to tpu_cores in docs

* updated docs

* updated __init__.py
removed self.tpu_id for ParallelLoader

* Update pytorch_lightning/trainer/__init__.py

* check if tpu_cores is a list

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* xla device conditional

* num_tpu_cores deprecation

* removed duplicate warning

* fixed pep8 error

* Revert "removed duplicate warning"

This reverts commit 8adb0a9b

* deprecated api update

* fixed recursion error

* fixed tests

* fixed flake errors

* removed current_tpu_index

* Update CHANGELOG.md

* Update trainer.py

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-05-17 16:30:54 -04:00
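A usage sketch of the new flag:

    from pytorch_lightning import Trainer

    trainer = Trainer(tpu_cores=8)    # train on all eight cores
    trainer = Trainer(tpu_cores=[5])  # a one-element list selects core 5 only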
Fabio Natanael Kepler 8c4c7b105e
Fix `save_weights_only` flag in ModelCheckpoint (#1780)
* Add flag to `dump_checkpoint` for only including weights

`ModelCheckpoint` then passes `self.save_weights_only` to the save function.

* Fix tests and add changelog entry

* Add check and descriptive message when training state is restored from a weights only checkpoint

Also add a test for making sure `ModelCheckpoint.save_weights_only` works as expected.

* Fix weights-only test to properly match expected exception

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-05-17 09:24:17 -04:00
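A minimal sketch of the flag this change fixes:

    from pytorch_lightning.callbacks import ModelCheckpoint

    # weights-only checkpoints omit optimizer/trainer state, so restoring
    # full training state from one is now rejected with a descriptive message
    ckpt_cb = ModelCheckpoint(save_weights_only=True)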
Adrian Wälchli 769a459d27
remove extra kwargs from Trainer init (#1820)
* remove kwargs

* remove useless test

* rename unknown trainer flag

* trainer inheritance and test

* blank line

* test for unknown arg

* changelog
2020-05-17 09:14:54 -04:00
Rohit Gupta 56d521a317
Fix test configuration check and testing (#1804)
* Fix test configuration check and testing

* Fix test configuration check and testing

* Remove check_testing_configuration during test

* Fix docstring

* fix function name

* remove conflicts
2020-05-17 08:22:44 -04:00
Adrian Wälchli 4cdebf9a64
remove obsolete self._device in Trainer (#1849)
* remove unused device attribute

* dtype

* move on_gpu to model
2020-05-17 08:20:51 -04:00
William Falcon b84b02400a
enable any dict and namespace in hparams (#1847) 2020-05-15 15:08:16 -04:00
Justus Schock c05077fae3
Enable non-blocking for gpu device transfer (#1843)
* Update distrib_parts.py

* Update CHANGELOG.md
2020-05-14 17:56:40 -04:00
Jirka Borovec bee0392c37
extend arg parser (#1842)
* extend arg parser

* flake8

* tests

* example

* fix test
2020-05-14 17:56:11 -04:00
Nicki Skafte 88f816ed06
dummy logger (#1836)
Co-authored-by: Nicki Skafte <nugginea@gmail.com>
2020-05-14 10:34:11 -04:00
William Falcon 53d9316a56
fixes ddp bugs (#1819)
* debug
2020-05-13 19:17:04 -04:00
William Falcon 648d516668
Use store_true for bool args (#1822)
*  Use store_true for bool args

* debug

Co-authored-by: Nate Raw <nxr9266@g.rit.edu>
2020-05-13 19:12:06 -04:00
Ashwin Bharambe 0e71705a0a
[checkpoint logic] Fix bug which doesn't account for NoneType for `model.hparams` (#1817)
The intention of the code is to output a warning message when `hparams`
is null or not set. Instead the code now fatals when
`model.hparams = None`. Prevent that.
2020-05-13 17:14:11 -04:00
Nicki Skafte 663b90035c
Bugfix: accumulation and suggestion for learning rate finder (#1801)
* fix suggestion being too naive

* fix accumulation error and added new tests

* fix styling

* update CHANGELOG.md

* update based on review

* fix tests

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-05-13 14:40:44 -04:00
Ashwin Bharambe aefc5314bc
[ddp] Support multi-node distributed execution under torchelastic (#1811)
The changes are quite local and limited in nature -- viz., checking for
some indicator environment variables. We check for (SLURM_LOCALID,
NODE_RANK, GROUP_RANK) in order. If multiple are found set, a warning is
logged.

This patch also fixes a minor bug with comparing the `WORLD_SIZE`
environment variable. This can be a string type.
2020-05-13 14:06:59 -04:00
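A sketch of the environment-variable resolution described above (standalone code, not the patch itself):

    import os

    # the first indicator found wins; the patch logs a warning if several are set
    node_rank = 0
    for var in ("SLURM_LOCALID", "NODE_RANK", "GROUP_RANK"):
        if var in os.environ:
            node_rank = int(os.environ[var])
            break

    # WORLD_SIZE arrives as a string, so cast before comparing
    world_size = int(os.environ.get("WORLD_SIZE", "1"))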
So Uchida 22d7d03118
Replace meta_tags.csv with hparams.yaml (#1271)
* Add support for hierarchical dict

* Support nested Namespace

* Add docstring

* Migrate hparam flattening to each logger

* Modify URLs in CHANGELOG

* typo

* Simplify the conditional branch about Namespace

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update CHANGELOG.md

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* added examples section to docstring

* renamed _dict -> input_dict

* meta_tags.csv -> hparams.yaml

* code style fixes

* add pyyaml

* remove unused import

* create the member NAME_HPARAMS_FILE

* improve tests

* Update tensorboard.py

* pass the local test w/o the relevant parts of Horovod

* formatting

* update dependencies

* fix dependencies

* Apply suggestions from code review

* add savings

* warn

* docstrings

* tests

* Apply suggestions from code review

* saving

* Apply suggestions from code review

* use default

* remove logging

* typo fixes

* update docs

* update CHANGELOG

* clean imports

* add blank lines

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* back to namespace

* add docs

* test fix

* update dependencies

* add space

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-05-13 15:05:15 +02:00
Jirka Borovec 10ce1c0256
device property (#1791)
* device property

* add/copy properties

* inherit

* rename

* Apply suggestions from code review

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* dtype

* prop

* pt api

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
2020-05-12 23:18:39 -04:00
Adrian Wälchli 8978794730
add missing flag (#1805) 2020-05-12 17:06:38 -04:00
Oliver Neumann 9059d21042
Missing profiler attribute in add_argparse_args() ArgumentParser (#1794)
* Fixed the typing annotation by adding a boolean type; with that, the profiler flag is added to argparse.

* Updated CHANGELOG.md

* Updated git_init_arguments_and_types() to pass doctests.

* Added doctest example to add_argparse_parser()
2020-05-12 08:53:26 -04:00
kumuji 619f984c36
Option to provide seed to random generators to ensure reproducibility (#1572)
* Option to provide seed to random generators to ensure reproducibility

I added a small function in utilities which imports torch, numpy, and Python
random, and sets the seed for all of these libraries to ensure reproducibility
of results.

* Apply recommendations from core contributors on seeding

1. Moved the seeding code to another file
2. Make deterministic as a parameter for trainer class
3. Add assertions for seeding numpy
4. Added warnings
5. torch.manual_seed should be enough for seeding torch

* Revert "Apply recommendations from core contributors on seeding"

This reverts commit a213c8e6882eec8a9e7408b9418926d2db7c5461.

* Revert "Revert "Apply recommendations from core contributors on seeding""

This reverts commit 59b2da53c62878de7aab0aa3feb3115e105eea06.

* Change in test, for correct seeding

* Allow seed equal to 0

* Allow seed to be uint32.max

* Added deterministic to benchmarks

* Cuda manual seed as in benchmark seeding

* Seeding should be done before model initialization

* cuda manual_seed is not necessary

* Fixing seed test_cpu_lbfgs

On some seeds it seems that LBFGS doesn't converge,
so I fixed the seed during testing.

* rebasing issue with old reproducibility.py

* Improved documentation and ability to seed before initializing the Trainer class

* Change in docs

* Removed seed from trainer, update for documentation

* Typo in the docs

* Added seed_everything to _all_

* Fixing old changes

* Model initialization should be earlier than Trainer

* Update pytorch_lightning/trainer/__init__.py

From Example to testcode

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Fixing according to the contributors suggestions

* Moving horovod deterministic to Trainer class

* deterministic flag affects horovod docs update

* Improved static typing

* Added deterministic to test runners of horovod

It is failing on some versions, not very predictable

* static seeds for horovod tests

* Change for reset_seed function in tests

* Seeding horovod using reset_seed from tutils

* Update pytorch_lightning/trainer/__init__.py

* chlog

* Update trainer.py

* change "testcode" to "Example" in trainer init documentation

* Update pytorch_lightning/trainer/seed.py, first line in comment

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-05-12 07:53:20 -04:00
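A usage sketch (the model class is a placeholder; seed before model init, as the commits note):

    from pytorch_lightning import Trainer, seed_everything

    seed_everything(42)   # seeds python `random`, numpy and torch
    # model = MyModel()   # placeholder: construct the model *after* seeding
    trainer = Trainer(deterministic=True)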
William Falcon 10b16dbfab
made ddp the default if no backend specified with multiple GPUs (#1789)
* made ddp the default if no backend specified with multiple GPUs

* fix

* spawn

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2020-05-12 06:54:23 -04:00
Travis Addair acab068c74
Join Horovod workers at the end of trainer.fit() to prevent race conditions following training (#1786)
* Join Horovod workers at the end of trainer.fit() to prevent race conditions following training

* flake8

* flake8

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2020-05-12 09:15:25 +00:00
William Falcon 7b60d49432
fixed native amp + ddp (#1788)
* fixed native amp + ddp

* fixed native amp + ddp
2020-05-12 00:25:06 -04:00
Jeremy Jordan 1df0d2dc97
set logger level for package (#1718)
* move logging config to trainer class init

* alternate logging config
2020-05-12 00:14:35 -04:00
William Falcon 4b30ef6480
Device (#1790)
* added self.device

* added docs
2020-05-12 00:09:48 -04:00
Kevin Chen de1fdd8d3b
Removed test_dataloader call in check_testing_model_configuration (#1670)
* Removed test_dataloader call

* Check if test_dataloader is actually overridden

* Fixed method spelling

* Replaced lambdas

* Replaced None with super method

* Fixed testpass
2020-05-12 00:08:07 -04:00
William Falcon 5bb6b41b78
dataloaders with fast_dev_run (#1787)
* dataloaders with fast_dev_run

* fix

* pep 8
2020-05-11 23:32:44 -04:00
Fabio Natanael Kepler d120f97896
Fix saving native AMP scaler state (#1777)
Saving was introduced in #1561.
2020-05-11 21:38:37 -04:00
William Falcon eeb411144f
enable fast_dev_run without a validation loop (#1779)
* fix val dataloader

* Update evaluation_loop.py
2020-05-11 11:30:22 -04:00
Alexander Kreuzer ee17c7c9c8
Fixed error message and test docstring (#1698)
training_dataloader -> train_dataloader

Co-authored-by: Alexander Kreuzer <alexander.kreuzer@sap.com>
2020-05-10 13:16:16 -04:00
Nicki Skafte 4970927ec8
Feature: auto scale batch size (#1638)
* auto batch finder

* fix styling

* add description

* add different modes

* fix copy paste error

* better organised code

* fix styling

* add tests

* fix

* fix

* add some documentation

* added CHANGELOG.md

* some documentation

* update based on review

* Update trainer.py

* Update docs/source/training_tricks.rst

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/test_trainer_tricks.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/test_trainer_tricks.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* use EvalModelTemplate

* param tests

* rename

* wrap params

* rename function

* rename

* rename param

* fix

* abs

* rename

* refactor code

* add docs

* try

* arg

* loop

* except

* loop

* drop bool

* docs

* docs

* added check and test for passing dataloader to fit

* styling fix

* update based on review

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2020-05-09 08:28:36 -04:00
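A usage sketch, assuming the 'power' mode added in this PR (the model must expose a tunable batch size):

    from pytorch_lightning import Trainer

    # 'power' keeps doubling the batch size until a run hits out-of-memory
    trainer = Trainer(auto_scale_batch_size="power")
    # trainer.fit(model)  # placeholder model with a batch_size attribute/hparam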
Adrian Wälchli 25bbd059df
Also update progress_bar in training_epoch_end (#1724)
* update prog. bar metrics on train epoch end

* changelog

* wip test

* more thorough testing

* comments

* update docs

* move test

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2020-05-08 23:31:56 -04:00
Yuri Brovman 3a642601e8
added warning for None dataloader (#1745)
* added warning for None dataloader

* fixed variable style

* updated warning message

* remove unused import

Co-authored-by: ybrovman <ybrovman@ebay.com>
2020-05-07 09:26:41 -04:00
Shunta Komatsu f656882942
Fix typo (#1750) 2020-05-07 09:25:54 -04:00
Pavel Grunt b9364f96b1
lr_finder: Fix typo in docstring (#1746) 2020-05-06 12:39:22 -04:00
Peter Yu 851866333c
Attach version_ to checkpoint path only if version is int (#1748) 2020-05-06 12:38:32 -04:00
Yuri Brovman 35bbe178bd
fix _reset_eval_dataloader() for IterableDataset (#1560)
* removed if dl from _reset_eval_dataloader()

* changed to if dl != None to be more safe

* hints from pep8speaks

Co-authored-by: ybrovman <ybrovman@ebay.com>
2020-05-05 14:09:48 -04:00
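A minimal reproduction of why the truthiness check broke:

    from torch.utils.data import DataLoader, IterableDataset

    class Stream(IterableDataset):
        def __iter__(self):
            return iter(range(3))

    dl = DataLoader(Stream())
    # `if dl:` calls bool(dl), which falls back to len(dl) and raises
    # TypeError for an IterableDataset with no __len__; the identity
    # check never touches __len__:
    if dl is not None:
        print("dataloader is configured")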
Tian Wang d6a0375974
Fixing logic (#1734) 2020-05-05 14:07:26 -04:00
Jirka Borovec 2a2f303ae9
Tests: refactor trainer dataloaders (#1690)
* refactor default model

* drop redundant seeds

* refactor dataloaders tests

* fix multiple

* fix conf

* flake8

* Apply suggestions from code review

Co-authored-by: William Falcon <waf2107@columbia.edu>

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-05-05 12:31:15 -04:00
Travis Addair f90afa29b8
Fix disabling progress bar on non-zero ranks using Horovod backend (#1709)
* Fix Horovod backend to disable progress bar on all ranks except 0

* Add join barriers

* Added changelog

* Make protected and add verbosity

* Refactor to disable progress bar callback in train

* Removed vebose setting

* Add cache check for Horovod

* Test run again

* Updated comment

* Always skip cache for Horovod

* Only reinstall when necessary

* Added separate step

* Fixed spacing

* Skip Python 3.8
2020-05-04 13:02:57 -04:00
Nicki Skafte e865b046b1
Bugfix/lr finder (#1676)
* fix early stopping bug

* allow val dataloader

* update CHANGELOG.md

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
2020-05-04 11:38:51 -04:00
Adrian Wälchli d28b145393
Update type hints for multiple dataloaders in .fit() and .test() (#1723)
* update typehints

* change log
2020-05-04 08:24:34 -04:00
Adrian Wälchli e6b34ef90d
[WIP] Reduction when batch size < num gpus (#1609)
* reduce if <= num_gpus

* add test with explanation

* chlog

* fix changelog

Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
2020-05-02 11:01:44 -04:00
Travis Addair 2950f66983
Fix Horovod distributed backend to set the root_gpu property (#1669)
* params

* drop acc

* Fix Horovod distributed backend to set the root_gpu

* Fixed test

* Fixed tests

* Fixed lint

* Set root_gpu during initialization

* chlog

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2020-05-01 14:13:35 -04:00
Nathan Breitsch 3eac6cfd4f
Don't convert namedtuple to tuple (#1589)
* Don't convert namedtuple to tuple

* Test namedtuples sent to device correctly
2020-04-30 08:04:50 -04:00
William Falcon d40425d257
added warning to crash (#1625)
* added warning to crash

* formatting

Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
2020-04-30 08:04:18 -04:00
Jacob Zhong f9c9e39ab8
Add log output for slurm (#1657)
* add log output for slurm

* change log levels

* formatting

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-04-30 07:58:03 -04:00
Peter Yu 8d564b5e38
call on_load_checkpoint() when resuming from checkpoint (#1666) 2020-04-30 07:57:24 -04:00
Dan Campbell 981758fd04 Only call load_spawn_weights if COLAB_GPU or KAGGLE_URL_BASE
environment variables are set
2020-04-27 19:46:29 -04:00
William Falcon 4c4d0e6bc6
Merge pull request #1635 from PyTorchLightning/pkl
Fixes CPU DDP breaking change and DDP change
2020-04-27 07:42:42 -04:00
William Falcon cebc74f7bc ddp pickle 2020-04-27 07:19:12 -04:00
William Falcon 2181ad1bc7 ddp pickle 2020-04-27 07:12:04 -04:00
William Falcon d2989f76e6 ddp pickle 2020-04-27 07:08:34 -04:00
William Falcon d5dff384eb ddp pickle 2020-04-27 07:07:03 -04:00
William Falcon 4aac5568a6 ddp pickle 2020-04-27 07:05:29 -04:00
William Falcon 6e86b59d21 ddp pickle 2020-04-27 06:51:42 -04:00
Justus Schock f91b131ba2
Remove warning 2020-04-27 11:41:03 +02:00
William Falcon 879d879985 fix hparams issue 2020-04-26 17:27:45 -04:00
William Falcon 4755ded863
Clean up Argparse interface with trainer (#1606)
* fixed distutil parsing

* fixed distutil parsing

* Apply suggestions from code review

* log

* fixed distutil parsing

* doctest

* fixed hparams section

* formatting

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
2020-04-26 09:20:06 -04:00
William Falcon b620d86c54
disable val and test shuffling (#1600)
* disable val and test shuffling

* log

* condition

* shuffle

* refactor

Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
2020-04-25 16:45:20 -04:00
William Falcon 791ba91dec
slurm job id (#1605) 2020-04-25 16:01:15 -04:00
Jirka Borovec 58a467dd68
model checkpoint on rank_zero_only & global rank state (#1408)
* try delete in async or DDP use-case

* changelog

* add model chekpoint rank

* simple delete

* flake8

* use global rank

* changelog

* fix review

* fix import

* proposal

* improve proposal (fix problems with method call self)

* cleaning

Co-authored-by: Adrian Wälchli <adrian.waelchli@students.unibe.ch>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-04-24 17:21:00 -04:00
William Falcon d0faf97893
fixed dataset stuff + docs (#1599)
* Fixed dataset docs and disabled auto-sampler for iterable dataset
2020-04-24 16:51:26 -04:00
Jirka Borovec 570b2c7aeb
fix deprecated call (#1596)
* fix parity

* update deprecated call
2020-04-24 14:45:43 -04:00
William Falcon cd15bfc3ce
fixed new amp bugs (#1593) 2020-04-24 09:29:39 -04:00