Commit Graph

2605 Commits

Author SHA1 Message Date
Adrian Wälchli 25ee51bc57
Continue Jeremy's early stopping PR #1504 (#2391)
* add state_dict for early stopping

* move best attr after monitor_op defined

* improve early stopping and model checkpoint callbacks

* fix formatting

* fix attr init order

* clean up setting of default_root_dir attr

* logger needs default root dir set first

* reorg trainer init

* remove direct references to checkpoint callback

* more fixes

* more bugfixes

* run callbacks at epoch end

* update tests to use on epoch end

* PR cleanup

* address failing tests

* refactor for homogeneity

* fix merge conflict

* separate tests

* tests for early stopping bug regressions

* small fixes

* revert model checkpoint change

* typo fix

* fix tests

* update train loop

* cannot pass an int as default_save_path

* refactor log message

* fix test case

* appease the linter

* fix some doctests

* move config to callback

* fixes from rebase

* fixes from rebase

* chlog

* docs

* reformat

* formatting

* fix

* fix

* fixes from rebase

* add new test for patience

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/callbacks/test_early_stopping.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* fix formatting

* remove enable_early_stop attribute

* add state_dict for early stopping

* move best attr after monitor_op defined

* improve early stopping and model checkpoint callbacks

* fix formatting

* fix attr init order

* clean up setting of default_root_dir attr

* logger needs default root dir set first

* reorg trainer init

* remove direct references to checkpoint callback

* more fixes

* more bugfixes

* run callbacks at epoch end

* update tests to use on epoch end

* PR cleanup

* address failing tests

* refactor for homogeneity

* fix merge conflict

* separate tests

* tests for early stopping bug regressions

* small fixes

* revert model checkpoint change

* typo fix

* fix tests

* update train loop

* fix test case

* appease the linter

* fix some doctests

* move config to callback

* fixes from rebase

* fixes from rebase

* chlog

* docs

* reformat

* formatting

* fix

* fix

* fixes from rebase

* add new test for patience

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/callbacks/test_early_stopping.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* fix formatting

* remove enable_early_stop attribute

* fix test with new epoch indexing

* fix progress bar totals

* fix off by one error (see #2289) epoch starts at 0 now

* added missing imports

* fix hpc_save folderpath

* fix formatting

* fix tests

* small fixes from a rebase

* fix

* tmpdir

* tmpdir

* tmpdir

* wandb

* fix merge conflict

* add back evaluation after training

* test_resume_early_stopping_from_checkpoint TODO

* undo the horovod check

* update changelog

* remove a duplicate test from merge error

* try fix dp_resume test

* add the logger fix from master

* try remove default_root_dir

* try mocking numpy

* try import numpy in docs test

* fix wandb test

* pep 8 fix

* skip if no amp

* dont mock when doctesting

* install extra

* fix the resume ES test

* undo conf.py changes

* revert remove comet pickle from test

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update weights_loading.rst

* Update weights_loading.rst

* Update weights_loading.rst

* renamed flag

* renamed flag

* revert the None check in logger experiment name/version

* add the old comments

* _experiment

* test chckpointing on DDP

* skip the ddp test on windows

* cloudpickle

* renamed flag

* renamed flag

* parentheses for clarity

* apply suggestion max epochs

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jeremy Jordan <jtjordan@ncsu.edu>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-28 21:36:46 -04:00
Jirka Borovec 1e16681693
fix loading with hparams (#2403)
* fix #2386

* extra test

* extra case

* extra test

* chlog

* fix test
2020-06-28 20:22:03 -04:00
Adrian Wälchli 058c500300
fix when torchtext not installed (#2402) 2020-06-28 20:03:51 -04:00
Jirka Borovec 861a73be12
fix loading past checpoints (#2405)
* fix #2334

* chlog
2020-06-28 17:20:33 -04:00
William Falcon 66ffbaddf5
updates teardown to account for ddp (#2389)
* remove warnings

* remove warnings

* added doc lines

* added doc lines
2020-06-28 07:01:04 -04:00
Adrian Wälchli d910cc5200
docs: dont mock imports when running sphinx doctest (#2396)
* skip if no amp

* dont mock when doctesting

* install extra
2020-06-27 23:31:06 -04:00
Jirka Borovec 75f0a2062c
move torchtext as optional (#2395)
* torchtext

* Update pytorch_lightning/utilities/apply_func.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update apply_func.py

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-06-27 20:15:10 -04:00
Jirka Borovec 51711c265a
fix loading model with kwargs (#2387)
* test

* fix

* fix
2020-06-27 16:38:03 -04:00
Mateusz Pieniak e82d9cdb66
Support torchtext on a single GPU (#2379)
* Handle torchtext.data.Batch on GPU

* Update CHANGELOG.md

* Apply code review requests

* Correct the docs

* Change requirements
2020-06-27 16:36:45 -04:00
Jirka Borovec 73a78a13c7
CI: partial move from CircleCI (#2378)
* move from CircleCI

* req

* tex

* tex

* sudo

* extra

* recom

* pic

* dvipng
2020-06-27 16:25:33 -04:00
William Falcon 90f641af0d
fixes logger crash on ddp (#2388)
* remove warnings

* remove warnings

* remove warnings

* remove warnings

* remove warnings

* remove warnings

* remove warnings

* remove warnings

* remove warnings

* remove warnings
2020-06-27 15:08:22 -04:00
Jirka Borovec 41f5df18a4
move Trains logger to Bolts (#2384)
* move Trains logger

* chlog
2020-06-27 09:14:05 -04:00
Jirka Borovec 4e13e419ea
add CLI test for examples (#2285)
* cli examples

* ddp

* CI

* CI

* req

* tests

* skip DDP

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-27 09:13:29 -04:00
Jirka Borovec 6673fc9a0b
fix docker builds (#2383) 2020-06-27 08:49:19 -04:00
Jirka Borovec 2f739f5977
fix key typo (#2374) 2020-06-26 21:46:08 -04:00
Kshitij09 20d0f53896
Fix ModelCheckpoint example (#2321)
`save_top_k` should be an `int` and have been mentioned as `save_top_k=True` in the snippet provided under 'Saving and Loading Weights' docs. Changed it to its default value (1) to make it consistent.

Signed-off-by: Kshitij Patil <kshitijpatil98@gmail.com>
2020-06-26 21:45:41 -04:00
Jirka Borovec 0be78d13aa
native amp (#2373)
* native amp

* typo

* imports

* apex
2020-06-26 21:45:13 -04:00
Jirka Borovec f1c96930b1
repair CI for Win (#2358)
* no cov

* no cov

* ReduceOp

* group

* reduce_op.sum

* Update sklearns.py

* formatting

* horovod

* Apply suggestions from code review

* horovod

* horovod

* horovod

* horovod

* ci

* print

* ci

* timeout

* timeout

* time

* fix

* distributed cpu

* pipes

* time

* cpu

* spawn

* spawn

* spawn

* tp

* separate

* os

* os

* npm

* Fix load_from_checkpoint() not working with URL on Windows

* Update CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* Apply suggestions from code review

* fix

* fix meta tags creating empty lines

* pyright

* node

* fix httpserver address

* drop tutils.default_trainer_options

* imports

* Better fix for load_from_checkpoint() not working with absolute path on Windows (#2294)

* Fix load_from_checkpoint() not working with URL on Windows

* Update CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* drop duplicate

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: airium <airium@outlook.com>
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: AIRIUM <38249940+airium@users.noreply.github.com>
2020-06-26 21:38:25 -04:00
Jirka Borovec a5f45787ea
fix get dataloader size (#2375)
* get dataloader size

* pyright
2020-06-26 15:38:48 -04:00
Thomas Schaaf 7c0a3f4745
Bugfix/_has_len (#2307)
* deal with NotImplementedError raised by torchtext

* deal with NotImplementedError raised by torchtext

* Added tests for dataloader which raise NotImplementedError in __len__()

* Fixed some typos

* enabled tests for dataloader raising NotImplementedError in __len__ and corrected match string for raised exception

* deleted empty line for style compliance

* refactored CustomNotImplementedErrorDataloader to derive from CustomInfDataloader

* enabled reduced number of not_implemented_error dataloader test to reduce runtime for continuous integration

* reduced test number of not_implemented_error dataloader test further to reduce test time

* reduced test number of not_implemented_error dataloader test to one to reduce test time

* disabled all not_implemented_error dataloader test to see if test pass in time

* added __next__ with a reduced number (5) of elements after which CustomNotImplementedErrorDataloader stops to speedup test.

* enabling all not_implemented_error dataloader test

* added brief description of change and relation of torchtext

* CustomNotImplementedErrorDataloader reduced number of batches served to 2.

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Apply suggestions from code review

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Disable parallelism in dataloader

Suspect that it might cause pytest to hang more frequent

* added max_steps=None to Trainer in not_implemented_error dataloader tests

* rearranged not_implemented_error test in file to group them together

* disabled parallel data loading
Reason: testing if that stops the test framework from hanging.

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Thomas Schaaf <tschaaf@cs.cmu.edu>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-26 09:31:08 -04:00
William Falcon cbb2427f0d
changed apex level (#2362) 2020-06-25 18:54:32 -04:00
William Falcon 0a092f6683
making optimization steps for hooks (#2363)
*simplified optimizer step and zero grad overriding
2020-06-25 16:02:16 -04:00
William Falcon d22181714a
fix 2333 (#2360) 2020-06-25 11:10:17 -04:00
William Falcon f2710bb500
adds tensorboard hparams logging test (#2342)
* fixes hparam logging

* fixes hparam logging

* fixes hparam logging

* fixes hparam logging

* fixes hparam logging

* Apply suggestions from code review

* skipif

* rename

* Update test_tensorboard.py

* Update test_tensorboard.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-06-25 09:22:28 -04:00
William Falcon c275e1fc91
swaps lr sched order (#2356)
* swaps lr sched order

* Update optimizers.py

* added amdim encoder choice
2020-06-25 09:21:41 -04:00
davinnovation b6ab7ca121
[docs] add community example : pl + ms nni (#2340)
https://github.com/PyTorchLightning/pytorch-lightning/issues/2329
2020-06-24 23:13:49 -04:00
Adrian Wälchli 220bb6db57
remove wrong annotation (#2349) 2020-06-24 22:29:26 -04:00
Adrian Wälchli 9b2e60530f
Python logging level docs (#2348)
* docs about Python logging

* add link to Python logging docs
2020-06-24 22:29:01 -04:00
David Waterworth cc07dcae96
corrected example usage of save_hyperparameters from List[str] to seperate str (#2353)
Co-authored-by: David Waterworth <david.waterworth@cim.io>
2020-06-24 22:28:38 -04:00
Adrian Wälchli aab9e77d2d
Fix lost compatibility with custom datatypes implementing `.to` (#2335)
* generalize data transfer

* added test

* update docs

* fix spelling error

* changelog

* update docs
2020-06-23 23:41:02 -04:00
William Falcon 598f5140c5
refactor training loop (#2336)
* refactoring training epoch

* refactored training epoch

* refactored training epoch

* refactored training epoch

* refactored training epoch

* refactored training epoch

* fixes slurm weights saving

* fixes slurm weights saving
2020-06-23 23:38:22 -04:00
William Falcon c09b2ffb91
test (#2341)
* fixes rank zero issue
2020-06-23 21:57:45 -04:00
William Falcon a915280427
fixes slurm weights saving (#2339) 2020-06-23 20:16:34 -04:00
Lezwon Castelino 9446390779
fix TPU parsing and TPU tests (#2094)
* added tpu params test

* added tests

* removed xla imports

* added test cases for TPU

* fix pep 8 issues

* refactorings and comments

* add message to MisconfigurationException

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* test if device is set correctly

* added TPU device check
removed mark.spawn

* removed device selection

* remove xla_device call

* readded spawn due to test failures

* add TODO for tpu check

* Apply suggestions from code review

* Apply suggestions from code review

* flake8

* added tpu args to cli tests

* added support for tpu_core selection via cli

* fixed flake formatting

* replaced default_save_path with default_root_dir

* added check for data type for tpu_cores

* fixed flake indent

* protected

* protected

* added tpu params test

* added tests

* removed xla imports

* test if device is set correctly

* added support for tpu_core selection via cli

* replaced default_save_path with default_root_dir

* added check for data type for tpu_cores

* chlog

* fixed tpu cores error

* rebased with latest changes

* flake fix

* Update pytorch_lightning/trainer/distrib_parts.py

added suggesstion

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-06-23 12:06:57 -04:00
Adrian Wälchli e085e93dd3
Add missing test for "multiple dataloader + percent_check fix" (#2226)
* Init fix num_batches

* Fix num_batches in case of multiple dataloaders

* Apply suggestions from code review

* Changes based on suggestions

* Flake8

* Add test to check num_batches

* generalize dataloader percent check test

* fix formatting

* remove hparams

* tests

* CHANGELOG

* Update CHANGELOG.md

* max_batches can be int

* conflict and rebase

* add back the test


fix


fix message


0.0 works


Revert "fix message"

This reverts commit 839cacf8b8610f4e697e654ef6f3d2501bf23984.

* update changelog

* Update CHANGELOG.md

* Fix num batches in case of multiple dataloaders and percent_check (#1920)

* git conflict

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* missing union

* doc update suggestion by @rohitgr7

* extend test

* changelog

* docs add note about multiple loaders

* update changelog

* remove unused variable

Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-23 11:21:24 -04:00
Siavash Sakhavi 44385bb582
Checking if the parameters are a DictConfig Object (#2216)
* Checking if the parameters are a DictConfig Object

This is in reference to #2058 . 

To be honest, I have no idea how I should go about writing a test for this.

* Update pytorch_lightning/loggers/base.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* fix ...

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-06-23 17:20:44 +02:00
Adrian Wälchli bdee1cd106
update docs for "overfit_batches" (#2324)
* update docs

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-23 11:19:38 -04:00
William Falcon 0f073819d3
refactored training_batch + tests to verify correctness (#2328)
* refactored training_bath

* refactored training_bath

* refactored training_bath

* refactored training_bath

* refactored training_bath

* refactored training_bath

* refactored training_bath

* refactored training_bath

* refactored training_bath

* refactored training_bath

* refactored training_bath
2020-06-23 11:17:10 -04:00
Tri Dao 29179dbfcc
Fix ROC metric for CUDA tensors (#2304)
* Fix ROC metric for CUDA tensors

Previously roc metric (and auroc) errors when passed in CUDA tensors,
due to torch.tensor construction without specifying device.
This fixes the error by using F.pad instead.

* Update test_classification.py

* Update test_classification.py

* chlog

* Update test_classification.py

* Update test_classification.py

* Update tests/metrics/functional/test_classification.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update test_classification.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-23 15:19:16 +02:00
elias-ramzi 92f122e0df
Fix average_precision metric (#2319)
* Fixed average_precision metric, parenthesis were missing. Added test test that failed with the old implementation

* Modified CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-23 13:21:00 +02:00
Rezyapkin-Vyacheslav 63bd0582e3
fix typo in forward return (#2301) 2020-06-21 15:54:17 -04:00
Adrian Wälchli f972ab3a82
Fix summary hook handles not getting removed (#2298)
* detach hooks after completion

* detach hook

* update docs

* add test

* docs

* changelog
2020-06-20 07:38:47 -04:00
Jirka Borovec c7f8367650
devel version (#2292) 2020-06-19 23:42:57 -04:00
Jirka Borovec 4b90b79080
check omegaconf gpus (#2273)
* check omegaconf gpus

* test

* test

* Apply suggestions from code review

Co-authored-by: Omry Yadan <omry@fb.com>

Co-authored-by: Omry Yadan <omry@fb.com>
2020-06-19 23:42:11 -04:00
Jirka Borovec 7ecb0d2528
test CLI parsing gpus (#2284)
* cli gpus

* test

* test
2020-06-19 23:41:42 -04:00
Rohit Gupta b96dd21d69
Update new project code sample (#2287) 2020-06-19 23:41:03 -04:00
Jirka Borovec f278ac42c8
Revert/Fix: epoch indexing from 1, to be from 0 (#2289)
* Revert "deprecated: epoch indexing from 1 (#2206)"

This reverts commit f94b919b

* chlog

* grad index

* Apply suggestions from code review

* tests

* fix

* test
2020-06-19 23:39:53 -04:00
thschaaf 554fb4754c
Bugfix/_has_len (#2293)
* deal with NotImplementedError raised by torchtext

* deal with NotImplementedError raised by torchtext

* Added tests for dataloader which raise NotImplementedError in __len__()

* Fixed some typos

Co-authored-by: Thomas Schaaf <tschaaf@cs.cmu.edu>
2020-06-19 23:38:15 -04:00
Paweł Biernat 3256fe4e5a
Update progress.py (#2268)
Fixes a minor bug introduced in #2213
2020-06-19 15:47:39 -04:00
Jirka Borovec e0b7fed92e
deprecated Trainer proc_rank (#2269)
* deprecated

* test
2020-06-19 15:46:27 -04:00