Commit Graph

41 Commits

William Falcon cd16aa9854
ref: checkpoint connector methods 4/n (#3474)
* ref: checkpoint connector methods 4/n
2020-09-12 08:42:27 -04:00
William Falcon 4724cdf5e0 ref: checkpoint connector methods 3/n 2020-09-12 07:05:21 -04:00
William Falcon a208d6da46
ref: organize args 2/n (#3448)
* ref: organize args 2/n
2020-09-10 10:51:35 -04:00
Adrian Wälchli e245065fbc
limit auto scaling batch size to the size of the training dataset (#3271)
* fix

* fix and test

* fix merge error

* test for max dataset size

* changelog

* update docs

* fix merge

* unused imports

* imports
2020-09-09 10:51:43 +02:00
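
A minimal sketch of the cap this PR describes, with a hypothetical helper name; the real logic lives inside the batch size scaler:

```python
def adjust_batch_size(current_size: int, dataset_size: int, factor: float = 2.0) -> int:
    """Grow the batch size, but never beyond the number of training samples."""
    return min(int(current_size * factor), dataset_size)
```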
William Falcon b76d9e5dd5
Refa22 (#3388)
* ref: inner train loop (intermediate step) 20/n

* ref: inner train loop (intermediate step) 21/n
2020-09-07 16:45:31 -04:00
William Falcon 38b9677638
ref: inner train loop (intermediate step) 5/n (#3365) 2020-09-05 18:27:28 -04:00
Adrian Wälchli 48c22c8bad
update batch size in DataModule when auto scaling batch size (#3266)
* fix datamodule hasattr

* fix patch check

* fix setattr

* update docs

* revert patch fix

* changelog

* fix datamodule passed in as fit arg

* docs

* set datamodule batch size in lightning_setattr

* fix merge

* check with has_attr

* access datamodule via trainer

* pass fit args down to tuner

* docs

* fix typos in docs

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2020-09-03 22:07:49 +02:00
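
A simplified sketch of the `lightning_setattr` idea the bullets mention: write the tuned batch size everywhere it may live (model, hparams, and the datamodule reached via the trainer). The details here are illustrative, not the actual utility:

```python
def lightning_setattr(model, name, value):
    datamodule = getattr(getattr(model, 'trainer', None), 'datamodule', None)
    for holder in (model, getattr(model, 'hparams', None), datamodule):
        # only touch objects that actually expose the attribute
        if holder is not None and hasattr(holder, name):
            setattr(holder, name, value)
```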
LiJiezhi 0112355055
Update training_tricks.py (#3151)
* Update training_tricks.py

* pep

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2020-08-26 07:57:34 +00:00
Adrian Wälchli 7b917de946
fix setting batch_size attribute in batch_size finder (finishing PR #2523) (#3043)
* lightning attr fix

* revert refactor

* create test

* separate test

* changelog update

* tests

* revert

* Update pytorch_lightning/trainer/training_tricks.py

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-08-19 19:01:55 -04:00
Jirka Borovec 4354690e55
add apex test (#2921)
* add apex test

* rename

* level

* events

* wrap

* evt

* miss

* apex

* apex

* apex

* apex

* apex

* apex

* Update tests/models/test_amp.py

Co-authored-by: William Falcon <waf2107@columbia.edu>

* notes

* notes

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-08-13 10:03:13 -04:00
Jirka Borovec a6e7aa7796
allow using apex with any PT version (#2865)
* wip

* setup

* type

* name

* wip

* docs

* imports

* fix if

* fix if

* use_amp

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* fix tests

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* fix tests

* todos

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-08-08 11:07:32 +02:00
Jirka Borovec b7d72706c3
clean imports (#2867)
* clean imports

* miss
2020-08-08 00:33:51 +02:00
Ruotian(RT) Luo 6034d5e37d
fix apex gradient clipping (#2829) 2020-08-05 13:42:21 -04:00
Rohit Gupta de9c9f0864
Support limit_mode_batches (int) for infinite dataloader (#2787)
* Support limit_mode_batches (int) for infinite dataloader

* flake8

* revert and update

* add and update tests

* pep8

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Add suggestions by @awaelchli

* docs

* Apply suggestions from code review

Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* Apply suggestions from code review

* fix

* max

* check

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>
Co-authored-by: Jirka Borovec <jirka@pytorchlightning.ai>
2020-08-05 17:04:49 +00:00
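
A usage sketch of the behaviour this PR adds, assuming an infinite `IterableDataset` with no `__len__`: a float fraction cannot be resolved for such a loader, but an int caps the loop at an absolute batch count:

```python
import torch
from torch.utils.data import IterableDataset
from pytorch_lightning import Trainer

class Stream(IterableDataset):
    def __iter__(self):
        while True:              # infinite stream, no __len__
            yield torch.rand(32)

trainer = Trainer(limit_val_batches=50)  # run exactly 50 val batches
```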
Phil 2f0fb34496
Speed up gradient clipping and allow parameters on multiple devices. (#2767)
The speed up is achieved by:
- Moving the "where" out of the loop (and replacing with min for simplicity).
- Replacing manual sum and pow with torch.norm. Even though this results
  in unnecessary computation (computing pow(root)), this is still a lot
  faster.
- Preallocating the output gives a slight speed up.

Note that calling .to for all parameters results in a small speed
penalty (~4 ms in my case) but allows parameters on different devices.

Overall this reduces the time used for gradient clipping from 206 ms to
74 ms for my model (ResNet50 + a few additional vars, all vars on GPU).
2020-07-30 11:53:24 -04:00
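
A sketch of the optimized clipping described above (a hypothetical standalone function, not the exact Lightning code): per-parameter norms go into a preallocated tensor via `torch.norm`, the branchy `where` becomes a clamp, and `.to()` tolerates parameters on several devices:

```python
import torch

def clip_grad_norm(parameters, max_norm, norm_type=2.0, eps=1e-6):
    params = [p for p in parameters if p.grad is not None]
    device = params[0].grad.device
    norms = torch.empty(len(params), device=device)  # preallocated output
    for i, p in enumerate(params):
        # .to(device) costs a few ms but allows multi-device parameters
        norms[i] = torch.norm(p.grad.detach(), norm_type).to(device)
    total_norm = torch.norm(norms, norm_type)
    # clamp/min replaces torch.where: never scale gradients up
    clip_coef = torch.clamp(max_norm / (total_norm + eps), max=1.0)
    for p in params:
        p.grad.detach().mul_(clip_coef.to(p.grad.device))
    return total_norm
```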
Tejasvi S Tomar 8ab5bcda3d
Misleading exception raised during batch scaling (#2223)
* Misleading exception raised during batch scaling

Use batch_size from `model.hparams.batch_size` instead of `model.batch_size`

* Improvements considering #1896

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-07-29 18:47:11 -04:00
William Falcon 071e09fe38
refactor 1/n for v1.0.0 (#2704)
* refactor into gpu accelerator
2020-07-25 14:38:51 -04:00
Hayden Housen 992a7e2a41
Start accumulate gradients schedule at epoch 0 (continued) (#2513)
* Start accumulate gradients schedule at epoch 0

* Undo change in #2375

* Update test_trainer.py::test_gradient_accumulation_scheduling

* Fix pep8 formatting

* Remove 'Datasets/' folder

* Split args for readability

* Fix pep8 formatting
2020-07-09 07:11:07 -04:00
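
For context, the schedule this fix touches is a dict keyed by epoch; a hedged usage sketch:

```python
from pytorch_lightning import Trainer

# With epoch indexing starting at 0, a key of 0 is now honored:
# no accumulation for epochs 0-3, accumulate 4 batches from epoch 4 on.
trainer = Trainer(accumulate_grad_batches={0: 1, 4: 4})
```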
Adrian Wälchli 25ee51bc57
Continue Jeremy's early stopping PR #1504 (#2391)
* add state_dict for early stopping

* move best attr after monitor_op defined

* improve early stopping and model checkpoint callbacks

* fix formatting

* fix attr init order

* clean up setting of default_root_dir attr

* logger needs default root dir set first

* reorg trainer init

* remove direct references to checkpoint callback

* more fixes

* more bugfixes

* run callbacks at epoch end

* update tests to use on epoch end

* PR cleanup

* address failing tests

* refactor for homogeneity

* fix merge conflict

* separate tests

* tests for early stopping bug regressions

* small fixes

* revert model checkpoint change

* typo fix

* fix tests

* update train loop

* cannot pass an int as default_save_path

* refactor log message

* fix test case

* appease the linter

* fix some doctests

* move config to callback

* fixes from rebase

* fixes from rebase

* chlog

* docs

* reformat

* formatting

* fix

* fix

* fixes from rebase

* add new test for patience

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/callbacks/test_early_stopping.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* fix formatting

* remove enable_early_stop attribute

* fix test with new epoch indexing

* fix progress bar totals

* fix off by one error (see #2289) epoch starts at 0 now

* added missing imports

* fix hpc_save folderpath

* fix formatting

* fix tests

* small fixes from a rebase

* fix

* tmpdir

* wandb

* fix merge conflict

* add back evaluation after training

* test_resume_early_stopping_from_checkpoint TODO

* undo the horovod check

* update changelog

* remove a duplicate test from merge error

* try fix dp_resume test

* add the logger fix from master

* try remove default_root_dir

* try mocking numpy

* try import numpy in docs test

* fix wandb test

* pep 8 fix

* skip if no amp

* dont mock when doctesting

* install extra

* fix the resume ES test

* undo conf.py changes

* revert remove comet pickle from test

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update weights_loading.rst

* Update weights_loading.rst

* Update weights_loading.rst

* renamed flag

* renamed flag

* revert the None check in logger experiment name/version

* add the old comments

* _experiment

* test chckpointing on DDP

* skip the ddp test on windows

* cloudpickle

* renamed flag

* renamed flag

* parentheses for clarity

* apply suggestion max epochs

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jeremy Jordan <jtjordan@ncsu.edu>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-28 21:36:46 -04:00
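
A usage sketch under the 0.8-era API (the `early_stop_callback` argument is assumed from that era; check the docs of your version):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3, mode='min')
trainer = Trainer(early_stop_callback=early_stop)
# With state_dict()/load_state_dict() on the callback (added here), the
# wait counter and best score survive resuming from a checkpoint.
```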
William Falcon 2411c3be70
replace train_percent_check with limit_train_batches (#2220)
* drop train_percent_check

* chlog

* deprecated

* tests

* tests

* Apply suggestions from code review

* tests

* hydra support

* tests

* hydra support

* tests

* typo

* typo

* Update test_dataloaders.py

* docs

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-17 13:42:28 -04:00
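
A before/after sketch of the renamed flag; a float means a fraction of the loader, an int an absolute number of batches:

```python
from pytorch_lightning import Trainer

# Deprecated by this PR: Trainer(train_percent_check=0.25)
trainer = Trainer(limit_train_batches=0.25, limit_val_batches=10)
```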
William Falcon 97dfd3a80a
Revert "Misleading exception raised during batch scaling (#1973)" (#2219)
This reverts commit f8103f9c7d.
2020-06-17 08:01:53 -04:00
Tejasvi S Tomar f8103f9c7d
Misleading exception raised during batch scaling (#1973)
* Misleading exception raised during batch scaling

Use batch_size from `model.hparams.batch_size` instead of `model.batch_size`

* Improvements considering #1896

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-06-17 08:01:04 -04:00
Udit Arora 08573d0f7e
Fix some pyright member access errors in training module (#2121)
* Fix pyright member access errors in training module

* Fix Trainer instantiation error due to inheritance order

* Add GH workflow for pyright

* Fix more pyright errors in trainer module

* Add pyrightconfig and setup python environment in type-check workflow

* Exclude pyrightconfig.json

* suggestions

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-06-12 17:23:18 +02:00
William Falcon caa9c6760b
replace Hparams by init args (#1896)
* remove the need for hparams

* replace self.hparams

* fixed

* finished moco

* basic

* testing

* todo

* recurse

* hparams

* persist

* hparams

* chlog

* tests

* review

* saving

* tests

* docs

* finished moco

* hparams

* review

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* hparams

* overwrite

* transform

* cleaning

* cleaning

* tests

* examples

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chp key

* tests

* Apply suggestions from code review

* class

* updated docs

* save

* wip

* fix

* flake8

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-05-24 18:59:08 -04:00
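
A minimal sketch of the pattern this PR moves to, assuming the `save_hyperparameters()` helper introduced around this change:

```python
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    # plain __init__ args replace the monolithic hparams Namespace
    def __init__(self, lr: float = 1e-3, hidden_dim: int = 128):
        super().__init__()
        self.save_hyperparameters()  # records init args for checkpointing

model = LitModel(lr=3e-4)
print(model.hparams.lr)  # args remain reachable under .hparams
```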
Nicki Skafte 88f816ed06
dummy logger (#1836)
Co-authored-by: Nicki Skafte <nugginea@gmail.com>
2020-05-14 10:34:11 -04:00
William Falcon 5bb6b41b78
dataloaders with fast_dev_run (#1787)
* dataloaders with fast_dev_run

* fix

* pep 8
2020-05-11 23:32:44 -04:00
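
A one-line usage sketch of the flag this PR fixes:

```python
from pytorch_lightning import Trainer

# runs a single batch through train and val, now via the real dataloaders,
# as a quick end-to-end smoke test
trainer = Trainer(fast_dev_run=True)
```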
Nicki Skafte 4970927ec8
Feature: auto scale batch size (#1638)
* auto batch finder

* fix styling

* add description

* add different modes

* fix copy paste error

* better organised code

* fix styling

* add tests

* fix

* fix

* add some documentation

* added CHANGELOG.md

* some documentation

* update based on review

* Update trainer.py

* Update docs/source/training_tricks.rst

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update tests/trainer/test_trainer_tricks.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/trainer/test_trainer_tricks.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* use EvalModelTemplate

* param tests

* rename

* wrap params

* rename function

* rename

* rename param

* fix

* abs

* rename

* refactor code

* add docs

* try

* arg

* loop

* except

* loop

* drop bool

* docs

* docs

* added check and test for passing dataloader to fit

* styling fix

* update based on review

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2020-05-09 08:28:36 -04:00
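
A hedged usage sketch, assuming the two modes this PR's bullets mention ('power' doubling the size until the first OOM, 'binsearch' refining between the last good and the failing size):

```python
from pytorch_lightning import Trainer

trainer = Trainer(auto_scale_batch_size='binsearch')
# the found size is written back to the model's batch_size attribute
```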
William Falcon 29ebe92208
support for native amp (#1561)
* adding native amp support

* autocast

* removed comments

* removed comments

* added state saving

* added state saving

* try install amp again

* added state saving

* drop Apex reinstall

Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-04-23 14:47:08 -04:00
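
Under the hood, native AMP is PyTorch's `torch.cuda.amp`; a self-contained sketch of the pattern (not Lightning's exact code):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_step(model, optimizer, batch, target):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # run ops in fp16 where safe
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()          # scale to avoid fp16 underflow
    scaler.step(optimizer)                 # unscales grads, then steps
    scaler.update()
    # scaler.state_dict() is what the "added state saving" commits persist
```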
Jonas-Jaeger e02146943d
Removed redundant computations in clip_gradients that slowed down the gradient clipping. (#1523)
Fixes #1522
2020-04-18 23:07:15 -04:00
Alex Sergeev 8dd9b80d7a
Fix gradient clipping (#1438)
* Fix gradient clipping

* Relax accuracy constraint
2020-04-09 21:08:28 -04:00
Adrian Wälchli 732eaee4d7
nan detection and intervention (#1097)
* check for nan values

* test nan detection on loss

* sys.exit

* whitespace

* detect nan and inf values in loss and params

* update

* added documentation

* moved detect nan to training loop, remove flag for print

* blank line

* test

* rename

* deprecate print_nan_grads

* deprecated print_nan_grads

* remove unused imports

* update changelog

* fix line too long

* correct deprecated version

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* raise exception instead of sysexit

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* raise exception instead of sysexit

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/training_tricks.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/training_tricks.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* fix test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-03-19 09:24:45 -04:00
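
A sketch of the two checks, with hypothetical function names; per the review suggestions above, they raise instead of calling `sys.exit`:

```python
import torch

def detect_nan_loss(loss: torch.Tensor) -> None:
    if not torch.isfinite(loss).all():
        raise ValueError(f'The loss returned in `training_step` is {loss}.')

def detect_nan_parameters(model: torch.nn.Module) -> None:
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            raise ValueError(f'Detected nan/inf values in `{name}`.')
```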
Jacob Zhong 1a73fa0b03
change default logger to dedicated one (#1064)
* Fix test

* Fix format

* Update pytorch_lightning/__init__.py

* Separate imports
2020-03-17 18:44:00 -04:00
Jirka Borovec 514d182b7f
cleaning imports (#1032) 2020-03-12 12:41:37 -04:00
William Falcon 4c5e82c065
Skepticleo trainer argparser (#1023)
* Added default parser for trainer and class method to construct trainer from default args

* Removed print statement

* Added test for constructing Trainer from command line args

* Removed extra line

* Removed redundant imports, removed whitespace from empty lines

* Fixed typo

* Updated default parser creation to get class attributes automatically

* Updated default parser creation to get class attributes automatically

* Added method to get default args for trainer

* Trimmed trainer get default args method

* Updated from argparse method to not return trainer with static arguments

* Update trainer get default args to classmethod

* adjustment

* fix

* Fixed variable name

* Update trainer.py

* Update test_trainer.py

* Update trainer.py

* Update tests/trainer/test_trainer.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update trainer.py

* Update test_trainer.py

* Update trainer.py

* Update test_trainer.py

* Update tests/trainer/test_trainer.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/trainer.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update trainer.py

* Update test_trainer.py

Co-authored-by: Mudit Tanwani <mudittanwani@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-03-03 09:32:15 -05:00
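
A usage sketch, using the method names as they eventually landed in Lightning (this PR seeded the feature; the exact names evolved during review):

```python
from argparse import ArgumentParser
from pytorch_lightning import Trainer

parser = ArgumentParser()
parser = Trainer.add_argparse_args(parser)  # every Trainer arg becomes a flag
args = parser.parse_args()
trainer = Trainer.from_argparse_args(args)
```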
Jirka Borovec 7beed7cae6
Trainer cleanup (#934)
* Trainer cleanup

* update abstract

* remove ...

* remove __init__

* update mixin types

* update callbacks

* fix

* lower test acc
2020-02-27 16:21:14 -05:00
srush 27a3be0287
TPU gradient clipping. (#963)
* clip

* Update pytorch_lightning/trainer/training_tricks.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/training_tricks.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* pull out epsilon

* add fp16 case

* Update pytorch_lightning/trainer/training_tricks.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-02-27 15:46:47 -05:00
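
A sketch of the "pull out epsilon" and "add fp16 case" bullets (constants and function are illustrative): the divisor gains a floor large enough to stay representable at half precision:

```python
import torch

EPSILON = 1e-6
EPSILON_FP16 = 1e-5  # looser floor so fp16 math cannot blow up

def clip_coefficient(total_norm: torch.Tensor, max_norm: float, fp16: bool) -> torch.Tensor:
    eps = EPSILON_FP16 if fp16 else EPSILON
    return max_norm / (total_norm + eps)
```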
Hadrien Mary be244560b2
Callbacks [wip] (#889)
* Add callback system + associated test

* Add trainer and pl_module args to callback methods

* typing

* typo in docstring

* Switch to on_.*_start()

* fix on_test_start

* fix the mess after rebasing
2020-02-25 23:17:27 -05:00
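
A minimal sketch of the callback interface as described in the bullets, with every hook receiving the trainer and the LightningModule:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback

class PrintingCallback(Callback):
    def on_train_start(self, trainer, pl_module):
        print('training starts')

    def on_test_start(self, trainer, pl_module):
        print('testing starts')

trainer = Trainer(callbacks=[PrintingCallback()])
```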
Vadim Bereznyuk edd4a87fb0
Refactor callbacks (#776)
* Refactor callbacks

* flake8

* Update docstrings

* Simplified callback, protected trainer

* .set_trainer() check

* update docs

* missed super().__ini__()

* Updated tests

* Use uppercase

* refine checkpoint callback tests

* Added test_begin() and test_end()
2020-02-16 00:03:05 -05:00
Jirka Borovec 76a1c67d87
rename logging -> loggers (#767)
* move logging >> loggers

* add warning

* fix tests

* logging alias

* formatting

* formatting
2020-02-01 15:47:58 -05:00
Jirka Borovec ea59a99426 update org paths & convert logos (#685)
* fix typos

* update org paths

* update links from READMe to docs

* add svg logo

* add svg logo-text

* update logos

* testing temp paths

* prune links from readme

* optimize imports

* update logo

* update paths in README

* missing imports
2020-01-20 14:50:31 -05:00
Jirka Borovec 1d4b6be17b rename trainer modules, drop `_mixin` (#571)
* rename trainer modules, drop _mixin

* fix imports
2019-12-04 11:39:14 -05:00