Commit Graph

589 Commits

Author SHA1 Message Date
Ethan Harris 741c452551
Fix disabled grads after call to predict (#6657) 2021-03-23 23:07:48 +01:00
thomas chaton fd5cb7fcc3
Add PyTorch 1.8 Profiler 5/5 (#6618)
* Refactor profilers

* Update PassThrough

* WIP - This is broken and will change

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* resolve tests

* resolve tests

* find output

* try something

* update

* add support for test and predict

* update

* update

* use getattr

* test

* test

* update

* tests

* update

* update

* update

* update

* update

* remove file

* update

* update

* update

* update

* update

* test

* update

* update

* update tests

* update

* add support for 1.8

* rename records

* add support for 1.8

* update

* resolve flake8

* resolve test

* Refactor basic profilers

* Fixes

* Unused import

* Introduce setup

* Profile on all ranks. Print to stdout on 0

* Introduce dirpath + filename

* CHANGELOG

* Add tests. Address comments

* add `on_run_stage_setup`

* add on_run_stage_setup function

* update

* add test for RegisterRecordFunction

* update lightning flow direction

* move variable to private

* remove trace

* Undo code that should be in 3/4

* Multi-stage multi-rank

* 2/5 changes

* Pass stage in __del__

* Remove TODOs

* Describe on_evaluation_end. Add tests

* Typo

* Address comments

* deepcopy tests

* Advanced teardown

* Fix teardown test

* Fix tests

* Minor change

* Update CHANGELOG.md

* Fix test

* Quick fixes

* Fix 6522

* resolve ddp tests

* resolve tests

* resolve some tests

* update tests

* resolve tests

* update

* resolve tests

* resolve some tests

* Missed fixes from 3/5

* Fixes

* resolve some tests

* resolve test for 1.7.1

* Broken refactor

* Missed stage

* Minor changes

* resolve tests

* Update CHANGELOG

* resolve bug

* remove print

* Typo

* Cleanup

* resolve ddp test

* remove barrier

* update profiler

* update

* Smaller model

* update

* resolve tests

* update

* Minor changes. CHANGELOG

* Minimize diff

* update to 1.8.1

* RunIf. Extra code. Check segfault

* resolve tests

* Typo. Bad merge

* Fixing a bad merge

* replace for kineto

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Minor changes

* Bad merge

* Use lists for flexibility

* Use sets

* predict_step

* Ananth's suggestion

* update

* Docs

* Update pl_examples/basic_examples/profiler_example.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update example

* update example

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-23 20:43:21 +00:00
Carlos Mocholí 51b10f78f4
Refactor PyTorch profiler 4/5 (#6349)
Co-authored-by: thomas chaton <thomas@grid.ai>
2021-03-23 18:13:29 +01:00
Jirka Borovec a74909affa
prune metrics: info retrieval (#6649) 2021-03-23 15:05:32 +00:00
Carlos Mocholí 36d180e532
Refactor base profilers 3/5 (#6621)
Co-authored-by: tchaton <thomas@grid.ai>
2021-03-23 10:07:35 +00:00
Jirka Borovec f93414d085
Prune metrics: regression 9/n (#6637)
* psnr

* r2score

* ssim

* chlog
2021-03-23 10:01:25 +00:00
Jirka Borovec efce2b7777
Prune metrics: regression 8/n (#6636)
* explained_variance

* tests

* mean_absolute_error

* mean_squared_error

* mean_relative_error

* mean_squared_log_error

* chlog
2021-03-23 09:35:51 +01:00
Jirka Borovec 8cd75a4dd5
fix comparing versions (#6434)
* fix comparing versions

* chlog

* .

* ...

* datasets
2021-03-23 07:51:45 +00:00
thomas chaton 2064ece582
[refactor] Add setup to profilers + _run_stage_setup to trainer 2/5 (#6633)
* add setup

* update

* updates on comment

* Minor changes

* Extra import

* Docs

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
2021-03-22 14:32:31 -04:00
camruta e2e1de0fb7
Add teardown method to BaseProfiler. (#6370)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-03-22 11:49:06 +00:00
Kaushik B 37f22c99ff
Add trainer.predict config validation (#6543)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-21 21:07:54 +00:00
Justus Schock 634d83134f
Add AMP for validation, prediction and testing (#6565)
* Add Tests for val and test-steps

* Add native AMP

* pep8 tests

* pep8 plugin

* changelog
2021-03-20 23:15:49 +00:00
Jirka Borovec 3a56a6024e
Prune metrics: other classification 7/n (#6584)
* confusion_matrix

* iou

* f_beta

* hamming_distance

* stat_scores

* tests

* flake8

* chlog
2021-03-20 03:18:52 +05:30
Kaushik B 87c03b1038
Update Gradient Clipping for TPU Accelerator (#6576) 2021-03-20 01:02:57 +05:30
Ethan Harris 983a888f49
Fix all_gather for tpu_cores=8 (#6587) 2021-03-19 21:56:58 +05:30
Sean Naren 4e9b453854
[Fix] Move init dist connection into the setup function (#6506)
* Move connection setup into the setup function. Call setup hook after we set up the accelerator

* Added CHANGELOG.md

* fix setup order in callback test

* fix input arguments in test

* Mock distributed function, remove protection to turn into training type hook

* Remove import

* Add missing mock, ensure custom plugin does not create children process

* Skip test on windows

* Update deepspeed to init connection in setup

* Do not initialize distributed module

* Move DeepSpeed tests to special tests since dist communication is being set up

* Special the test to see if this fixes CI

* Delete accelerator connector test to see if its causing build to fail

* Delete deepspeed test

* Revert "Delete accelerator connector test to see if its causing build to fail"

This reverts commit edde60b8

* Revert "Delete deepspeed test"

This reverts commit 9d317429

* Reverse hook

* Reverse setup hooks to debug again

* Add todo so i know where i left off

* For single device move in pre_dispatch after setup function

* Add additional model to device hook if any additional parameters have been set

* See if we can enable deepspeed tests

* Revert "See if we can enable deepspeed tests"

This reverts commit b5450def

* See if this hook approach works

* Introduce new granular hooks

* Remove import, fix tpu spawn by moving the function to setup

* Added missing special test

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-18 14:33:39 -07:00
Kaushik B b606171299
Update Changelog for v1.2.4 (#6581)
* Update changelog for v1.2.4

* legacy v1.2.4

* prune duplicates from changelog

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-18 20:13:54 +00:00
Jirka Borovec 38a2119359
Prune metrics: precision & recall 6/n (#6573)
* avg precision

* precision
* recall

* curve

* tests

* chlog

* isort

* fix
2021-03-18 13:21:59 -04:00
Jirka Borovec 9e35f979ea
Prune metrics: AUC & AUROC (#6572)
* class: AUC AUROC

* func: auc auroc

* format

* tests
2021-03-18 10:38:56 +01:00
Jirka Borovec 2f6ce1ae7f
prune metric: accuracy 4/n (#6515)
* prune accuracy

* chlog

* flake8

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* wrap

* test

* test

* fix

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2021-03-17 11:37:10 +00:00
Kaushik B b190403e28
Add outputs param for `on_val/test_epoch_end` hooks (#6120)
* add outputs param for on_val/test_epoch_end hooks

* update changelog

* fix warning message

* add custom call hook

* cache logged metrics

* add args to docstrings

* use warning cache

* add utility method for param in sig check

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update docstring

* add test for eval epoch end hook

* add types and replace model ref

* add deprecation test

* fix test fx name

* add model hooks warning

* add old signature model to tests

* add clear warning cache

* support args param

* update tests

* add tests for model hooks

* code suggestions

* add signature utils

* fix pep8 issues

* fix pep8 issues

* fix outputs issue

* fix tests

* code fixes

* fix validate test

* test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-16 12:15:16 -04:00
Jirka Borovec a312219d42
Prune metric: helpers and inputs 3/n (#6547)
* _basic_input_validation

* _check_shape_and_type_consistency

* _check_num_classes_binary

* _check_num_classes_mc

* _check_num_classes_ml

* _check_top_k

* _check_classification_inputs

* _input_format_classification

* _reduce_stat_scores

* DataType

* rest

* flake8

* chlog
2021-03-16 13:54:06 +01:00
Jirka Borovec 6453091b8a
Prune metrics base classes 2/n (#6530)
* base class

* extensions

* chlog

* _stable_1d_sort

* _check_same_shape

* _input_format_classification_one_hot

* utils

* to_onehot

* select_topk

* to_categorical

* get_num_classes

* reduce

* class_reduce

* tests
2021-03-15 19:28:18 +00:00
Jirka Borovec b341b53f70
deprecate metrics pkg (#6505)
* deprecate metrics

* examples

* req

* docs

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* pep8

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
2021-03-15 14:39:38 +00:00
Luca Di Liello 5d73fbbd81
Mean Average Precision metric for Information Retrieval (1/5) (#5032)
* init information retrieval metrics

* changed retrieval metrics names, expanded arguments and fixed typo

* added 'Retrieval' prefix to metrics and fixed conflict with already-present 'average_precision' file

* improved code formatting

* pep8 code compatibility

* features/implemented new Mean Average Precision metrics for Information Retrieval + doc

* fixed pep8 compatibility

* removed threshold parameter and fixed typo on types in RetrievalMAP and improved doc

* improved doc, put first class-specific args in RetrievalMetric and transformed RetrievalMetric in abstract class

* implemented tests for functional and class metric. fixed typo when input tensors are empty or when all targets are False

* fixed typos in doc and changed torch.true_divide to torch.div

* fixed typos pep8 compatibility

* fixed types in long division in ir_average_precision and example in mean_average_precision

* RetrievalMetric states are not lists and _metric method accepts predictions and targets for easier extension

* updated CHANGELOG file

* added '# noqa: F401' flag to not used imports

* added double space before '# noqa: F401' flag

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* change get_mini_groups in get_group_indexes

* added checks on target inputs

* minor refactoring for code cleanness

* split tests over exception raising in separate function && refactored test code into multiple functions

* fixed pep8 compatibility

* implemented suggestions of @SkafteNicki

* fixed imports for isort and added types annotations to functions in test_map.py

* isort on test_map and fixed typing

* isort on retrieval and on __init__.py and utils.py in metrics package

* fixed typo in pytorch_lightning/metrics/__init__.py regarding code style

* fixed yapf compatibility

* fixed yapf compatibility

* fixed typo in doc

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2021-03-15 12:18:43 +01:00
Adrian Wälchli 02fa32b7bc
Handle torch.jit scripted modules in layer summary (#6511) 2021-03-15 03:17:42 +01:00
thomas chaton 0544efd453
[bug] Update broadcast + reduce decision [ModelCheckpoint] (#6410)
* resolve bug

* update

* update changelog

* update PR

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* add todo

* resolve issues

* resolve flake8

* update

* add coverage for reduce

* wip

* restore back to broadcast

* remove test.py

* resolve flake8

* update

* check world size

* resolve test

* update

* use pytorch version when defined

* update on comments

* update on comments

* flake8

* resolve bugs

* Update CHANGELOG.md

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* update

* update

* update

* update

* remove test

* update

* resolve flake8

* update

* update

* update

* proxy

* update

* update

* resolve typo

* prune

* update parallel

* update

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-14 17:14:27 +00:00
Adrian Wälchli b2bcad1132
Fix tuner.scale_batch_size not finding batch size attribute when using datamodule (#5968) 2021-03-14 09:16:19 +01:00
ananthsub cea170e011
[feat] Support iteration-based checkpointing in model checkpoint callback (#6146)
* Update model_checkpoint.py

* add tests

* Update model_checkpoint.py

* Update test_model_checkpoint.py

* fix tests

* every_n_batches

* Update test_model_checkpoint.py

* defaults

* rm tests

* Update model_checkpoint.py

* Update test_model_checkpoint.py

* Prune deprecated metrics for 1.3 (#6161)

* prune deprecated metrics for 1.3

* isort / yapf

* Update model_checkpoint.py

* add tests

* defaults

* Update CHANGELOG.md

* pre-commit

* Update model_checkpoint.py

* update defaults

* Update test_remove_1-5.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* fix tests

* Update test_model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update test_model_checkpoint.py

* ckpt-callback

* Update test_model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* validation-end

* Update model_checkpoint.py

* Update test_model_checkpoint.py

* Update test_model_checkpoint.py

* Update test_model_checkpoint.py

* Update test_model_checkpoint.py

* clarify-names

- Make names explicit as to which hooks they apply to
- Use step instead of batch for consistency with global step

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* mutual-exclusive

Make every_n_train_steps and every_n_val_epochs mutually exclusive

* fix-default-0

* Update CHANGELOG.md

* formatting

* make-private

make attributes private to the class

* rebase

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-11 14:44:29 -08:00
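The mutual-exclusion rule described in the commit above (#6146) could be checked along these lines. This is a sketch under assumptions: `check_checkpoint_triggers` is a hypothetical helper, the use of `ValueError` is illustrative, and treating 0 as "disabled" is an assumption. The commit itself only states that `every_n_train_steps` and `every_n_val_epochs` are mutually exclusive.

```python
def check_checkpoint_triggers(every_n_train_steps: int = 0,
                              every_n_val_epochs: int = 0) -> None:
    """Hypothetical validator: at most one trigger may be enabled.

    Assumption: a value of 0 means the trigger is disabled.
    """
    if every_n_train_steps < 0 or every_n_val_epochs < 0:
        raise ValueError("Trigger intervals must be non-negative.")
    if every_n_train_steps > 0 and every_n_val_epochs > 0:
        raise ValueError(
            "every_n_train_steps and every_n_val_epochs are mutually "
            "exclusive; enable only one of them."
        )
```

Calling `check_checkpoint_triggers(every_n_train_steps=100)` passes silently, while enabling both triggers raises an error.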
Rohit Gupta c53edce1a1
Disable batch transfer in DP mode (#6098)
* add exceptions and test

* hook

* fix

* clean up

* clean up

* regex

* regex

* docs

* rev

* comment and docs

* chlog

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Apply suggestions from code review

Co-authored-by: chaton <thomas@grid.ai>

* Monkey-patch device count

* docs

* pep

* api_change

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2021-03-11 10:51:10 -05:00
Max Frei 2ecda5df52
Allow user to disable the automatic formatting of checkpoint file names. (#6277)
* cleaning SWA (#6259)

* rename

* if

* test

* chlog

* Remove opt from manual_backward in docs (#6267)

* switch agents pool (#6270)

* Allow user to disable the automatic formatting of checkpoint file names.

* Added changelog entry.

* Made flake8 happy.

* Applied review suggestion: quotes for special characters in docstring

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Fixed example in docstring.

* Fixed syntax error in docstring.

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-11 16:40:23 +08:00
Elia Cereda f4cc7451a9
Add Trainer.validate(…) method to run one validation epoch (#4948)
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-11 03:46:37 +01:00
Sean Naren 1c013b43e0
[Fix] Ensure we set the default device before initializing deepspeed (#6460)
* Ensure we set the default device before initializing deepspeed

* Add CHANGELOG.md

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
2021-03-10 16:29:37 +00:00
thomas chaton 7d4e74c745
[bug] All_gather support tensor on cpu (#6416)
* add test

* update changelog

* update

* rename function
2021-03-10 14:19:07 +00:00
Sean Naren c81b2a8189
Set find unused parameters to True by default to fix breaking compatibility (#6438)
* Set find unused parameters to True by default to fix breaking models, add suggestion to re-enable

* Add changelog
2021-03-10 10:40:24 +01:00
Adrian Wälchli 615b2f7363
Improve DummyLogger (#6398)
* fix dummy logger

* docs

* update docs

* add changelog

* add none return annotation

* return empty string for name, version
2021-03-09 23:18:38 +00:00
thomas chaton 30d649b9a7
[changelog] Update Changelog on release v1.2.3 (#6444)
* update changelog

* legacy 1.2.3

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-09 15:17:36 -08:00
Adrian Wälchli fc6d402733
fix logger creating directory structure too early in DDP (#6380)
* fix

* add simple test

* fix imports

* add changelog

* tighter test with on_fit_start hook closer to the dispatch call

* move class inside test function

* add a comment
2021-03-09 09:49:59 +00:00
David Palzer 523c59bfdd
fixed bug where tuner would not tune lr if also tuning batch_size (#4688)
* fixed bug where tuner would not tune lr if also tuning batch_size

* added a '+1' to computing the smoothed loss. This maintains the behavior for the smoothed loss as before the bug fix

* pep8 fix

* add changelog

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-09 08:30:06 +08:00
Carlos Mocholí efd272a3ca
Pass {fit,validate,test,predict} to setup() and teardown() (#6386) 2021-03-08 15:27:07 +01:00
chizuchizu a6c98c4e49
Fix AttributeError: 'NoneType' object has no attribute 'finalize' on TPU (#6221)
* Fix bug

Fix AttributeError: 'NoneType' object has no attribute 'finalize'

* Update CHANGELOG.md

* deleted a period

* Update CHANGELOG.md

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* Update CHANGELOG.md

* Update pytorch_lightning/plugins/training_type/tpu_spawn.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
2021-03-08 02:11:07 +00:00
Adrian Wälchli 718074b99a
Fix trainer not resetting lightning_optimizers (#6372)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-08 09:58:03 +08:00
Carlos Mocholí 826375effe
Fix ModelCheckpoint(monitor=None, save_last=True) not saving checkpoints (#6136)
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
2021-03-08 00:59:14 +01:00
Rohit Gupta 38a5fe7af1
Remove optimizer_idx arg in manual optimization (#6093)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
2021-03-07 08:48:50 +01:00
Rohit Gupta facfda85f1
Remove no return warning from val/test step (#6139)
* remove warning

* auto_opt

* chlog

* auto_opt

* no_warning_call

* rm old code

* add warning for predict

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-06 17:15:21 +00:00
Elia Cereda d0596fac94
Refactor RunningStage usage in advance of implementing Trainer.validate() (#4945)
* Update code

Co-authored-by: EliaCereda

* More property updates

* Move properties. Introduce trainer._fitting

* Use trainer.fitting

* Fix reset dataloaders

* Unused code

* RunningStage.SANITY_CHECKING

* Use setters

* Fix bugs

* Fix bugs

* TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}

* Fix bugs

* Fix bugs

* Fix tests

* Update CHANGELOG. Add deprecation warning. Fix tests

* Unused imports

* Optional trainer

* More deprecation. More refactoring

* Correct version

* Use properties

* Address comments

* flake8

* Missed renamings

* Typo

* is -> ==

It is recommended to use `is` for Enums since they are singletons; however, since the LightningEnum subclasses str, it's not a good idea in case a user sets the state/stage with a str

* Also for tests

* Typo

* Address @tchaton's comments

* PEP8

* Correct property

* Update CHANGELOG

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Remove called sanity check

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2021-03-06 12:40:19 +00:00
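The `is -> ==` note in the commit above (#4945) can be illustrated with a minimal sketch. `Stage` below is a hypothetical stand-in for a str-subclassing enum such as LightningEnum, not the library's actual class; it only shows why an identity check fails when a user passes a plain string.

```python
from enum import Enum


class Stage(str, Enum):
    """Hypothetical str-based enum, used only for illustration."""
    FITTING = "fit"
    TESTING = "test"


def is_fitting_by_identity(stage) -> bool:
    # Identity check: only true for the enum member itself.
    return stage is Stage.FITTING


def is_fitting_by_equality(stage) -> bool:
    # Equality check: also true for a plain string with the same value.
    return stage == Stage.FITTING
```

With these definitions, a user-supplied `"fit"` passes the equality check but not the identity check, which is the situation the commit message warns about.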
thomas chaton 2ec67a48b3
[bug] Fix Pytorch profiler with emit_nvtx (#6260)
* resolve bug

* update changelog

* Update tests/trainer/test_trainer.py

* Update pytorch_lightning/profiler/profilers.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* resolve comments

* resolve flake8

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2021-03-05 21:12:03 +01:00
Kaushik B b6aa350fb2
Update changelog for v1.2.2 (#6325)
* update changelog for v1.2.2

* ckpr 1.2.2

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
2021-03-05 15:57:50 +00:00
Adrian Wälchli ec8d46e02b
introduce default cluster environment for lightning-specific ddp (#5915)
* handle distributed_sampler_kwargs

* move emptying cache to accelerator

* fix a few tests

* restoring the result from subprocess

* fix queue.get() order for results

* add missing "block_backward_sync" context manager

* add missing "block_backward_sync" context manager

* fix sync_batchnorm

* fix supported gpu-ids for tuple

* fix clip gradients and inf recursion

* accelerator selection: added cluster_environment plugin

* fix torchelastic test

* fix reduce early stopping decision for DDP

* fix tests: callbacks, conversion to lightning optimizer

* fix lightning optimizer does not pickle

* fix setting benchmark and deterministic option

* fix slurm amp test

* fix prepare_data test and determine node_rank

* fix retrieving last path when testing

* remove obsolete plugin argument

* fix test: test_trainer_config

* fix torchscript tests

* fix trainer.model access

* move properties

* fix test_transfer_batch_hook

* fix auto_select_gpus

* fix omegaconf test

* fix test that needs to simulate slurm ddp

* add horovod plugin

* fix test with named arguments

* clean up whitespace

* fix datamodules test

* remove old accelerators

* fix naming

* move old plugins

* move to plugins

* create precision subpackage

* create training_type subpackage

* fix all new import errors

* fix wrong arguments order passed to test

* fix LR finder

* Added sharded training type and amp plugin

* Move clip grad to precision plugin

* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically

* Fix import issue, attempting to fix tests

* Fix initial test

* Reflect hook logic from master, should wrap model after move to device

* Optional state consolidation, since master has optimizers not wrapped

* change attribute for instance test

* reset optimizers

optimizers are not used in main process, so state would be wrong.

* legacy

* imports in accel

* legacy2

* trainer imports

* fix import errors after rebase

* move hook to new setup location

* provide unwrapping logic

* fix trainer callback system

* added ddp2 implementation

* fix imports .legacy

* move plugins

* restore legacy

* drop test.py from root

* add tpu accelerator and plugins

* fixes

* fix lightning optimizer merge

* reset bugreportmodel

* unwrapping

* step routing forward

* model access

* unwrap

* opt

* integrate distrib_type

* sync changes

* sync

* fixes

* add forgotten generators

* add missing logic

* update

* import

* missed imports

* import fixes

* isort

* mv f

* changelog

* format

* move helper to parallel plugin

* d

* add world size

* clean up

* duplicate

* activate ddp_sharded and tpu

* set nvidia flags

* remove unused colab var

* use_tpu <-> on_tpu attrs

* make some ddp_cpu and clusterplugin tests pass

* Ref/accelerator connector (#5742)

* final cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* connector cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* trainer cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* accelerator cleanup + missing logic in accelerator connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add missing changes to callbacks

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* reflect accelerator changes to lightning module

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* clean cluster envs

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* cleanup plugins

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add broadcasting

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* yapf

* remove plugin connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* plugins

* manual optimization

* update optimizer routing

* add rank to torchelastic

* fix memory mixed precision

* setstate on trainer for pickling in ddp spawn

* add predict method

* add back commented accelerator code

* adapt test for sync_batch_norm to new plugin

* fix deprecated tests

* fix ddp cpu choice when no num_processes are given

* yapf format

* skip a memory test that cannot pass anymore

* fix pickle error in spawn plugin

* x

* avoid

* x

* fix cyclic import in docs build

* add support for sharded

* update typing

* add sharded and sharded_spawn to distributed types

* make unwrap model default

* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel

* update sharded spawn to reflect changes

* update sharded to reflect changes

* Merge 1.1.5 changes

* fix merge

* fix merge

* yapf isort

* fix merge

* yapf isort

* fix indentation in test

* copy over reinit scheduler implementation from dev1.2

* fix apex tracking calls with dev_debugger

* reduce diff to dev1.2, clean up

* fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu

* sort plugin tests legacy/new

* fix error handling for amp on cpu

* fix merge

* [Feat] Resolve manual_backward (#5837)

* resolve manual_backward

* resolve flake8

* update

* resolve for ddp_spawn

* resolve flake8

* resolve flake8

* resolve flake8

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* fix tests/accelerator tests on cpu

* [BugFix] Resolve manual optimization (#5852)

* resolve manual_optimization

* update

* update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856)

* resolve a bug

* Accelerator refactor sharded rpc (#5854)

* rpc branch

* merge

* update handling of rpc

* make devices etc. Optional in RPC

* set devices etc. later if necessary

* remove devices from sequential

* make devices optional in rpc

* fix import

* uncomment everything

* fix cluster selection

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* resolve bug

* fix assert in rpc test

* resolve a test

* fix docs compilation

* accelerator refactor - fix for sharded parity test (#5866)

* fix memory issue with ddp_spawn

* x

* x

* Remove DDP2 as this does not apply

* Add missing pre optimizer hook to ensure lambda closure is called

* fix apex docstring

* [accelerator][BugFix] Resolve some test for 1 gpu (#5863)

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* update

* resolve flake8

* update

* update

* update

* update

* update

* all_gather

* update

* make plugins work, add misconfig for RPC

* update

* update

* remove breaking test

* resolve some tests

* resolve flake8

* revert to ddp_spawn

Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>

* yapf isort

* resolve flake8

* fix apex doctests

* fix apex doctests 2

* resolve docs

* update drone

* clean env

* update

* update

* update

* update

* merge

* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881)

* Fix RPC related tests, clean out old API, update for new accelerator API

* Move tests out of legacy folder, update paths and names

* Update test_remove_1-4.py

* Expose properties for tpu cores/gpus/num_gpus

* Add root GPU property

* Move properties to properties.py

* move tests that were previously in drone

* Fix root GPU property (#5908)

* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator

* Add missing tests back

* fix best model path transfer when no checkpoint callback available

* Fix setup hook order [wip] (#5858)

* Call trainer setup hook before accelerator setup

* Add test case

* add new test

* typo

* fix callback order in test

Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* rename ddp sequential -> rpc sequential for special test

* revert

* fix stupid merge problem

* abstract the cluster plugins

* default plugin

* integrate default environment

* fix property

* adapt tests

* adjust test

* fix world size access

* base cluster env

* revert rebase errors

* revert rebase errors

* missing import

* revert unrelated change

* remove unused cluster local rank

* remove unrelated changes

* fix unrelated changes

* fix pep8

* remove unused var

* reset permissions

* yapf

* test default environment

* test torchelastic environment

* world size as int

* tests for slurm environment

* changelog

* test comments

* remove unintended change

* keep master port fixed after it is generated

* test random master port

* yapf

* add missing default environment

* move helper function

* rename default environment

* rename

* rename

* yapf

* Update pytorch_lightning/plugins/environments/lightning_environment.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* spawn -> create

Co-authored-by: justusschock <justus.schock@posteo.de>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-05 01:47:29 +00:00
thomas chaton 248a8e8b32
[bugfix] Perform reduction for dict in training_step and DP (#6324)
* fix

* update

* update

* add changelog

* Update CHANGELOG.md

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update tests/accelerators/test_dp.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* update changelog

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
2021-03-04 23:10:52 +00:00