lightning

Commit Graph

Author	SHA1	Message	Date
Ethan Harris	741c452551	Fix disabled grads after call to predict (#6657 )	2021-03-23 23:07:48 +01:00
thomas chaton	fd5cb7fcc3	Add PyTorch 1.8 Profiler 5/5 (#6618 ) * Refactor profilers * Update PassThrough * WIP - This is broken and will change * Update pytorch_lightning/profiler/pytorch.py Co-authored-by: thomas chaton <thomas@grid.ai> * resolve tests * resolve tests * find output * try something * update * add support for test and predict * update * update * use getattr * test * test * update * tests * update * update * update * update * update * remove file * update * update * update * update * update * test * update# * update * update tests * update * add suport for 1.8 * rename records * add support for 1.8 * update * resolve flake8 * resolve test * Refactor basic profilers * Fixes * Unused import * Introduce setup * Profile on all ranks. Print to stdout on 0 * Introduce dirpath + filename * CHANGELOG * Add tests. Address comments * add `on_run_stage_setup` * add on_run_stage_setup function * update * add test for RegisterRecordFunction * update lightnng flow direction * move variable to private * remove trace * Undo code that should be in 3/4 * Multi-stage multi-rank * 2/5 changes * Pass stage in __del__ * Remove TODOs * Describe on_evaluation_end. Add tests * Typo * Address comments * deepcopy tests * Advanced teardown * Fix teardown test * Fix tests * Minor change * Update CHANGELOG.md * Fix test * Quick fixes * Fix 6522 * resolve ddp tests * resolve tests * resolve some tests * update tests * resolve tests * update * resolve tests * resolve some tests * Missed fixes from 3/5 * Fixes * resolve some tests * resolve test for 1.7.1 * Broken refactor * Missed stage * Minor changes * resolve tests * Update CHANGELOG * resolve bug * remove print * Typo * Cleanup * resolve ddp test * remove barrier * update profiler * update * Smaller model * update * resolve tests * update * Minor changes. CHANGELOG * Minimize diff * update to 1.8.1 * RunIf. Extra code. Check segfault * resolve tests * Typo. 
Bad merge * Fixing a bad merge * replace for kineto * Update pytorch_lightning/profiler/pytorch.py Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> * Update pytorch_lightning/profiler/pytorch.py Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> * Minor changes * Bad merge * Use lists for flexibility * Use sets * predict_step * Ananth's suggestion * update * Docs * Update pl_examples/basic_examples/profiler_example.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * update example * update example Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-03-23 20:43:21 +00:00
Carlos Mocholí	51b10f78f4	Refactor PyTorch profiler 4/5 (#6349 ) Co-authored-by: thomas chaton <thomas@grid.ai>	2021-03-23 18:13:29 +01:00
Jirka Borovec	a74909affa	prune metrics: info retrieval (#6649 )	2021-03-23 15:05:32 +00:00
Carlos Mocholí	36d180e532	Refactor base profilers 3/5 (#6621 ) Co-authored-by: tchaton <thomas@grid.ai>	2021-03-23 10:07:35 +00:00
Jirka Borovec	f93414d085	Prune metyrics: regression 9/n (#6637 ) * psnr * r2score * ssim * chlog	2021-03-23 10:01:25 +00:00
Jirka Borovec	efce2b7777	Prune metrics: regression 8/n (#6636 ) * explained_variance * tests * mean_absolute_error * mean_squared_error * mean_relative_error * mean_squared_log_error * chlog	2021-03-23 09:35:51 +01:00
Jirka Borovec	8cd75a4dd5	fix comparing versions (#6434 ) * fix comparing versions * chlog * . * ... * datasets	2021-03-23 07:51:45 +00:00
thomas chaton	2064ece582	[refactor] Add setup to profilers + _run_stage_setup to trainer 2/5 (#6633 ) * add setup * update * updates on comment * Minor changes * Extra import * Docs Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-03-22 14:32:31 -04:00
camruta	e2e1de0fb7	Add teardown method to BaseProfiler. (#6370 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>	2021-03-22 11:49:06 +00:00
Kaushik B	37f22c99ff	Add trainer.predict config validation (#6543 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-03-21 21:07:54 +00:00
Justus Schock	634d83134f	Add AMP for validation, prediction and testing (#6565 ) * Add Tests for val and test-steps * Add native AMP * pep8 tests * pep8 plugin * changelog	2021-03-20 23:15:49 +00:00
Jirka Borovec	3a56a6024e	Prune metrics: other classification 7/n (#6584 ) * confusion_matrix * iou * f_beta * hamming_distance * stat_scores * tests * flake8 * chlog	2021-03-20 03:18:52 +05:30
Kaushik B	87c03b1038	Update Gradient Clipping for TPU Accelerator (#6576 )	2021-03-20 01:02:57 +05:30
Ethan Harris	983a888f49	Fix all_gather for tpu_cores=8 (#6587 )	2021-03-19 21:56:58 +05:30
Sean Naren	4e9b453854	[Fix] Move init dist connection into the setup function (#6506 ) * Move connection setup into the setup function. Call setup hook after we set up the accelerator * Added CHANGELOG.md * fix setup order in callback test * fix input arguments in test * Mock distributed function, remove protection to turn into training type hook * Remove import * Add missing mock, ensure custom plugin does not create children process * Skip test on windows * Update deepspeed to init connection in setup * Do not initialize distributed module * Move DeepSpeed tests to special tests since dist communication is being set up * Special the test to see if this fixes CI * Delete accelerator connector test to see if its causing build to fail * Delete deepspeed test * Revert "Delete accelerator connector test to see if its causing build to fail" This reverts commit `edde60b8` * Revert "Delete deepspeed test" This reverts commit `9d317429` * Reverse hook * Reverse setup hooks to debug again * Add todo so i know where i left off * For single device move in pre_dispatch after setup function * Add additional model to device hook if any additional parameters have been set * See if we can enable deepspeed tests * Revert "See if we can enable deepspeed tests" This reverts commit `b5450def` * See if this hook approach works * Introduce new granular hooks * Remove import, fix tpu spawn by moving the function to setup * Added missing special test Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-03-18 14:33:39 -07:00
Kaushik B	b606171299	Update Changelog for v1.2.4 (#6581 ) * Update changelog for v1.2.4 * lagacy v1.2.4 * prune duplicates from changelog Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-18 20:13:54 +00:00
Jirka Borovec	38a2119359	Prune metrics: precision & recall 6/n (#6573 ) * avg precision * precision * recall * curve * tests * chlog * isort * fix	2021-03-18 13:21:59 -04:00
Jirka Borovec	9e35f979ea	Prune metrics: AUC & AUROC (#6572 ) * class: AUC AUROC * func: auc auroc * format * tests	2021-03-18 10:38:56 +01:00
Jirka Borovec	2f6ce1ae7f	prune metric: accuracy 4/n (#6515 ) * prune accuracy * chlog * flake8 * Apply suggestions from code review Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * wrap * test * test * fix Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>	2021-03-17 11:37:10 +00:00
Kaushik B	b190403e28	Add outputs param for `on_val/test_epoch_end` hooks (#6120 ) * add outputs param for on_val/test_epoch_end hooks * update changelog * fix warning message * add custom call hook * cache logged metrics * add args to docstrings * use warning cache * add utility method for param in sig check * Update CHANGELOG.md Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * update docstring * add test for eval epoch end hook * add types and replace model ref * add deprecation test * fix test fx name * add model hooks warning * add old signature model to tests * add clear warning cache * sopport args param * update tests * add tests for model hooks * code suggestions * add signature utils * fix pep8 issues * fix pep8 issues * fix outputs issue * fix tests * code fixes * fix validate test * test Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-03-16 12:15:16 -04:00
Jirka Borovec	a312219d42	Prune metric: helpers and inputs 3/n (#6547 ) * _basic_input_validation * _check_shape_and_type_consistency * _check_num_classes_binary * _check_num_classes_mc * _check_num_classes_ml * _check_top_k * _check_classification_inputs * _input_format_classification * _reduce_stat_scores * DataType * rest * flake8 * chlog	2021-03-16 13:54:06 +01:00
Jirka Borovec	6453091b8a	Prune metrics base classes 2/n (#6530 ) * base class * extensions * chlog * _stable_1d_sort * _check_same_shape * _input_format_classification_one_hot * utils * to_onehot * select_topk * to_categorical * get_num_classes * reduce * class_reduce * tests	2021-03-15 19:28:18 +00:00
Jirka Borovec	b341b53f70	deprecate metrics pkg (#6505 ) * deprecate metrics * examples * req * docs * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * pep8 Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>	2021-03-15 14:39:38 +00:00
Luca Di Liello	5d73fbbd81	Mean Average Precision metric for Information Retrieval (1/5) (#5032 ) * init information retrieval metrics * changed retrieval metrics names, expanded arguments and fixed typo * added 'Retrieval' prefix to metrics and fixed conflict with already-present 'average_precision' file * improved code formatting * pep8 code compatibility * features/implemented new Mean Average Precision metrics for Information Retrieval + doc * fixed pep8 compatibility * removed threshold parameter and fixed typo on types in RetrievalMAP and improved doc * improved doc, put first class-specific args in RetrievalMetric and transformed RetrievalMetric in abstract class * implemented tests for functional and class metric. fixed typo when input tensors are empty or when all targets are False * fixed typos in doc and changed torch.true_divide to torch.div * fixed typos pep8 compatibility * fixed types in long division in ir_average_precision and example in mean_average_precision * RetrievalMetric states are not lists and _metric method accepts predictions and targets for easier extension * updated CHANGELOG file * added '# noqa: F401' flag to not used imports * added double space before '# noqa: F401' flag * Update CHANGELOG.md Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * change get_mini_groups in get_group_indexes * added checks on target inputs * minor refactoring for code cleanness * split tests over exception raising in separate function && refactored test code into multiple functions * fixed pep8 compatibility * implemented suggestions of @SkafteNicki * fixed imports for isort and added types annontations to functions in test_map.py * isort on test_map and fixed typing * isort on retrieval and on __init__.py and utils.py in metrics package * fixed typo in pytorch_lightning/metrics/__init__.py regarding code style * fixed yapf compatibility * fixed yapf compatibility * fixed typo in doc Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>	2021-03-15 12:18:43 +01:00
Adrian Wälchli	02fa32b7bc	Handle torch.jit scripted modules in layer summary (#6511 )	2021-03-15 03:17:42 +01:00
thomas chaton	0544efd453	[bug] Update broadcast + reduce decision ModelCheckpoint] (#6410 ) * resolve bug * update * update changelog * update PR * Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * add todo * resolve issues * resolve flake8 * update * add coverage for reduce * wip * restore back to brodbact * remove test.py * resolve flake8 * update * check world size * resolve test * update * use pytorch version when defined * update on comments * update on comments * flake8 * resolve bugs * Update CHANGELOG.md Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * update * update * update * update * remove test * update * resolve flake8 * update * update * update * proxy * update * update * resolve typo * prune * update parallel * update Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-03-14 17:14:27 +00:00
Adrian Wälchli	b2bcad1132	Fix tuner.scale_batch_size not finding batch size attribute when using datamodule (#5968 )	2021-03-14 09:16:19 +01:00
ananthsub	cea170e011	[feat] Support iteration-based checkpointing in model checkpoint callback (#6146 ) * Update model_checkpoint.py * add tests * Update model_checkpoint.py * Update test_model_checkpoint.py * fix tests * every_n_batches * Update test_model_checkpoint.py * defaults * rm tests * Update model_checkpoint.py * Update test_model_checkpoint.py * Prune deprecated metrics for 1.3 (#6161) * prune deprecated metrics for 1.3 * isort / yapf * Update model_checkpoint.py * add tests * defaults * Update CHANGELOG.md * pre-commit * Update model_checkpoint.py * update defaults * Update test_remove_1-5.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * fix tests * Update test_model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update test_model_checkpoint.py * ckpt-callback * Update test_model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * validation-end * Update model_checkpoint.py * Update test_model_checkpoint.py * Update test_model_checkpoint.py * Update test_model_checkpoint.py * Update test_model_checkpoint.py * clarify-names - Make names explicit as to which hooks they apply to - Use step instead of batch for consistency with global step * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * mutual-exclusive Make every_n_train_steps and every_n_val_epochs mutually exclusive * fix-default-0 * Update CHANGELOG.md * formatting * make-private make attributes private to the class * rebase Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-03-11 14:44:29 -08:00
Rohit Gupta	c53edce1a1	Disable batch transfer in DP mode (#6098 ) * add exceptions and test * hook * fix * clean up * clean up * regex * regex * docs * rev * comment and docs * chlog * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Apply suggestions from code review Co-authored-by: chaton <thomas@grid.ai> * Monkey-patch device count * docs * pep * api_change Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai>	2021-03-11 10:51:10 -05:00
Max Frei	2ecda5df52	Allow user to disable the automatic formatting of checkpoint file names. (#6277 ) * cleaning SWA (#6259) * rename * if * test * chlog * Remove opt from manual_backward in docs (#6267) * switch agents pool (#6270) * Allow user to disable the automatic formatting of checkpoint file names. * Added changelog entry. * Made flake8 happy. * Applied review suggestion: quotes for special characters in docstring Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Fixed example in docstring. * Fixed syntax error in docstring. Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: thomas chaton <thomas@grid.ai> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-03-11 16:40:23 +08:00
Elia Cereda	f4cc7451a9	Add Trainer.validate(…) method to run one validation epoch (#4948 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-03-11 03:46:37 +01:00
Sean Naren	1c013b43e0	[Fix] Ensure we set the default device before initializing deepspeed (#6460 ) * Ensure we set the default device before initializing deepspeed * Add CHANGELOG.md * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>	2021-03-10 16:29:37 +00:00
thomas chaton	7d4e74c745	[bug] All_gather support tensor on cpu (#6416 ) * add test * update changelog * update * rename function	2021-03-10 14:19:07 +00:00
Sean Naren	c81b2a8189	Set find unused parameters to True by default to fix breaking compatibility (#6438 ) * Set find unused parameters to True by default to fix breaking models, add suggestion to re-enable * Add changelog	2021-03-10 10:40:24 +01:00
Adrian Wälchli	615b2f7363	Improve DummyLogger (#6398 ) * fix dummy logger * docs * update docs * add changelog * add none return annotation * return empty string for name, version	2021-03-09 23:18:38 +00:00
thomas chaton	30d649b9a7	[changelog] Update Changelog on release v1.2.3 (#6444 ) * update changelog * legacy 1.2.3 Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-09 15:17:36 -08:00
Adrian Wälchli	fc6d402733	fix logger creating directory structure too early in DDP (#6380 ) * fix * add simple test * fix imports * add changelog * tighter test with on_fit_start hook closer to the dispatch call * move class inside test f unction * add a comment	2021-03-09 09:49:59 +00:00
David Palzer	523c59bfdd	fixed bug where tuner would not tune lr if also tuning batch_size (#4688 ) * fixed bug where tuner would not tune lr if also tuning batch_size * added a '+1' to computing the smoothed loss. This maintains the behavior for the smoothed loss as before the bug fix * pep8 fix * add changelog Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-03-09 08:30:06 +08:00
Carlos Mocholí	efd272a3ca	Pass {fit,validate,test,predict} to setup() and teardown() (#6386 )	2021-03-08 15:27:07 +01:00
chizuchizu	a6c98c4e49	Fix AttributeError: 'NoneType' object has no attribute 'finalize' on TPU (#6221 ) * Fix bug Fix AttributeError: 'NoneType' object has no attribute 'finalize' * Update CHANGELOG.md * deleted a period * Update CHANGELOG.md Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> * Update CHANGELOG.md * Update pytorch_lightning/plugins/training_type/tpu_spawn.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>	2021-03-08 02:11:07 +00:00
Adrian Wälchli	718074b99a	Fix trainer not resetting lightning_optimizers (#6372 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-03-08 09:58:03 +08:00
Carlos Mocholí	826375effe	Fix ModelCheckpoint(monitor=None, save_last=True) not saving checkpoints (#6136 ) Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>	2021-03-08 00:59:14 +01:00
Rohit Gupta	38a5fe7af1	Remove optimizer_idx arg in manual optimization (#6093 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai>	2021-03-07 08:48:50 +01:00
Rohit Gupta	facfda85f1	Remove no return warning from val/test step (#6139 ) * remove warning * auto_opt * chlog * auto_opt * no_warning_call * rm old code * add warning for predict * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-03-06 17:15:21 +00:00
Elia Cereda	d0596fac94	Refactor RunningStage usage in advance of implementing Trainer.validate() (#4945 ) * Update code Co-authored-by: EliaCereda * More property updates * Move properties. Introduce trainer._fitting * Use trainer.fitting * Fix reset dataloaders * Unused code * RunningStage.SANITY_CHECKING * Use setters * Fix bugs * Fix bugs * TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING} * Fix bugs * Fix bugs * Fix tests * Update CHANGELOG. Add deprecation warning. Fix tests * Unused imports * Optional trainer * More deprecation. More refactoring * Correct version * Use properties * Address comments * flake8 * Missed renamings * Typo * is -> == It is recommended to use for Enums since they are singletons, however, since the LightningEnum subclasses str, it's not a good idea in case a user sets the state/stage with a str * Also for tests * Typo * Address @tchaton's comments * PEP8 * Correct property * Update CHANGELOG * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/trainer/trainer.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Remove called sanity check Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-03-06 12:40:19 +00:00
thomas chaton	2ec67a48b3	[bug] Fix Pytorch profiler with emit_nvtx (#6260 ) * resolve bug * update changelog * Update tests/trainer/test_trainer.py * Update pytorch_lightning/profiler/profilers.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * resolve comments * resolve flake8 Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-03-05 21:12:03 +01:00
Kaushik B	b6aa350fb2	Update changelog for v1.2.2 (#6325 ) * update changelog for v1.2.2 * ckpr 1.2.2 Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-05 15:57:50 +00:00
Adrian Wälchli	ec8d46e02b	introduce default cluster environment for lightning-specific ddp (#5915 ) * handle distributed_sampler_kwargs * move emptying cache to accelertor * fix a few tests * restoring the result from subprocess * fix queue.get() order for results * add missing "block_backward_sync" context manager * add missing "block_backward_sync" context manager * fix sync_batchnorm * fix supported gpu-ids for tuple * fix clip gradients and inf recursion * accelerator selection: added cluster_environment plugin * fix torchelastic test * fix reduce early stopping decision for DDP * fix tests: callbacks, conversion to lightning optimizer * fix lightning optimizer does not pickle * fix setting benchmark and deterministic option * fix slurm amp test * fix prepare_data test and determine node_rank * fix retrieving last path when testing * remove obsolete plugin argument * fix test: test_trainer_config * fix torchscript tests * fix trainer.model access * move properties * fix test_transfer_batch_hook * fix auto_select_gpus * fix omegaconf test * fix test that needs to simulate slurm ddp * add horovod plugin * fix test with named arguments * clean up whitespace * fix datamodules test * remove old accelerators * fix naming * move old plugins * move to plugins * create precision subpackage * create training_type subpackage * fix all new import errors * fix wrong arguments order passed to test * fix LR finder * Added sharded training type and amp plugin * Move clip grad to precision plugin * Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically * Fix import issue, attempting to fix tests * Fix initial test * Reflect hook logic from master, should wrap model after move to device * Optional state consolidation, since master has optimizers not wrapped * change attribute for instance test * reset optimizers optimizers are not used in main process, so state would be wrong. 
* legacy * imports in accel * legacy2 * trainer imports * fix import errors after rebase * move hook to new setup location * provide unwrapping logic * fix trainer callback system * added ddp2 implementation * fix imports .legacy * move plugins * restore legacy * drop test.py from root * add tpu accelerator and plugins * fixes * fix lightning optimizer merge * reset bugreportmodel * unwrapping * step routing forward * model access * unwrap * opt * integrate distrib_type * sync changes * sync * fixes * add forgotten generators * add missing logic * update * import * missed imports * import fixes * isort * mv f * changelog * format * move helper to parallel plugin * d * add world size * clean up * duplicate * activate ddp_sharded and tpu * set nvidia flags * remove unused colab var * use_tpu <-> on_tpu attrs * make some ddp_cpu and clusterplugin tests pass * Ref/accelerator connector (#5742) * final cleanup Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * connector cleanup Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * trainer cleanup Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * accelerator cleanup + missing logic in accelerator connector Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * add missing changes to callbacks Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * reflect accelerator changes to lightning module Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * clean cluster envs Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * cleanup plugins Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * add broadcasting Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * yapf * remove plugin connector Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * plugins * manual optimization * update optimizer routing * add rank to torchelastic * fix memory mixed precision * setstate on trainer for pickling in ddp spawn * add predict method * add back commented accelerator code * adapt 
test for sync_batch_norm to new plugin * fix deprecated tests * fix ddp cpu choice when no num_processes are given * yapf format * skip a memory test that cannot pass anymore * fix pickle error in spawn plugin * x * avoid * x * fix cyclic import in docs build * add support for sharded * update typing * add sharded and sharded_spawn to distributed types * make unwrap model default * refactor LightningShardedDataParallel similar to LightningDistributedDataParallel * update sharded spawn to reflect changes * update sharded to reflect changes * Merge 1.1.5 changes * fix merge * fix merge * yapf isort * fix merge * yapf isort * fix indentation in test * copy over reinit scheduler implementation from dev1.2 * fix apex tracking calls with dev_debugger * reduce diff to dev1.2, clean up * fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu * sort plugin tests legacy/new * fix error handling for amp on cpu * fix merge fix merge fix merge * [Feat] Resolve manual_backward (#5837) * resolve manual_backward * resolve flake8 * update * resolve for ddp_spawn * resolve flake8 * resolve flake8 * resolve flake8 Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> * fix tests/accelerator tests on cpu * [BugFix] Resolve manual optimization (#5852) * resolve manual_optimization * update * update Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> * Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856) * resovle a bug * Accelerator refactor sharded rpc (#5854) * rpc branch * merge * update handling of rpc * make devices etc. Optional in RPC * set devices etc. 
later if necessary * remove devices from sequential * make devices optional in rpc * fix import * uncomment everything * fix cluster selection Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> * resolve bug * fix assert in rpc test * resolve a test * fix docs compilation * accelerator refactor - fix for sharded parity test (#5866) * fix memory issue with ddp_spawn * x x x x x x x x x * x * Remove DDP2 as this does not apply * Add missing pre optimizer hook to ensure lambda closure is called * fix apex docstring * [accelerator][BugFix] Resolve some test for 1 gpu (#5863) * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * update * resolve flake8 * update * update * update * update * update * all_gather * update * make plugins work, add misconfig for RPC * update * update * remove breaking test * resolve some tests * resolve flake8 * revert to ddp_spawn Co-authored-by: root <root@ip-172-31-88-60.ec2.internal> Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de> * yapf isort * resolve flake8 * fix apex doctests * fix apex doctests 2 * resolve docs * update drone * clean env * update * update * update * update * merge * Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881) * Fix RPC related tests, clean out old API, update for new accelerator API * Move tests out of legacy folder, update paths and names * Update test_remove_1-4.py * Expose properties for tpu cores/gpus/num_gpus * Add root GPU property * Move properties to properties.py * move tests that were previously in drone * Fix root GPU property (#5908) * Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly 
to set GPU accelerator * Add missing tests back * fix best model path transfer when no checkpoint callback available * Fix setup hook order [wip] (#5858) * Call trainer setup hook before accelerator setup * Add test case * add new test * typo * fix callback order in test Co-authored-by: tchaton <thomas@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * rename ddp sequential -> rpc sequential for special test * revert * fix stupid merge problem * abstract the cluster plugins * default plugin * integrate default environment * fix property * adapt tests * adjust test * fix world size access * base cluster env * revert rebase errors * revert rebase errors * missing import * revert unrelated change * remove unused cluster local rank * remove unrelated changes * fix unrelated changes * fix pep8 * remove unused var * reset permissions * ypaf * test default environment * test torchelastic environment * world size as int * tests for slurm environment * changelog * test comments * remove unintended change * keep master port fixed after it is generated * test random master port * yapf * add missing default environment * move helper function * rename default environment * rename * rename * yapf * Update pytorch_lightning/plugins/environments/lightning_environment.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Update CHANGELOG.md Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> * spawn -> create Co-authored-by: justusschock <justus.schock@posteo.de> Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de> Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: root <root@ip-172-31-88-60.ec2.internal> Co-authored-by: Carlos 
Mocholí <carlossmocholi@gmail.com>	2021-03-05 01:47:29 +00:00
thomas chaton	248a8e8b32	[bugfix] Perform reduction for dict in training_step and DP (#6324 ) * fix * update * update * add changelog * Update CHANGELOG.md Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Update tests/accelerators/test_dp.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * update changelog Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-03-04 23:10:52 +00:00

589 Commits